Authors:
Kai Alter, Max-Planck-Institute of Cognitive Neuroscience, Leipzig (Germany)
Karsten Steinhauer, Max-Planck-Institute of Cognitive Neuroscience, Leipzig (Germany)
Angela D. Friederici, Max-Planck-Institute of Cognitive Neuroscience, Leipzig (Germany)
Paper number 258
Abstract:
In this paper we present a preliminary speech production study concerning
the prosodic realization of syntactic and information structure in German.
First, we made predictions about relative prominences and their assignment
to tonal patterns. Second, exhaustive acoustic analyses were used to test
these predictions. The data of a production experiment with seven
uninstructed normal subjects were analyzed and then compared with the data
of one patient with prosodic disorders.
Authors:
N. Amir, Center for Technological Education Holon (Israel)
S. Ron, Center for Technological Education Holon (Israel)
Paper number 199
Abstract:
In this paper we discuss a method for extracting emotional state from
the speech signal. We describe a methodology for obtaining emotive speech
and a method for identifying the emotion present in the signal. This
method is based on analyzing the signal over sliding windows and
extracting a representative parameter set. A set of basic emotions is
defined, and for each such emotion a reference point is computed. At each
instant the distance of the measured parameter set from the reference
points is calculated and used to compute a fuzzy membership index for
each emotion, which we term the "emotional index". Preliminary results
are presented, which demonstrate the discriminative abilities of this
method when applied to a number of speakers.
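A minimal sketch of the distance-to-reference computation described
above. The parameter set, the reference values, and the inverse-distance
membership function are all illustrative assumptions; the abstract does
not specify them.

import numpy as np

# Hypothetical reference points: one parameter vector per basic emotion
# (here mean F0 in Hz, energy in dB, speaking rate in syllables/s).
REFERENCES = {
    "neutral": np.array([120.0, 60.0, 4.5]),
    "anger":   np.array([180.0, 72.0, 5.5]),
    "sadness": np.array([100.0, 55.0, 3.5]),
}

def emotional_index(params, eps=1e-9):
    # Distance of the measured parameter set from each reference point,
    # turned into a fuzzy membership index via inverse-distance weighting
    # (an assumption; the paper's membership function is not given).
    dists = {e: np.linalg.norm(params - ref) for e, ref in REFERENCES.items()}
    inv = {e: 1.0 / (d + eps) for e, d in dists.items()}
    total = sum(inv.values())
    return {e: v / total for e, v in inv.items()}

# A window whose parameters sit closest to the "anger" reference.
print(emotional_index(np.array([170.0, 70.0, 5.2])))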
Authors:
Marc Schröder, ICP (France)
Véronique Aubergé, ICP (France)
Marie-Agnès Cathiard, ICP (France)
Paper number 439
Abstract:
The amusement expression is both visible and audible in speech. After
recording comparable spontaneous, acted, mechanical, reiterated and
seduction stimuli, five perceptual experiments were conducted, mainly
based on the hypothesis of prosodically controlled effects of amusement
on speech. Results show that the audio channel is partially independent
of the video channel, which performs as well as combined audio-video.
Spontaneous speech (involuntarily controlled) can be distinguished from
acted speech (voluntarily controlled), and amusement speech can be
distinguished from seduction speech.
Authors:
Matthew Aylett, Human Communication Research Centre, University of Edinburgh (U.K.)
Matthew Bull, Human Communication Research Centre (U.K.)
Paper number 825
Abstract:
The work reported in this paper was the result of the need to label
a large corpus of spontaneous, task-oriented dialogue with prosodic
prominences. A computational model using only word duration, part of
speech and a dictionary lookup of each word's canonical phonemic contents
was trained against the results of a human coder marking prominence.
Because word durations were normalised, it was possible to set a common
threshold for all members of a form class above which the lexically
stressed syllables were classed as prominent. The method used is presented
and the relative importance of duration information, phonemic contents,
syllabic context and part of speech information is explored. The automatic
coder was validated against unseen material and achieved a 58% agreement
with a human coder. Further investigation showed that three human
coders agreed no better with each other than each agreed with the computational
model. Thus, although the automatic system did not conform very well
to the performance of any one human coder, it conformed as well as
another human coder might.
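A minimal sketch of the thresholding step described above, assuming word
durations are already normalised (e.g. as z-scores); the form classes and
threshold values below are hypothetical, not the paper's.

# Hypothetical per-form-class thresholds on normalised word duration.
FORM_CLASS_THRESHOLD = {"content": 0.5, "function": 1.2}

def is_prominent(duration_z, form_class):
    # The lexically stressed syllable of a word counts as prominent when
    # the word's normalised duration exceeds its form-class threshold.
    return duration_z > FORM_CLASS_THRESHOLD[form_class]

print(is_prominent(0.8, "content"))   # True
print(is_prominent(0.8, "function"))  # False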
Authors:
JongDeuk Kim, Dept. of Telecommunication Engineering, Soongsil University (Korea)
SeongJoon Baek, School of Electrical Engineering, Seoul National University (Korea)
MyungJin Bae, Dept. of Telecommunication Engineering, Soongsil University (Korea)
Paper number 1020
Abstract:
Among speech synthesis techniques, waveform coding methods maintain the
intelligibility and naturalness of synthetic speech. In order to apply
waveform coding or hybrid coding techniques to synthesis by rule, we must
be able to alter the pitch of the synthetic speech. In this paper, we
propose a new pitch alteration method that minimizes spectral distortion
by exploiting the behavior of the cepstrum. This method splits the
spectrum of the speech signal into an excitation spectrum and a formant
spectrum and transforms the excitation spectrum into the cepstral domain.
The pitch of the excitation cepstrum is altered by zero insertion or zero
deletion, and the pitch-altered spectrum is reconstructed in the spectral
domain. In a performance test, the average spectral distortion was below
2.29%, while that of the conventional method was 2.47%.
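A minimal sketch of the excitation/formant split in the cepstral domain,
assuming a simple lifter cutoff; resampling the excitation cepstrum here
stands in for the zero insertion/deletion described above, and the cutoff
value is illustrative.

import numpy as np

def alter_pitch_spectrum(frame, factor, cutoff=32):
    # Log magnitude spectrum -> cepstrum.
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)
    cep = np.fft.irfft(log_mag)
    # Low quefrencies ~ formant envelope; the remainder ~ excitation.
    formant_cep = cep.copy()
    formant_cep[cutoff:] = 0.0
    excit_cep = cep - formant_cep
    # Stretch/compress the excitation quefrency axis to move the pitch
    # peak (a stand-in for the paper's zero insertion/deletion).
    src = np.arange(len(excit_cep)) * factor
    excit_new = np.interp(np.arange(len(excit_cep)), src, excit_cep,
                          left=0.0, right=0.0)
    # Back to the spectral domain with the original formant envelope.
    return np.exp(np.fft.rfft(formant_cep + excit_new).real)

frame = np.random.randn(512)                    # stand-in speech frame
print(alter_pitch_spectrum(frame, 1.2).shape)   # (257,)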
Authors:
Jan Buckow, University of Erlangen (Germany)
Anton Batliner, University of Erlangen (Germany)
Richard Huber, University of Erlangen (Germany)
Elmar Nöth, University of Erlangen (Germany)
Volker Warnke, University of Erlangen (Germany)
Heinrich Niemann, University of Erlangen (Germany)
Paper number 336
Abstract:
In VERBMOBIL we previously augmented the output of a word recognizer
with prosodic information. Here we present a new approach to interleaving
word recognition and prosodic processing. While we still use the output
of a word recognizer to determine phrase boundaries, we do not wait
until the end of the utterance before we start processing. Instead
we intercept chunks of word hypotheses during the forward search of
the recognizer. Neural networks and language models are used to predict
phrase boundaries. Those boundary hypotheses, in turn, are used by
the recognizer to cut the stream of incoming speech into syntactic-prosodic
phrases. Thus, incremental processing is possible. We investigate which
features are suited for incremental prosodic processing and compare them
with respect to classification performance and efficiency. We show that a
set of efficiently computable features achieves classification results
almost as good as those obtained with the previously used, computationally
more expensive features.
Authors:
Janet E. Cahn, Massachusetts Institute of Technology (USA)
Paper number 991
Abstract:
This paper links prosody to the information in the text and how it
is processed by the speaker. It describes the operation and output
of Loq, a text-to-speech implementation that includes a model of limited
attention and working memory. Attentional limitations are key. Varying
the attentional parameter in the simulations varies in turn what counts
as given and new in a text, and therefore, the intonational contours
with which it is uttered. Currently, the system produces prosody in
three different styles: child-like, adult expressive, and knowledgeable.
This prosody also exhibits differences within each style -- no two
simulations are alike. The limited resource approach captures some
of the stylistic and individual variety found in natural prosody.
Authors:
Belinda Collins, Australian National University (Australia)
Paper number 695
Abstract:
This paper explores the existence and nature of accommodation processes
within conversation, particularly convergence of the fundamental frequency
(F0) of conversational participants over time. The study raises a number
of issues related to methodologies for analysing interactional (typically
conversational) data. Most important is the issue of the applicability
of statistical sampling methods which are independent of the interactional
events occurring within the talk. It concludes with suggestions for a
methodology that examines long-term acoustic phenomena (long-term F0) and
relates events at the micro-acoustic level to interactional events within
a conversation.
Authors:
Hiroya Fujisaki, Department of Applied Electronics, Science University of Tokyo (Japan)
Sumio Ohno, Department of Applied Electronics, Science University of Tokyo (Japan)
Takashi Yagi, Department of Applied Electronics, Science University of Tokyo (Japan)
Takeshi Ono, Department of Applied Electronics, Science University of Tokyo (Japan)
Paper number 830
Abstract:
In order to test the validity of the authors' command-response model for
the generation of F0 contours in the analysis and interpretation of F0
contours of British English, F0 contours of utterances containing sections
that are not usually found in those of common Japanese are analyzed. The
results indicate that these F0 contours can indeed be generated by
selecting appropriate patterns of phrase and accent commands in a way that
is specific to British English, while using the same model for the
mechanism of F0 control as for common Japanese.
Authors:
Frode Holm, Speech Technology Laboratory (USA)
Kazue Hata, Speech Technology Laboratory (USA)
Paper number 1038
Abstract:
The task of generating natural human-sounding prosody for text-to-speech
(TTS) has historically been one of the most challenging problems that
researchers and developers have had to face. TTS systems have in general
become infamous for their "robotic" intonations. This paper describes
an approach to this problem which endeavors to capture as much detail
as possible from speech data, but in a way that avoids the "black boxes"
typical of neural networks and some vector clustering algorithms. Unlike
these latter methods, our approach may give feedback as to exactly
what the crucial parameters are that determine the successful choice
of pattern. Focusing on the notion of prosody templates, we confirmed
that a representative F0 and duration pattern can be extracted based
on stress pattern for a target proper noun which occurs in sentence-initial
position.
Authors:
Yasuo Horiuchi, Chiba University (Japan)
Akira Ichikawa, Chiba University (Japan)
Paper number 500
Abstract:
In this paper, we introduce a method of generating a prosodic tree
structure from the F0 contour of an utterance in order to analyze the
information expressed by prosody in Japanese spontaneous dialogue.
We define a connection rate, which expresses the strength of the
relationship between two adjacent prosodic units. By repeatedly merging
the two adjacent prosodic units with the highest connection rate, the tree
structure is gradually generated. To determine the parameters objectively, we applied the
principal component analysis to 32 dialogues from the Chiba Map Task
Dialogue Corpus. Then we applied our method to one dialogue. The results
suggested that the prosodic tree based on the first principal component
was concerned with the information telling what the speaker wanted
to do next and that the prosodic tree based on the second principal
component represented the syntactic and grammatical structure.
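A minimal sketch of the bottom-up merging described above. Representing
each prosodic unit by a mean F0 value and scoring adjacent pairs by F0
closeness are illustrative assumptions; the paper derives the connection
rate from F0 contour features via principal component analysis.

def mean_f0(node):
    if isinstance(node, tuple):                  # merged subtree
        return (mean_f0(node[0]) + mean_f0(node[1])) / 2.0
    return node                                  # leaf: a unit's mean F0

def connection_rate(left, right):
    # Hypothetical: units with close mean F0 connect strongly.
    return -abs(mean_f0(left) - mean_f0(right))

def build_prosodic_tree(units):
    nodes = list(units)
    while len(nodes) > 1:
        # Merge the adjacent pair with the highest connection rate.
        i = max(range(len(nodes) - 1),
                key=lambda k: connection_rate(nodes[k], nodes[k + 1]))
        nodes[i:i + 2] = [(nodes[i], nodes[i + 1])]
    return nodes[0]

print(build_prosodic_tree([210.0, 200.0, 150.0, 148.0]))
# ((210.0, 200.0), (150.0, 148.0))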
Authors:
Shunichi Ishihara, Japan Centre (Asian Studies), and Phonetics Laboratory, Department of Linguistics (Arts), The Australian National University (Australia)
Paper number 628
Abstract:
The Japanese dialect of Kagoshima (KJ) has two different surface pitch
patterns, (L)0 HL and (L)0 H. In this study, the properties of these
two surface pitch patterns of KJ will be acoustically-phonetically
described by means of z-score normalisation. Words consisting of two,
three, four and five syllables (each syllable having CV structure) were
used as test words, and the F0 of each syllable nucleus
was extracted and the raw F0 values were z-score normalised. Two native
speakers of KJ (one male and one female) participated in this study.
The tonal representation of KJ words will be discussed on the basis
of the z-score normalised results.
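A minimal sketch of the z-score normalisation named above, applied per
speaker to raw F0 values; the sample values are illustrative.

import numpy as np

def z_normalise(f0_values):
    # Per-speaker z-score normalisation of raw F0 values.
    f0 = np.asarray(f0_values, dtype=float)
    return (f0 - f0.mean()) / f0.std()

raw = [182.0, 195.0, 240.0, 176.0]   # hypothetical F0 (Hz) per nucleus
print(z_normalise(raw))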
Authors:
Koji Iwano, Department of Information and Communication Engineering, School of Engineering, University of Tokyo (Japan)
Keikichi Hirose, Department of Information and Communication Engineering, School of Engineering, University of Tokyo (Japan)
Paper number 731
Abstract:
We previously proposed a statistical model of moraic transitions
of fundamental frequency (F0) contours and showed its effectiveness
for prosodic boundary detection and accent type recognition. This
model represented F0 contours of prosodic words to simultaneously detect
and recognize prosodic word boundaries and accent types. This paper
proposes a method where prosodic word F0 contours are modeled separately
according to their accent types and presence/absence of succeeding
pauses. An utterance is regarded as a sequence of prosodic words under
a simple grammar. Each moraic F0 contour is represented by a pair
of codes: the original shape code and the newly introduced delta code
representing the degree of F0 change between the mora in question and
its preceding mora. Compared with earlier results, the boundary detection
rate improves from 87.7% to 91.5%. Accent type recognition rate reached
76.0% (type 1 accent discrimination).
Authors:
Tae-Yeoub Jang, Centre for Speech Technology Research, University of Edinburgh (U.K.)
Minsuck Song, Department of English Language and Literature, Kwandong University (Korea)
Kiyeong Lee, Department of Electronic Communication Engineering, Kwandong University (Korea)
Paper number 547
Abstract:
This paper describes research on the use of intonation for disambiguating
the utterance types of spoken Korean sentences. Based on tilt intonation
theory (Taylor and Black 1994), two related but separate experiments were
performed at the speaker-independent level, both using the Hidden Markov
Model training technique. In the first experiment, a system is built to
detect the rough boundary positions of major intonation events.
Subsequently, the significant parameters are extracted
from the products of the first experiment, which are directly used
to train the final models for utterance type disambiguation. Results
show that the intonation contour can be used as a significant meaning
distinguisher in an automatic speech recognition system of Korean as
well as in a natural human communication system.
Authors:
Oliver Jokisch, Technical Acoustics Laboratory, Dresden University of Technology (Germany)
Diane Hirschfeld, Technical Acoustics Laboratory, Dresden University of Technology (Germany)
Matthias Eichner, Technical Acoustics Laboratory, Dresden University of Technology (Germany)
Rüdiger Hoffmann, Technical Acoustics Laboratory, Dresden University of Technology (Germany)
Paper number 855
Abstract:
This paper presents a multi-level concept for generating speech rhythm
in the Dresden TTS system for German (DreSS). The rhythm control includes
the phrase, the syllable and the phoneme level. The concept allows the
alternative use of rule-based, statistical, or data-driven methods on
these levels. To create the rules and to train a neural network, a new
speech corpus was recorded from the original speakers of the diphone-based
inventories. The corpus covers texts and single utterances and is
subdivided into phrase, syllable and phoneme databases. First results
indicate that the rule-based and the trained methods generate comparable
speech rhythm if the databases are uniform. The stepwise duration control
on several prosodic levels shows promise as a method of producing a
flexible rhythm depending on the specific TTS application.
Authors:
Jiangping Kong, EE Dept. of City University of Hong Kong (China)
Paper number 104
Abstract:
This paper concerns the study of EGG (electroglottogram, recorded by
laryngograph) models of ditonemes in Mandarin. The parameters used to
establish the models are fundamental frequency (F0), which serves as a
reference, together with the speed quotient and open quotient; all are
extracted from the EGG signal using the software EGG.exe, an option of
the KAY CSL Model 4300B. The results show that the speed quotient and
open quotient have close relationships with F0 in different ditonemes.
In general, the speed quotient and open quotient decrease when F0
increases in sustained vowels. In ditonemes, however, the speed quotient
and open quotient behave differently according to position and
environment. The conclusion is that EGG models of ditonemes in Mandarin
are composed of the patterns of F0, speed quotient and open quotient.
Authors:
Geetha Krishnan, Carnegie Mellon University (USA)
Wayne Ward, Carnegie Mellon University (USA)
Paper number 930
Abstract:
In this study predictors of speech rate that are sensitive to local
and global rate changes, and relevant to different types of speakers,
were examined. Two groups of subjects, normal and disfluent speakers
(whose speech was clinically rated as "slow"), provided speech samples
at normal and fast rates. Samples were segmented into interstress
intervals (ISI) of varying length (i.e., varying number of syllables).
The compressibility of components within ISIs of varying length provided
information on local rate control strategies. The fast speech samples
were useful for examining strategies used in global rate increases.
Stressed vowels and intervowel intervals (IVI) showed similar compression
trends for both speaker groups, for local and global rate increases.
We then investigated two measures of speech rate based on intervowel
intervals: the ratio measure (IVI/ISI) and the average IVI. High correlation
of average IVI with phone rate was found. Results of speech rate estimations
are presented.
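A minimal sketch of the two interval-based measures named above, under
one plausible reading (total IVI within an interstress interval for the
ratio measure; mean IVI for the average); the timing values are
illustrative.

def rate_measures(vowel_onsets, isi_duration):
    # Intervowel intervals (IVI) within one interstress interval (ISI).
    ivis = [b - a for a, b in zip(vowel_onsets, vowel_onsets[1:])]
    ratio = sum(ivis) / isi_duration       # ratio measure (IVI/ISI)
    avg_ivi = sum(ivis) / len(ivis)        # average IVI
    return ratio, avg_ivi

ratio, avg_ivi = rate_measures([0.05, 0.21, 0.40, 0.55], isi_duration=0.62)
print(f"IVI/ISI = {ratio:.2f}, average IVI = {avg_ivi:.3f} s")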
Authors:
Haruo Kubozono, Kobe University (Japan)
Paper number 105
Abstract:
One of the major findings of the recent linguistic research on Japanese
is that the syllable plays a pivotal role in a variety of phonological
and morphological phenomena in the mora-based prosodic system of this
language. This paper attempts to reinforce this argument by proposing
a significant generalization of Japanese accentuation in terms of `syllable
weight', the idea that each syllable has a certain weight according
to its phonological configuration. Specifically, this analysis reveals
that Japanese accentuation is strikingly similar to that of Latin and
many languages with a Latin-type accent system, e.g. English. Moreover,
a sociolinguistic analysis of the accentual changes currently in progress
demonstrates that Japanese accentuation is becoming increasingly similar
to the Latin-type accent system, where the syllable plays a primary
role.
Authors:
Hyuck-Joon Lee, UCLA (USA)
Paper number 903
Abstract:
This paper investigates the degree to which an onset consonant of an
accentual phrase affects the f0 of the following syllables within the
phrase in Seoul Korean. Korean tense or aspirated onset consonants
raise f0 values of the following adjacent vowel, and when they are
positioned on the first syllable onset of an accentual phrase, they
continuously raise f0 values of the following non-adjacent vowels.
This f0 raising after aspirated or tense consonants supports the previous
claim that the microprosody in Korean is phonologized in phrase initial
position. The results also confirm the previous claim regarding the
location of the four underlying tones of the accentual phrase and the
interpolation hypothesis.
Authors:
Eduardo López, ETSIT-UPM (Spain)
Javier Caminero, Telefonica I+D (Spain)
Ismael Cortázar, Telefonica I+D (Spain)
Luis A. Hernández, ETSIT-UPM (Spain)
Paper number 353
Abstract:
In this paper we propose a strategy to improve the performance of a
connected number recognition system in Spanish using prosodic information.
Prosodic information is included as the detection of pitch movements
between what some studies of Spanish intonation call melodic units. The
basic linguistic background of our approach, together with the specific
strategies to detect and correct ambiguities and recognition errors, is
discussed. Experimental results show a 16% reduction in recognition
errors for our state-of-the-art connected number recognizer, as well as
the ability to resolve ambiguities that the recognizer alone cannot
handle.
Authors:
Kazuaki Maeda, Univ of Pennsylvania and Bell Labs (USA)
Jennifer J. Venditti, Bell Labs and Ohio State Univ (USA)
Paper number 800
Abstract:
Pitch movements at the boundaries of sentence-medial and final phrases
in Japanese can provide a cue to the speaker's intention. For example,
the phrase /Nagano-de/ 'in Nagano' can be uttered with different rising
and/or falling pitch movements on the final mora /de/ to convey
meanings such as clarification, incredulity, prominence in the discourse,
insistence, etc. The identification of these movements is important
not only for spoken language understanding systems, but also for natural-sounding
speech synthesis. The current study examines F0 shape, height, and
alignment characteristics of four distinct sentence-medial boundary
rises. We compare these types with accented and focused unaccented
words containing identical phonetic segments, and discuss a number
of different possible phonological analyses of the pitch movements.
Authors:
Kikuo Maekawa, The National Language Research Institute (Japan)
Paper number 997
Abstract:
Three Japanese sentences were uttered repeatedly by three speakers
with paralinguistic information indicating A(dmiration), D(isappointment),
F(ocus), I(ndifference), S(uspicion), and N(eutral). A perception test
using all recorded materials revealed that the average correct perception
rate was higher than 80 percent. Acoustic analyses revealed the following
phonetic characteristics:
- Considerable elongation of utterance duration in types A, D, and S.
- The first and last morae were more elastic in duration than the others.
- Narrowed pitch range in type D and enlarged pitch range in A, F, and S.
- Characteristic low pitch at the beginning of types A, D and S.
- Delayed accentual peak location in types A, D, and S.
- Systematic distributional difference of sentence-final vowels on the
  F1-F2 formant plane.
- Seeming 'laryngealization' in the initial low-pitched portion of types
  S, A, and D.
Multimedia files:
0997_01.WAV (sound file; WAV, 11025 Hz, 16-bit, mono; created with
Creative SoundStudio on Win95): Typical utterances of sentence 1) uttered
with paralinguistic information types A, D, F, I, N, and S, plus an
utterance of sentence 3), type N.
0997_02.PDF (image file, originally GIF; created with Imagetool on
Solaris 2.6): Instructions given to the speakers at the time of recording
and also to the subjects of the perception test.
Authors:
Arman Maghbouleh, Stanford University (USA)
Paper number 632
Abstract:
This paper describes work in progress for recognizing a subset of ToBI
intonation labels (H*, L+H*, L*, !H*, L+!H*, no accent). Initially,
duration characteristics are used to classify syllables as accented
or not. The accented syllables are then subclassified based on fundamental
frequency, F0, values. Potential F0 intonation gestures are schematized
by connected line segments within a window around a given syllable.
The schematizations are found using spline-basis linear regression.
The regression weights on F0 points are varied in order to discount
segmental effects and F0 detection errors. Parameters based on the
line segments are then used to perform the subclassification. This
paper presents new results in recognizing L*, L+H*, and L+!H* accents.
In addition, the models presented here perform comparably (80% overall,
and 74% accent type recognition) to models which do not distinguish
bitonal accents.
Authors:
Hansjörg Mixdorff, Dresden University of Technology (Germany)
Hiroya Fujisaki, Science University of Tokyo (Japan)
Paper number 707
Abstract:
In earlier studies the authors developed a model of German intonation
based on the quantitative Fujisaki-model. The present study examines
the influence of the segmental structure of an accented syllable on
the timing of accent commands. It aims at developing refined alignment
rules for speech synthesis. The corpus consists of 67 three-syllable
German words with word accent on the second syllable, each produced three
times by three male speakers. The words differ as to the structure
of the second syllable. It was observed that onsets of accent commands
can be most accurately predicted relative to the duration of the nuclear
vowel, with variations depending on the type of consonant preceding
the vowel. Accent command offsets are generally aligned with the offset
of the syllable. The effectiveness of the refined timing rules was
confirmed by an informal perception experiment.
Authors:
Osamu Mizuno, NTT Human Interface Labs. (Japan)
Shin'ya Nakajima, NTT Human Interface Labs. (Japan)
Paper number 1014
Abstract:
This paper proposes new prosodic feature control rules for constructing
semantic prosody control. Research was conducted into mental-state
tendencies using tests that examined subjects' sensitivity to the control
of synthetic speech prosody. The results revealed relationships between
prosodic control rules and non-verbal expressions. Duration control
reflects the information processing state in spoken dialogues. Control of
the sentence-final pitch contour reflects the reliability of the
information. Control of the pitch contour's dynamic range indicates the
speaker's excitement. Control of the pitch contour from onset to peak
indicates the speaker's demand for attention. Furthermore, for the
Multi-layered Speech/Sound Synthesis Control Language (MSCL), we construct
prosodic feature control commands using these prosodic control rules, and
semantic control commands using these relationships. MSCL realizes
expressive synthetic speech.
Authors:
Mitsuru Nakai, JAIST (Japan)
Hiroshi Shimodaira, JAIST (Japan)
Paper number 998
Abstract:
This paper describes a method of utilizing an ``F0 Reliability Field''
(FRF), which we proposed in previous work, for estimating the prosodic
commands of an F0 contour generation model. The FRF is a time-frequency
representation of F0 likelihood, and its advantage is that it is not
necessary to consider F0 errors that occur during automatic F0
determination. The FRF can therefore be a more useful feature for
automatic prosody analysis than the F0 contour itself, and our previous
paper reported its validity for detecting prosodic boundaries in Japanese
continuous speech. In this paper, we examine its validity for estimating
the prosodic commands of a superpositional model. Experimental results
show that command estimation with the FRF performs well, approaching the
accuracy of command estimation with an ideal F0 contour that contains no
F0 errors.
Authors:
Sumio Ohno, Department of Applied Electronics, Science University of Tokyo (Japan)
Hiroya Fujisaki, Department of Applied Electronics, Science University of Tokyo (Japan)
Hideyuki Taguchi, Department of Applied Electronics, Science University of Tokyo (Japan)
Paper number 935
Abstract:
Speech rate is one of the important prosodic parameters for the
naturalness and intelligibility of an utterance. On the basis of the
authors' definition of the relative local speech rate, the present paper
describes an analysis of the effects of changes in global speech rate,
syntactic constituency and lexical accent on the local speech rate, using
short utterances in which these factors are systematically controlled.
Preliminary results indicate that the span of changes in local speech
rate is the syllable rather than the mora, and also show interactions
between these factors.
Authors:
Sumio Ohno, Department of Applied Electronics, Science University of Tokyo (Japan)
Hiroya Fujisaki, Department of Applied Electronics, Science University of Tokyo (Japan)
Yoshikazu Hara, Department of Applied Electronics, Science University of Tokyo (Japan)
Paper number 936
Abstract:
A command-response model for the process of F0 contour generation has
been presented by Fujisaki and his coworkers. The present paper describes
the results of a study on the variability and speech rate dependency
of the model's parameters in utterances of a speaker of Japanese.
It was found that parameters alpha and beta can be considered to be
practically constant at a given speech rate, while Fb may vary slightly
from utterance to utterance. Among these three parameters, only alpha
was found to have a small but systematic tendency to increase with
the speech rate.
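For reference, a standard formulation of the command-response (Fujisaki)
model, with base frequency Fb and control parameters alpha and beta as
above (gamma is the usual ceiling on the accent component; this is the
textbook form, not necessarily the exact variant used in the paper):

\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{p_i}\, G_p(t - T_{0i})
           + \sum_{j=1}^{J} A_{a_j}\,\bigl[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \bigr]

G_p(t) = \begin{cases} \alpha^2 t\, e^{-\alpha t}, & t \ge 0 \\ 0, & t < 0 \end{cases}
\qquad
G_a(t) = \begin{cases} \min\bigl[ 1 - (1 + \beta t)\, e^{-\beta t},\ \gamma \bigr], & t \ge 0 \\ 0, & t < 0 \end{cases}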
Authors:
Thomas Portele, IKP, University of Bonn (Germany)
Barbara Heuft, IKP, University of Bonn - now: Philips Speech Processing, Aachen (Germany)
Paper number 526
Abstract:
The maximum-based description is a simple, linear parametrization method
for F0 contours. An F0 maximum is characterized by four parameters:
its position, its height, its left and its right slope. An automatic
parametrization algorithm was developed. A perceptual evaluation was
carried out for German and for English. The perceptual equality between
original and parametrized contours was confirmed.
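A minimal sketch of the four-parameter description above: each local F0
maximum is recorded with its position, height, and left and right slopes.
The slope definition used here is an assumption; the paper's automatic
parametrization algorithm is not reproduced.

import numpy as np

def describe_maxima(t, f0):
    t, f0 = np.asarray(t, dtype=float), np.asarray(f0, dtype=float)
    params = []
    for i in range(1, len(f0) - 1):
        if f0[i - 1] < f0[i] >= f0[i + 1]:        # local maximum
            left = (f0[i] - f0[i - 1]) / (t[i] - t[i - 1])
            right = (f0[i + 1] - f0[i]) / (t[i + 1] - t[i])
            params.append({"position": t[i], "height": f0[i],
                           "left_slope": left, "right_slope": right})
    return params

print(describe_maxima([0.0, 0.1, 0.2, 0.3], [100.0, 160.0, 120.0, 110.0]))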
Authors:
Thomas Portele, IKP, University of Bonn (Germany)
Paper number 527
Abstract:
This paper describes the relationships between perceived prominence
as a gradual value and some acoustic-prosodic parameters. Prominence
is used as an intermediate parameter in a speech synthesis system.
A corpus of American English utterances was constructed by measuring
and annotating various linguistic, acoustic and perceptual parameters
and features. The investigation of the corpus revealed some strong
and some rather weak relations between prominence and acoustic-prosodic
parameters that serve as a starting point for the development of prominence-based
rules for the synthesis of American English prosody in a content-to-speech
system.
Authors:
Erhard Rank, Austrian Research Institute for Artificial Intelligence (Austria)
Hannes Pirker, Austrian Research Institute for Artificial Intelligence (Austria)
Paper number 975
Abstract:
We describe the attempt to synthesize emotional speech with a concatenative
speech synthesizer using a parameter space covering not only f0, duration
and amplitude, but also voice quality parameters, spectral energy distribution,
harmonics-to-noise ratio, and articulatory precision. The application of
this extended parameter set offers the possibility of combining the high
segmental quality of concatenative synthesis with the wider range of
control settings needed for the synthesis of natural affective speech.
Authors:
Albert Rilliard, Institut de la Communication Parlee (France)
Véronique Aubergé, Institut de la Communication Parlee (France)
Paper number 1086
Abstract:
We present a perceptual experiment measuring the linguistic segmentation
cues carried by prosody. The selected paradigm is a dissociation test
between pairs of stimuli. The sentences contain several segmentation
variants at the group and clause levels, and pairs are formed from all
possible combinations of two sentences from the corpus. Twenty listeners
were able to associate both the similar regions and the syntactic-level
boundaries. They tolerate a shift of a single syllable in the position of
the major syntactic boundary in the reiterated stimuli. They distinguish
the pairs that do not share the same boundaries. The results also show
that listeners are puzzled (responding at chance) when the proposed
boundaries delimit complex segments.
Authors:
Kazuhito Koike, Keio University (Japan)
Hirotaka Suzuki, Keio University (Japan)
Hiroaki Saito, Keio University (Japan)
Paper number 996
Abstract:
The importance of speech prosody is increasing as spontaneous interaction
between humans and machines is demanded. This paper examines how prosody
contributes emotion to speech. The major elements determining emotion are
the pitch, tempo, and stress of speech; the last two correspond to the
duration and power of syllables, respectively. We chose five emotions to
be tested: anger, surprise, sorrow, hate, and joy. To verify our
analysis, we implemented a speech synthesis module which can easily
control the prosodic parameters of output speech. Responses to the
synthesized speech show that the parameter settings for anger, sorrow and
hate are confirmed in over 85% of cases. Experimental results also
suggest that surprise and joy tend to depend less on prosody.
Authors:
Barbertje M. Streefkerk, Institute of Phonetic Sciences, Amsterdam (The Netherlands)
Louis C.W. Pols, Institute of Phonetic Sciences, Amsterdam (The Netherlands)
Louis F.M. ten Bosch, Lernout & Hauspie Speech Products N.V. (Belgium)
Paper number 285
Abstract:
This paper describes a first step towards the automatic classification
of prominence (as defined by naive listeners). In a listening experiment,
each word in 500 sentences was marked on a rating scale between '0'
(non-prominent) and '10' (very prominent). These prominence labels are
compared with the following acoustic features: the loudness of each
vowel, and the F0 range and duration of each syllable. A linear
relationship between the prominence ratings and these acoustic features
is found. These acoustic features are then used for a preliminary
automatic classification to predict prominence.
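A minimal sketch of fitting that linear relationship by least squares;
the feature values and ratings below are illustrative, with the three
features following the abstract (vowel loudness, syllable F0 range and
duration).

import numpy as np

X = np.array([[60.0,  40.0, 0.18],   # loudness, F0 range, duration
              [70.0,  90.0, 0.25],
              [55.0,  20.0, 0.12],
              [75.0, 110.0, 0.30]])
y = np.array([3.0, 8.0, 1.0, 9.0])   # prominence ratings (0-10)

A = np.hstack([X, np.ones((len(X), 1))])    # add intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("weights:", coef[:-1], "intercept:", coef[-1])
print("predicted ratings:", A @ coef)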
Authors:
Masafumi Tamoto, NTT Basic Research Laboratories (Japan)
Takeshi Kawabata, NTT Basic Research Laboratories (Japan)
Paper number 1099
Abstract:
We propose a new discrimination schema for illocutionary acts using
prosodic features, based on experimental results. Given the transcribed
sentence with contextual information, subjects were able to correctly
identify the sentence type for 85% of 290 sentences. With information
about the intonation contour types, they could correctly identify 90%
of illocutionary acts. We find evidence that illocutionary acts can be
signaled by specific contour types. These typical contours are realized
in the sentence-final boundary tone: a neutral or falling tone for
assertion and request, a rising tone for question. An intonation contour
is then identified using an algorithm that calculates the range and slope
of the upper and lower bounds of the unwarped segmental contour and
matches these against predefined contour templates. With this automated
intonation contour classification, nearly 90% of illocutionary acts
could be correctly identified. (URL: http://www.brl.ntt.co.jp/info/dug/)
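A minimal sketch of that matching step: summarize a final stretch of F0
by the slopes of its upper and lower envelopes and pick the nearest
predefined template. The template values, envelope window, and distance
metric are illustrative assumptions.

import numpy as np

TEMPLATES = {
    "assertion/request": {"upper_slope": -20.0, "lower_slope": -25.0},
    "question":          {"upper_slope":  40.0, "lower_slope":  30.0},
}

def classify_contour(t, f0, win=3):
    t, f0 = np.asarray(t, dtype=float), np.asarray(f0, dtype=float)
    # Running max/min approximate the upper and lower bounds.
    upper = np.array([f0[max(0, i - win):i + win + 1].max()
                      for i in range(len(f0))])
    lower = np.array([f0[max(0, i - win):i + win + 1].min()
                      for i in range(len(f0))])
    feats = {"upper_slope": np.polyfit(t, upper, 1)[0],
             "lower_slope": np.polyfit(t, lower, 1)[0]}
    # Nearest template by squared distance over the slope features.
    return min(TEMPLATES, key=lambda k: sum((feats[p] - TEMPLATES[k][p]) ** 2
                                            for p in feats))

t = np.linspace(0.0, 0.5, 20)
print(classify_contour(t, 120.0 + 35.0 * t))   # rising tail -> "question"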
Authors:
Wataru Tsukahara, Mech-Info Engineering, University of Tokyo (Japan)
Paper number 955
Abstract:
In human dialog a wide variety of acknowledgments are used. One function
of this seems to be indicating attention, interest, and involvement
to the other speaker, and we believe this is an important factor in
encouraging him and keeping up his interest. Thus, in this paper we
focus on the problem of choosing appropriate acknowledgments at each
point. Based on study of Japanese memory game dialogs, we propose
an algorithm for choosing among acknowledgment responses, including
`hai' (yes), `so' (right), and `un' (mm). The primary factors involved
are aspects of the speaker's internal state, including confidence and
liveliness, as inferred from the context and the speaker's prosody.
Evaluation of naturalness and helpfulness of dialog generated by this
algorithm suggests that judges prefer rule-based responses to randomly
chosen responses, confirming our hypothesis that `sensitive' and subtle
choice of response may improve helpfulness and naturalness of man-machine
spoken language interaction.
Authors:
Chao Wang, MIT Laboratory for Computer Science (USA)
Stephanie Seneff, MIT Laboratory for Computer Science (USA)
Paper number 535
Abstract:
Prosodic cues (namely, fundamental frequency, energy and duration)
provide important information for speech recognition. For a tonal language such
as Chinese, fundamental frequency (F0) plays a critical role in characterizing
tone as well, which is an essential phonemic feature. In this paper,
we describe our work on duration and tone modeling for telephone-quality
continuous Mandarin digits, and the application of these models to
improve recognition. The duration modeling includes a speaking-rate
normalization scheme. A novel F0 extraction algorithm is developed,
and parameters based on orthonormal decomposition of the F0 contour
are extracted for tone recognition. Context dependency is expressed
by ``tri-tone'' models clustered into broad classes. A 20.0% error
rate is achieved for four-tone classification. Over a baseline recognition
performance of 5.1% word error rate, we achieve 31.4% error reduction
with duration models, 23.5% error reduction with tone models, and 39.2%
error reduction with duration and tone models combined.
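As a worked check on those figures, relative error reduction is
(baseline - new) / baseline; a sketch:

def relative_reduction(baseline, new):
    # Relative error reduction: (baseline - new) / baseline.
    return (baseline - new) / baseline

baseline_wer = 5.1                              # % word error rate
for name, pct in [("duration", 31.4), ("tone", 23.5), ("combined", 39.2)]:
    new_wer = baseline_wer * (1.0 - pct / 100.0)
    print(f"{name}: {baseline_wer}% -> {new_wer:.2f}% WER, "
          f"{relative_reduction(baseline_wer, new_wer):.1%} relative reduction")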
Authors:
Sandra P. Whiteside, University of Sheffield (U.K.)
Paper number 153
Abstract:
This brief study presents a set of acoustic correlates for a number
of vocal emotions simulated by two actors. Five short sentences were
used in the simulations. The emotions simulated were neutral, cold
anger, hot anger, happiness, sadness, interest and elation. The seventy
sentences were digitised and a number of acoustic analyses were carried
out, including several perturbation measures. The acoustic parameters
investigated were: i) mean overall fundamental frequency (Hz); ii)
overall mean energy (dB); iii) overall mean standard deviation of energy
(dB); iv) mean overall jitter (%); and v) mean overall shimmer (dB).
These acoustic parameters were used to profile the vocal emotions.
Results showed that the actors displayed similarities in their acoustic
profiles for some emotions, such as anger and sadness. The results are
presented and discussed in brief.
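A minimal sketch of measures iv) and v) under their common definitions
(local jitter as the mean absolute period-to-period difference relative
to the mean period; shimmer as the mean absolute cycle-to-cycle amplitude
ratio in dB); the paper may use different variants, and the sample values
are illustrative.

import numpy as np

def jitter_percent(periods):
    # Mean absolute period-to-period difference, relative to mean period.
    p = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(p))) / p.mean()

def shimmer_db(amplitudes):
    # Mean absolute cycle-to-cycle amplitude ratio, in dB.
    a = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(20.0 * np.log10(a[1:] / a[:-1])))

periods = [0.0050, 0.0051, 0.0049, 0.0052]   # glottal periods (s)
amps = [0.80, 0.78, 0.82, 0.79]              # peak amplitude per cycle
print(f"jitter = {jitter_percent(periods):.2f}%, "
      f"shimmer = {shimmer_db(amps):.2f} dB")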
Authors:
Jin-song Zhang, University of Tokyo (Japan)
Keikichi Hirose, University of Tokyo (Japan)
Paper number 674
Abstract:
This paper proposes a scheme of using F0 contours of vowel nuclei to
discriminate Chinese lexical tones. The authors suggest that the F0
contour fragment of a vowel nucleus of a syllable contributes most
to tone perception of the syllable. To correlate the F0 contour with
the phonemic components of a syllable, a tone-based syllabic structure
is also proposed. Tone recognition experiments on a speaker-independent
disyllabic word task demonstrated the effectiveness of the proposed
method. Better performance than that of approaches observing full
syllabic F0 contours indicates that the proposed method is a more robust
tone recognition method.
Authors:
Xiaonong Sean Zhu, ANU (Australia)
Paper number 423
Abstract:
This paper examines the F0 variations during tone sandhi due to various
prosodic factors such as phonation type, length, stress and pitch height.
It will be shown that the F0 height and shape of the second syllable
(S2) in disyllabic words are determined by the interaction of four
conditions: the voicing of the intervocalic consonant (C2), the
truncation of S2, the F0 height of S1, and stress assignment.
Authors:
Natalija Bolfan-Stosic, Acoustic Laboratory for Speech and Hearing,University of Zagreb (Croatia)
Tatjana Prizl, Acoustic Laboratory for Speech and Hearing,University of Zagreb (Croatia)
Paper number 103
Abstract:
A study was undertaken to determine differences in jitter and shimmer in
the voices of children with different syndromes. The voices of 60
children of both sexes, aged 7-12 years, were analysed with the EZ Voice
Analysis software (a program for measuring jitter and shimmer). The main
purpose of this paper is diagnostic. The results show which acoustic
indicators of pathological voice occur in each group of children, and in
what form they appear. In this way, we try to find the easiest way to
explain the acoustic characteristics of different voice pathologies as an
aid to diagnosis. The results indicate that the children with stuttering
and dysarthric symptoms have higher values on almost all measured
variables than the average values of children from the other groups.
Children with Down syndrome and hearing losses exhibited the most
disordered voice quality. Finally, the mixed group (stuttering with
dysphonia) and the group of children with dysphonia exhibited the least
pathological voice characteristics. An analysis of variance showed
statistically significant differences among the groups on all measured
variables.