Prosody and Emotion 2

ICSLP'98 Proceedings

De-accentuation: Linguistic Environments and Prosodic Realizations

Authors:

Kai Alter, Max-Planck-Institute of Cognitive Neuroscience, Leipzig (Germany)
Karsten Steinhauer, Max-Planck-Institute of Cognitive Neuroscience, Leipzig (Germany)
Angela D. Friederici, Max-Planck-Institute of Cognitive Neuroscience, Leipzig (Germany)

Page (NA) Paper number 258

Abstract:

In this paper we present a preliminary speech production study concerning the prosodic realization of the syntactic and information structure in German. First, we made predictions about relative prominence and its association with tonal patterns. Second, exhaustive acoustic analyses were used to test these expectations. Data from a production experiment with seven non-instructed normal subjects were analyzed and then compared with data from one patient with prosodic disorders.

SL980258.PDF (From Author) SL980258.PDF (Rasterized)

Towards an Automatic Classification of Emotions in Speech

Authors:

N. Amir, Center for Technological Education Holon (Israel)
S. Ron, Center for Technological Education Holon (Israel)

Page (NA) Paper number 199

Abstract:

In this paper we discuss a method for extracting emotional state from the speech signal. We describe a methodology for obtaining emotive speech, and a method for identifying the emotion present in the signal. This method is based on analysing the signal over sliding windows and extracting a representative parameter set. A set of basic emotions is defined, and for each such emotion a reference point is computed. At each instant the distance of the measured parameter set from the reference points is calculated and used to compute a fuzzy membership index for each emotion, which we term the "emotional index". Preliminary results are presented which demonstrate the discriminative abilities of this method when applied to a number of speakers.
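
The abstract names the distance-to-reference scheme but not the distance metric or membership function. A minimal sketch, assuming Euclidean distance and inverse-distance normalisation; the feature set and reference values are purely illustrative:

```python
import numpy as np

# Hypothetical reference points: mean feature vectors (e.g. F0 mean,
# relative energy, syllable rate) computed per emotion from labelled
# training windows. Values are illustrative, not from the paper.
REFERENCES = {
    "anger":   np.array([220.0, 0.80, 5.5]),
    "sadness": np.array([140.0, 0.30, 3.0]),
    "joy":     np.array([240.0, 0.70, 5.0]),
    "neutral": np.array([170.0, 0.50, 4.2]),
}

def emotional_index(features, references=REFERENCES, eps=1e-9):
    """Fuzzy membership per emotion from distances to reference points.

    Membership is inversely proportional to the Euclidean distance to
    each emotion's reference point, normalised so the indices sum to 1.
    """
    inv = {e: 1.0 / (np.linalg.norm(features - r) + eps)
           for e, r in references.items()}
    total = sum(inv.values())
    return {e: v / total for e, v in inv.items()}
```

A window whose parameter set lies close to one reference point receives a membership near 1 for that emotion and small memberships for the rest.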

SL980199.PDF (Scanned)

Can We Hear Smile?

Authors:

Marc Schröder, ICP (France)
Véronique Aubergé, ICP (France)
Marie-Agnès Cathiard, ICP (France)

Page (NA) Paper number 439

Abstract:

The expression of amusement is both visible and audible in speech. After recording comparable spontaneous, acted, mechanical, reiterated and seduction stimuli, five perceptual experiments were conducted, based mainly on the hypothesis of prosodically controlled effects of amusement on speech. Results show that audio is partially independent from video, which performs as well as audio-video. Spontaneous speech (involuntarily controlled) can be distinguished from acted speech (voluntarily controlled), and amusement speech can be distinguished from seduction speech.

SL980439.PDF (From Author) SL980439.PDF (Rasterized)

The Automatic Marking of Prominence in Spontaneous Speech Using Duration and Part of Speech Information

Authors:

Matthew Aylett, Human Communication Research Centre, University of Edinburgh (U.K.)
Matthew Bull, Human Communication Research Centre (U.K.)

Page (NA) Paper number 825

Abstract:

The work reported in this paper was the result of the need to label a large corpus of spontaneous, task-oriented dialogue with prosodic prominences. A computational model using only word duration, part of speech and a dictionary lookup of each word's canonical phonemic contents was trained against the results of a human coder marking prominence. Because word durations were normalised, it was possible to set a common threshold for all members of a form class above which the lexically stressed syllables were classed as prominent. The method used is presented and the relative importance of duration information, phonemic contents, syllabic context and part-of-speech information is explored. The automatic coder was validated against unseen material and achieved 58% agreement with a human coder. Further investigation showed that three human coders agreed no better with each other than each agreed with the computational model. Thus, although the automatic system did not conform very well to the performance of any one human coder, it conformed as well as another human coder might.
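
The thresholding step can be sketched as follows. The field names, the normalisation by a dictionary-derived canonical duration, and the threshold values are assumptions for illustration; the paper trains its per-form-class thresholds against a human coder's marks:

```python
def mark_prominence(words, thresholds):
    """Flag a word as prominent when its normalised duration exceeds
    the threshold for its part-of-speech (form) class.

    `words` is a list of dicts with 'duration' (observed, seconds),
    'expected' (canonical duration from the dictionary's phonemic
    contents) and 'pos' (form class). Hypothetical field names.
    """
    marked = []
    for w in words:
        norm = w["duration"] / w["expected"]       # duration normalisation
        marked.append(norm > thresholds.get(w["pos"], 1.0))
    return marked
```

Words whose normalised duration clears their class threshold would then have their lexically stressed syllable classed as prominent.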

SL980825.PDF (From Author) SL980825.PDF (Rasterized)

On A Pitch Alteration Technique in Excited Cepstral Spectrum for High Quality TTS

Authors:

JongDeuk Kim, Dept. of Telecommunication Engineering, Soongsil University (Korea)
SeongJoon Baek, School of Electrical Engineering, Seoul National University (Korea)
MyungJin Bae, Dept. of Telecommunication Engineering, Soongsil University (Korea)

Page (NA) Paper number 1020

Abstract:

In the area of speech synthesis techniques, waveform coding methods maintain the intelligibility and naturalness of synthetic speech. In order to apply waveform coding or hybrid coding techniques to synthesis by rule, we must be able to alter the pitch of the synthetic speech. In this paper, we propose a new pitch alteration method that minimizes spectrum distortion by exploiting the behavior of the cepstrum. The method splits the spectrum of the speech signal into an excitation spectrum and a formant spectrum, and transforms the excitation spectrum into the cepstrum domain. The pitch of the excitation cepstrum is altered by zero insertion or zero deletion, and the pitch-altered spectrum is then reconstructed in the spectrum domain. In a performance test, the average spectrum distortion was below 2.29%, while that of the conventional method was 2.47%.
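
The split into formant (envelope) and excitation parts can be sketched with standard cepstral liftering. The cutoff quefrency and the use of the real cepstrum are assumptions; the paper's zero insertion/deletion step on the excitation cepstrum is not reproduced here:

```python
import numpy as np

def split_cepstrum(frame, cutoff=32):
    """Split a speech frame's log spectrum into formant-envelope and
    excitation parts via cepstral liftering.

    Quefrencies below `cutoff` samples (and their symmetric mirror)
    form the formant cepstrum; the remainder is the excitation
    cepstrum, which a pitch-alteration method could then modify.
    """
    spec = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spec) + 1e-12)
    cep = np.fft.irfft(log_mag, n=len(frame))   # real cepstrum
    formant_cep = cep.copy()
    formant_cep[cutoff:-cutoff] = 0.0           # low-quefrency lifter
    excitation_cep = cep - formant_cep          # high-quefrency remainder
    return formant_cep, excitation_cep
```

Exponentiating the FFT of either part recovers the corresponding magnitude-spectrum factor, so the two spectra multiply back to the original magnitude spectrum.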

SL981020.PDF (Scanned)

Dovetailing of Acoustics and Prosody in Spontaneous Speech Recognition

Authors:

Jan Buckow, University of Erlangen (Germany)
Anton Batliner, University of Erlangen (Germany)
Richard Huber, University of Erlangen (Germany)
Elmar Nöth, University of Erlangen (Germany)
Volker Warnke, University of Erlangen (Germany)
Heinrich Niemann, University of Erlangen (Germany)

Page (NA) Paper number 336

Abstract:

In VERBMOBIL we previously augmented the output of a word recognizer with prosodic information. Here we present a new approach that interleaves word recognition and prosodic processing. While we still use the output of a word recognizer to determine phrase boundaries, we do not wait until the end of the utterance before we start processing. Instead, we intercept chunks of word hypotheses during the forward search of the recognizer. Neural networks and language models are used to predict phrase boundaries. Those boundary hypotheses, in turn, are used by the recognizer to cut the stream of incoming speech into syntactic-prosodic phrases. Thus, incremental processing is possible. We investigate which features are suited for incremental prosodic processing and compare them w.r.t. classification performance and efficiency. We show that, with a set of features that can be computed efficiently, classification results are achieved which are almost as good as those with the previously used, computationally more expensive features.

SL980336.PDF (From Author) SL980336.PDF (Rasterized)

A Computational Memory and Processing Model for Prosody

Authors:

Janet E. Cahn, Massachusetts Institute of Technology (USA)

Page (NA) Paper number 991

Abstract:

This paper links prosody to the information in the text and how it is processed by the speaker. It describes the operation and output of Loq, a text-to-speech implementation that includes a model of limited attention and working memory. Attentional limitations are key. Varying the attentional parameter in the simulations varies in turn what counts as given and new in a text, and therefore, the intonational contours with which it is uttered. Currently, the system produces prosody in three different styles: child-like, adult expressive, and knowledgeable. This prosody also exhibits differences within each style -- no two simulations are alike. The limited resource approach captures some of the stylistic and individual variety found in natural prosody.

SL980991.PDF (From Author) SL980991.PDF (Rasterized)

Convergence Of Fundamental Frequencies In Conversation: If It Happens, Does It Matter?

Authors:

Belinda Collins, Australian National University (Australia)

Page (NA) Paper number 695

Abstract:

This paper explores the existence and nature of accommodation processes within conversation, particularly convergence of the fundamental frequency (F0) of conversational participants over time. The study raises a number of issues related to methodologies for analysing interactional (typically conversational) data. Most important is the issue of the applicability of statistical sampling methods which are independent of the interactional events occurring within the talk. It concludes with suggestions for a methodology that examines long-term acoustic phenomena (long-term F0) and relates events at the micro-acoustic level to interactional events within a conversation.

SL980695.PDF (From Author) SL980695.PDF (Scanned)

Analysis and Interpretation of Fundamental Frequency Contours of British English in Terms of a Command-Response Model

Authors:

Hiroya Fujisaki, Department of Applied Electronics, Science University of Tokyo (Japan)
Sumio Ohno, Department of Applied Electronics, Science University of Tokyo (Japan)
Takashi Yagi, Department of Applied Electronics, Science University of Tokyo (Japan)
Takeshi Ono, Department of Applied Electronics, Science University of Tokyo (Japan)

Page (NA) Paper number 830

Abstract:

In order to test the validity of the authors' command-response model for the generation of F0 contours in the analysis and interpretation of F0 contours of British English, F0 contours of utterances containing sections that are not usually found in those of Common Japanese are analyzed. The results indicate that these F0 contours can indeed be generated by selecting appropriate patterns of phrase and accent commands in a way that is specific to British English, while using the same model for the mechanism of F0 control as for Common Japanese.
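
The command-response model's standard form superposes a baseline value with phrase-command and accent-command responses in the log-F0 domain. A sketch with conventional default constants; the command values in the usage test are illustrative, not taken from the paper:

```python
import numpy as np

def command_response_f0(t, fb, phrase_cmds, accent_cmds,
                        alpha=3.0, beta=20.0, gamma=0.9):
    """F0 contour from the command-response (Fujisaki-type) model.

    ln F0(t) = ln Fb + sum_i Ap_i*Gp(t - T0_i)
                     + sum_j Aa_j*[Ga(t - T1_j) - Ga(t - T2_j)]

    phrase_cmds: list of (T0, Ap) impulse phrase commands.
    accent_cmds: list of (T1, T2, Aa) pedestal accent commands.
    alpha, beta, gamma are conventional defaults, not fitted values.
    """
    def Gp(x):                       # phrase-control response
        x = np.maximum(x, 0.0)
        return np.where(x > 0, alpha**2 * x * np.exp(-alpha * x), 0.0)

    def Ga(x):                       # accent-control response (clipped)
        x = np.maximum(x, 0.0)
        return np.minimum(1.0 - (1.0 + beta * x) * np.exp(-beta * x), gamma)

    ln_f0 = np.log(fb) * np.ones_like(t)
    for T0, Ap in phrase_cmds:
        ln_f0 += Ap * Gp(t - T0)
    for T1, T2, Aa in accent_cmds:
        ln_f0 += Aa * (Ga(t - T1) - Ga(t - T2))
    return np.exp(ln_f0)
```

Selecting language-specific command patterns, as the paper does for British English, amounts to choosing the timing and amplitude lists passed to this function.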

SL980830.PDF (From Author) SL980830.PDF (Rasterized)

Common Patterns In Word Level Prosody

Authors:

Frode Holm, Speech Technology Laboratory (USA)
Kazue Hata, Speech Technology Laboratory (USA)

Page (NA) Paper number 1038

Abstract:

The task of generating natural human-sounding prosody for text-to-speech (TTS) has historically been one of the most challenging problems that researchers and developers have had to face. TTS systems have in general become infamous for their "robotic" intonations. This paper describes an approach to this problem which endeavors to capture as much detail as possible from speech data, but in a way that avoids the "black boxes" typical of neural networks and some vector clustering algorithms. Unlike these latter methods, our approach may give feedback as to exactly what the crucial parameters are that determine the successful choice of pattern. Focusing on the notion of prosody templates, we confirmed that a representative F0 and duration pattern can be extracted based on stress pattern for a target proper noun which occurs in sentence-initial position.

SL981038.PDF (From Author) SL981038.PDF (Rasterized)

Prosodic Structure in Japanese Spontaneous Speech

Authors:

Yasuo Horiuchi, Chiba University (Japan)
Akira Ichikawa, Chiba University (Japan)

Page (NA) Paper number 500

Abstract:

In this paper, we introduce a method of generating a prosodic tree structure from the F0 contour of an utterance in order to analyze the information expressed by prosody in Japanese spontaneous dialogue. The connection rate, which means the strength of the relationship between two prosodic units, is defined. By repeatedly combining the two adjacent prosodic units where the rate is high, the tree structure is gradually generated. To determine the parameters objectively, we applied the principal component analysis to 32 dialogues from the Chiba Map Task Dialogue Corpus. Then we applied our method to one dialogue. The results suggested that the prosodic tree based on the first principal component was concerned with the information telling what the speaker wanted to do next and that the prosodic tree based on the second principal component represented the syntactic and grammatical structure.

SL980500.PDF (From Author) SL980500.PDF (Rasterized)

An Acoustic-Phonetic Description Of Word Tone In Kagoshima Japanese

Authors:

Shunichi Ishihara, Japan Centre (Asian Studies), and Phonetics Laboratory, Department of Linguistics (Arts), The Australian National University (Australia)

Page (NA) Paper number 628

Abstract:

The Japanese dialect of Kagoshima (KJ) has two different surface pitch patterns, (L)0 HL and (L)0 H. In this study, the properties of these two surface pitch patterns of KJ are described acoustic-phonetically by means of z-score normalisation. Words consisting of two, three, four and five syllables (each syllable having CV structure) were used as test words; the F0 of each syllable nucleus was extracted and the raw F0 values were z-score normalised. Two native speakers of KJ (one male and one female) participated in this study. The tonal representation of KJ words is discussed on the basis of the z-score normalised results.
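
The z-score step is a per-speaker standardisation; a minimal sketch, assuming the mean and standard deviation are computed over whichever F0 sample is passed in:

```python
import numpy as np

def z_normalise_f0(f0_values):
    """Z-score normalise raw F0 values within one speaker.

    Subtracting the speaker's mean and dividing by the standard
    deviation removes speaker-specific pitch level and range, so
    contours from a male and a female speaker become comparable.
    """
    f0 = np.asarray(f0_values, dtype=float)
    return (f0 - f0.mean()) / f0.std()
```

After normalisation the two surface pitch patterns can be compared on a common scale regardless of each speaker's absolute F0.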

SL980628.PDF (From Author) SL980628.PDF (Scanned)

Representing Prosodic Words Using Statistical Models of Moraic Transition of Fundamental Frequency Contours of Japanese

Authors:

Koji Iwano, Department of Information and Communication Engineering, School of Engineering, University of Tokyo (Japan)
Keikichi Hirose, Department of Information and Communication Engineering, School of Engineering, University of Tokyo (Japan)

Page (NA) Paper number 731

Abstract:

We have formerly proposed a statistical model of moraic transitions of fundamental frequency (F0) contours and showed its effectiveness for prosodic boundary detection and accent type recognition. This model represented F0 contours of prosodic words to simultaneously detect and recognize prosodic word boundaries and accent types. This paper proposes a method where prosodic word F0 contours are modeled separately according to their accent types and presence/absence of succeeding pauses. An utterance is regarded as a sequence of prosodic words under a simple grammar. Each moraic F0 contour is represented by a pair of codes; the original shape code and the newly introduced delta code representing the degree of F0 change between the mora in question and its preceding mora. Compared with earlier results, the boundary detection rate improves from 87.7% to 91.5%. Accent type recognition rate reached 76.0% (type 1 accent discrimination).

SL980731.PDF (From Author) SL980731.PDF (Rasterized)

Disambiguation of Korean Utterances Using Automatic Intonation Recognition

Authors:

Tae-Yeoub Jang, Centre for Speech Technology Research, University of Edinburgh (U.K.)
Minsuck Song, Department of English Language and Literature, Kwandong University (Korea)
Kiyeong Lee, Department of Electronic Communication Engineering, Kwandong University (Korea)

Page (NA) Paper number 547

Abstract:

This paper describes research on the use of intonation for disambiguating the utterance types of spoken Korean sentences. Based on the Tilt intonation theory (Taylor and Black 1994), two related but separate experiments were performed at the speaker-independent level, both using Hidden Markov Model training. In the first experiment, a system is established to detect the rough boundary positions of major intonation events. Subsequently, the significant parameters are extracted from the products of the first experiment and used directly to train the final models for utterance-type disambiguation. Results show that the intonation contour can be used as a significant meaning distinguisher in an automatic speech recognition system for Korean, as well as in natural human communication.

SL980547.PDF (From Author) SL980547.PDF (Rasterized)

Multi-Level Rhythm Control for Speech Synthesis Using Hybrid Data Driven and Rule-Based Approaches

Authors:

Oliver Jokisch, Technical Acoustics Laboratory, Dresden University of Technology (Germany)
Diane Hirschfeld, Technical Acoustics Laboratory, Dresden University of Technology (Germany)
Matthias Eichner, Technical Acoustics Laboratory, Dresden University of Technology (Germany)
Rüdiger Hoffmann, Technical Acoustics Laboratory, Dresden University of Technology (Germany)

Page (NA) Paper number 855

Abstract:

This paper presents a multi-level concept for generating speech rhythm in the Dresden TTS system for German (DreSS). The rhythm control covers the phrase, syllabic and phonemic levels. The concept allows the alternative use of rule-based or statistical, as well as data-driven, methods on these levels. To create the rules and to train a neural network, a new speech corpus from the original speakers of the diphone-based inventories has been recorded. The corpus covers texts and single utterances and is subdivided into phrase, syllabic and phonemic databases. First results indicate that the rule-based and the training-based methods generate comparable speech rhythm if the databases are uniform. The stepwise duration control on several prosodic levels shows promise as a method of producing a flexible rhythm depending on the specific TTS application.

SL980855.PDF (From Author) SL980855.PDF (Rasterized)

EGG Model of Ditoneme in Mandarin

Authors:

Jiangping Kong, EE Dept. of City University of Hong Kong (China)

Page (NA) Paper number 104

Abstract:

This paper concerns the study of an EGG (electroglottogram, recorded with a laryngograph) model of the ditoneme in Mandarin. The parameters for establishing the models are fundamental frequency (F0), which is taken as a reference, together with the speed quotient and open quotient, all extracted from the EGG signal using the software EGG.exe, an option of the KAY CSL Model 4300B. The results show that speed quotient and open quotient have close relationships with F0 in different ditonemes. In general, speed quotient and open quotient decrease when F0 increases in sustained vowels, but in ditonemes they behave differently according to position and environment. The conclusion is that EGG models of ditonemes are composed of the patterns of F0, speed quotient and open quotient in Mandarin.

SL980104.PDF (From Author) SL980104.PDF (Rasterized)

Temporal Organization of Speech for Normal and Fast Rates

Authors:

Geetha Krishnan, Carnegie Mellon University (USA)
Wayne Ward, Carnegie Mellon University (USA)

Page (NA) Paper number 930

Abstract:

In this study, predictors of speech rate that are sensitive to local and global rate changes, and relevant to different types of speakers, were examined. Two groups of subjects, normal and disfluent speakers (whose speech was clinically rated as "slow"), provided speech samples at normal and fast rates. Samples were segmented into interstress intervals (ISIs) of varying length (i.e., varying numbers of syllables). The compressibility of components within ISIs of varying length provided information on local rate control strategies. The fast speech samples were useful for examining strategies used in global rate increases. Stressed vowels and intervowel intervals (IVIs) showed similar trends in compression for both speaker groups, for local and global rate increases. We then investigated two measures of speech rate based on intervowel intervals: the ratio measure (IVI/ISI) and the average IVI. A high correlation of average IVI with phone rate was found. Results of speech rate estimations are presented.
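
The two interval-based measures can be sketched as follows; the exact interval definitions (onset-to-onset differences) are assumptions read off the abstract:

```python
def speech_rate_measures(vowel_times, stress_times):
    """Average IVI and the IVI/ISI ratio measure from event times.

    vowel_times: onset times of successive vowels (defines the
    intervowel intervals, IVIs); stress_times: onset times of the
    stressed vowels bounding the interstress intervals (ISIs).
    Returns (average IVI, mean IVI / mean ISI).
    """
    ivis = [b - a for a, b in zip(vowel_times, vowel_times[1:])]
    isis = [b - a for a, b in zip(stress_times, stress_times[1:])]
    avg_ivi = sum(ivis) / len(ivis)
    ratio = avg_ivi / (sum(isis) / len(isis))
    return avg_ivi, ratio
```

As speech rate rises, vowel onsets pack closer together, so the average IVI falls; the ratio expresses that packing relative to the stress-beat spacing.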

SL980930.PDF (From Author) SL980930.PDF (Rasterized)

A Syllable-based Generalization of Japanese Accentuation

Authors:

Haruo Kubozono, Kobe University (Japan)

Page (NA) Paper number 105

Abstract:

One of the major findings of the recent linguistic research on Japanese is that the syllable plays a pivotal role in a variety of phonological and morphological phenomena in the mora-based prosodic system of this language. This paper attempts to reinforce this argument by proposing a significant generalization of Japanese accentuation in terms of `syllable weight', an idea that each syllable has a certain weight according to its phonological configuration. Specifically, this analysis reveals that Japanese accentuation is strikingly similar to that of Latin and many languages with a Latin-type accent system, e.g. English. Moreover, a sociolinguistic analysis of the accentual changes currently in progress demonstrates that Japanese accentuation is becoming increasingly similar to the Latin-type accent system, where the syllable plays a primary role.

SL980105.PDF (From Author) SL980105.PDF (Rasterized)

Non-Adjacent Segmental Effects in Tonal Realization of Accentual Phrase in Seoul Korean

Authors:

Hyuck-Joon Lee, UCLA (USA)

Page (NA) Paper number 903

Abstract:

This paper investigates the degree to which an onset consonant of an accentual phrase affects the F0 of the following syllables within the phrase in Seoul Korean. Korean tense or aspirated onset consonants raise the F0 values of the following adjacent vowel, and when they are positioned in the onset of the first syllable of an accentual phrase, they continuously raise the F0 values of the following non-adjacent vowels. This F0 raising after aspirated or tense consonants supports the previous claim that microprosody in Korean is phonologized in phrase-initial position. The results also confirm the previous claims regarding the location of the underlying four tones of the accentual phrase and the interpolation hypothesis.

SL980903.PDF (From Author) SL980903.PDF (Rasterized)

Improvement on Connected Numbers Recognition Using Prosodic Information

Authors:

Eduardo López, ETSIT-UPM (Spain)
Javier Caminero, Telefonica I+D (Spain)
Ismael Cortázar, Telefonica I+D (Spain)
Luis A. Hernández, ETSIT-UPM (Spain)

Page (NA) Paper number 353

Abstract:

In this paper we propose a strategy to improve the performance of a connected-number recognition system in Spanish using prosodic information. Prosodic information is included as the detection of pitch movements between what some studies of intonation in Spanish call melodic units. The basic linguistic background of our approach, together with the specific strategies to detect and correct ambiguities and recognition errors, is discussed. Experimental results show a 16% reduction in recognition errors for our state-of-the-art connected-number recognizer, and the possibility of resolving ambiguities that the recognizer alone cannot handle.

SL980353.PDF (From Author) SL980353.PDF (Rasterized)

Phonetic Investigation of Boundary Pitch Movements in Japanese

Authors:

Kazuaki Maeda, Univ of Pennsylvania and Bell Labs (USA)
Jennifer J. Venditti, Bell Labs and Ohio State Univ (USA)

Page (NA) Paper number 800

Abstract:

Pitch movements at the boundaries of sentence-medial and final phrases in Japanese can provide a cue to the speaker's intention. For example, the phrase /Nagano-de/ 'in Nagano' can be uttered with different rising and/or falling pitch movements on the final mora /de/ to convey meanings such as clarification, incredulity, prominence in the discourse, insistence, etc. The identification of these movements is important not only for spoken language understanding systems, but also for natural-sounding speech synthesis. The current study examines the F0 shape, height, and alignment characteristics of four distinct sentence-medial boundary rises. We compare these types with accented and focused unaccented words containing identical phonetic segments, and discuss a number of different possible phonological analyses of the pitch movements.

SL980800.PDF (From Author) SL980800.PDF (Rasterized)

Phonetic and Phonological Characteristics of Paralinguistic Information in Spoken Japanese

Authors:

Kikuo Maekawa, The National Language Research Institute (Japan)

Page (NA) Paper number 997

Abstract:

Three Japanese sentences were uttered repeatedly by three speakers with paralinguistic information indicating A(dmiration), D(isappointment), F(ocus), I(ndifference), S(uspicion), and N(eutral). A perception test using all recorded materials revealed that the average correct perception rate was higher than 80 percent. Acoustic analyses revealed the following phonetic characteristics:
- Considerable elongation of utterance duration in types A, D, and S.
- The first and last morae were more elastic in duration than the others.
- Narrowed pitch range in type D and enlarged pitch range in A, F, and S.
- Characteristic low pitch at the beginning of types A, D, and S.
- Delayed accentual peak location in types A, D, and S.
- Systematic distributional difference of sentence-final vowels on the F1-F2 formant plane.
- Seeming 'laryngealization' in the initial low-pitched portion of types S, A, and D.

SL980997.PDF (From Author) SL980997.PDF (Rasterized)

0997_01.WAV
(was: 0997.wav)
Typical utterances of sentence 1) uttered with paralinguistic information types A, D, F, I, N, and S, plus an utterance of sentence 3) type N.
File type: Sound File
Format: Sound File: WAV
Tech. description: 11025Hz-16bit sampling, mono
Creating Application: Creative SoundStudio
Creating OS: Win95
0997_02.PDF
(was: 0997.gif)
Instructions given to the speakers at the time of recording and also to the subjects of the perception test.
File type: Image File
Format: Image: GIF
Tech. description: None
Creating Application: Imagetool on Solaris
Creating OS: Solaris 2.6

ToBI Accent Type Recognition

Authors:

Arman Maghbouleh, Stanford University (USA)

Page (NA) Paper number 632

Abstract:

This paper describes work in progress for recognizing a subset of ToBI intonation labels (H*, L+H*, L*, !H*, L+!H*, no accent). Initially, duration characteristics are used to classify syllables as accented or not. The accented syllables are then subclassified based on fundamental frequency, F0, values. Potential F0 intonation gestures are schematized by connected line segments within a window around a given syllable. The schematizations are found using spline-basis linear regression. The regression weights on F0 points are varied in order to discount segmental effects and F0 detection errors. Parameters based on the line segments are then used to perform the subclassification. This paper presents new results in recognizing L*, L+H*, and L+!H* accents. In addition, the models presented here perform comparably (80% overall, and 74% accent type recognition) to models which do not distinguish bitonal accents.
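
Schematizing an F0 window by connected line segments via spline-basis linear regression can be sketched with a truncated-linear (hinge) basis; the knot placement and the optional per-sample weighting (used in the paper to discount segmental effects and F0 tracking errors) are assumptions:

```python
import numpy as np

def piecewise_linear_fit(t, f0, knots, weights=None):
    """Least-squares fit of connected line segments to F0 samples.

    Basis: [1, t, max(t - k, 0) for each knot k], so the fitted curve
    is continuous and changes slope only at the knots. `weights`, if
    given, down-weights unreliable F0 samples.
    Returns the coefficients and the fitted contour.
    """
    t = np.asarray(t, dtype=float)
    X = np.column_stack([np.ones_like(t), t] +
                        [np.maximum(t - k, 0.0) for k in knots])
    y = np.asarray(f0, dtype=float)
    if weights is not None:
        w = np.sqrt(np.asarray(weights, dtype=float))
        coef, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    else:
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef
```

Slope changes at the knots (the hinge coefficients) then serve as parameters for the accent-type subclassification.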

SL980632.PDF (From Author) SL980632.PDF (Rasterized)

The Influence of Syllable Structure on the Timing of Intonational Events in German

Authors:

Hansjörg Mixdorff, Dresden University of Technology (Germany)
Hiroya Fujisaki, Science University of Tokyo (Japan)

Page (NA) Paper number 707

Abstract:

In earlier studies the authors developed a model of German intonation based on the quantitative Fujisaki-model. The present study examines the influence of the segmental structure of an accented syllable on the timing of accent commands. It aims at developing refined alignment rules for speech synthesis. The corpus consists of 67 three-syllable words of German with word-accent on the second syllable produced by three male speakers three times. The words differ as to the structure of the second syllable. It was observed that onsets of accent commands can be most accurately predicted relative to the duration of the nuclear vowel, with variations depending on the type of consonant preceding the vowel. Accent command offsets are generally aligned with the offset of the syllable. The effectiveness of the refined timing rules was confirmed by an informal perception experiment.

SL980707.PDF (From Author) SL980707.PDF (Rasterized)

New Prosodic Control Rules For Expressive Synthetic Speech

Authors:

Osamu Mizuno, NTT Human Interface Labs. (Japan)
Shin'ya Nakajima, NTT Human Interface Labs. (Japan)

Page (NA) Paper number 1014

Abstract:

This paper proposes new prosodic feature control rules for constructing semantic prosody control. Research into mental-state tendencies was conducted using perception tests that examined subjects' sensitivity to the control of synthetic speech prosody. The results revealed relationships between prosodic control rules and non-verbal expressions: duration control reflects the information-processing state in spoken dialogues; sentence-final pitch contour control reflects the reliability of the information; pitch-contour dynamic-range control indicates the speaker's excitement; and the pitch contour from onset to peak indicates the speaker's demand for attention. Furthermore, for the Multi-layered Speech/Sound Synthesis Control Language (MSCL), we construct prosodic feature control commands using these prosodic control rules, and semantic control commands using these relationships. MSCL realizes expressive synthetic speech.

SL981014.PDF (From Author) SL981014.PDF (Rasterized)

TOP


The Use of F0 Reliability Function for Prosodic Command Analysis on F0 Contour Generation Model

Authors:

Mitsuru Nakai, JAIST (Japan)
Hiroshi Shimodaira, JAIST (Japan)

Page (NA) Paper number 998

Abstract:

This paper describes a method of utilizing an ``F0 Reliability Field'' (FRF), proposed in our previous work, for estimating the prosodic commands of an F0 contour generation model. The FRF is a time-frequency representation of F0 likelihood; its advantage is that it sidesteps the F0 errors that arise during automatic F0 determination. The FRF can therefore be a more useful feature for automatic prosody analysis than the F0 contour itself, and our previous paper reported its validity for detecting prosodic boundaries in Japanese continuous speech. In this paper, we examine its validity for estimating the prosodic commands of the superpositional model. Experimental results show that the accuracy of command estimation with the FRF is good, close to that obtained with an ideal F0 contour containing no F0 errors.

SL980998.PDF (From Author) SL980998.PDF (Rasterized)

TOP


Analysis of Effects of Lexical Accent, Syntax, and Global Speech Rate upon the Local Speech Rate

Authors:

Sumio Ohno, Department of Applied Electronics, Science University of Tokyo (Japan)
Hiroya Fujisaki, Department of Applied Electronics, Science University of Tokyo (Japan)
Hideyuki Taguchi, Department of Applied Electronics, Science University of Tokyo (Japan)

Page (NA) Paper number 935

Abstract:

Speech rate is one of the important prosodic parameters for the naturalness and intelligibility of an utterance. On the basis of the authors' definition of the relative local speech rate, the present paper describes an analysis of the effects of changes in global speech rate, syntactic constituency and lexical accent on the local speech rate, using short utterances in which these factors are systematically controlled. Preliminary results indicate that the span of changes in local speech rate is the syllable rather than the mora, and also show interactions between these factors.

SL980935.PDF (From Author) SL980935.PDF (Rasterized)

TOP


On the Effects of Speech Rate upon Parameters of the Command-Response Model for the Fundamental Frequency Contours of Speech

Authors:

Sumio Ohno, Department of Applied Electronics, Science University of Tokyo (Japan)
Hiroya Fujisaki, Department of Applied Electronics, Science University of Tokyo (Japan)
Yoshikazu Hara, Department of Applied Electronics, Science University of Tokyo (Japan)

Page (NA) Paper number 936

Abstract:

A command-response model for the process of F0 contour generation has been presented by Fujisaki and his coworkers. The present paper describes the results of a study on the variability and speech rate dependency of the model's parameters in utterances of a speaker of Japanese. It was found that parameters alpha and beta can be considered to be practically constant at a given speech rate, while Fb may vary slightly from utterance to utterance. Among these three parameters, only alpha was found to have a small but systematic tendency to increase with the speech rate.
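The command-response model sums phrase and accent contributions on a logarithmic baseline Fb, with alpha and beta the time constants of the phrase and accent control mechanisms. A minimal sketch of the model equations (the default constants and any command timings are illustrative, not the fitted values from this paper):

```python
import math

def phrase_comp(t, alpha=3.0):
    """Phrase component Gp(t): impulse response of the phrase control mechanism."""
    return alpha ** 2 * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_comp(t, beta=20.0, gamma=0.9):
    """Accent component Ga(t): step response of the accent control mechanism,
    ceiling-limited at gamma."""
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma) if t >= 0 else 0.0

def f0_contour(t, Fb, phrase_cmds, accent_cmds, alpha=3.0, beta=20.0):
    """ln F0(t) = ln Fb + sum of phrase responses + sum of accent responses.
    phrase_cmds: list of (onset T0, magnitude Ap);
    accent_cmds: list of (onset T1, offset T2, amplitude Aa)."""
    ln_f0 = math.log(Fb)
    for T0, Ap in phrase_cmds:
        ln_f0 += Ap * phrase_comp(t - T0, alpha)
    for T1, T2, Aa in accent_cmds:
        ln_f0 += Aa * (accent_comp(t - T1, beta) - accent_comp(t - T2, beta))
    return math.exp(ln_f0)
```

Before any command onset the contour sits at the baseline Fb; the paper's finding is that alpha, beta and Fb stay nearly constant within a speech rate, with only alpha drifting systematically as the rate increases.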

SL980936.PDF (From Author) SL980936.PDF (Rasterized)

TOP


The Maximum-Based Description of F0 Contours and its Application to English

Authors:

Thomas Portele, IKP, University of Bonn (Germany)
Barbara Heuft, IKP, University of Bonn - now: Philips Speech Processing, Aachen (Germany)

Page (NA) Paper number 526

Abstract:

The maximum-based description is a simple, linear parametrization method for F0 contours. An F0 maximum is characterized by four parameters: its position, its height, and its left and right slopes. An automatic parametrization algorithm was developed, and a perceptual evaluation was carried out for German and for English. Perceptual equality between original and parametrized contours was confirmed.
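
One plausible way to turn the four-parameter description back into a contour: each maximum contributes a tent shape, and the contour is the upper envelope of all tents above a floor. The envelope-and-floor reconstruction and the floor value are our assumptions, not the paper's exact resynthesis procedure:

```python
def f0_at(t, maxima, floor=80.0):
    """F0 value at time t reconstructed from maximum descriptions.
    maxima: list of (position, height, left_slope, right_slope), with slopes
    given as positive magnitudes.  Each maximum contributes a tent shape;
    the contour is the upper envelope of all tents, clipped below at `floor`."""
    vals = [floor]
    for pos, height, lslope, rslope in maxima:
        if t <= pos:
            vals.append(height - lslope * (pos - t))
        else:
            vals.append(height - rslope * (t - pos))
    return max(vals)
```

At each maximum's position the reconstruction returns exactly its height, and far from all maxima it falls back to the floor.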

SL980526.PDF (From Author) SL980526.PDF (Rasterized)

TOP


Perceived Prominence and Acoustic Parameters in American English

Authors:

Thomas Portele, IKP, University of Bonn (Germany)

Page (NA) Paper number 527

Abstract:

This paper describes the relationships between perceived prominence as a gradual value and some acoustic-prosodic parameters. Prominence is used as an intermediate parameter in a speech synthesis system. A corpus of American English utterances was constructed by measuring and annotating various linguistic, acoustic and perceptual parameters and features. The investigation of the corpus revealed some strong and some rather weak relations between prominence and acoustic-prosodic parameters that serve as a starting point for the development of prominence-based rules for the synthesis of American English prosody in a content-to-speech system.

SL980527.PDF (From Author) SL980527.PDF (Rasterized)

TOP


Generating Emotional Speech with a Concatenative Synthesizer

Authors:

Erhard Rank, Austrian Research Institute for Artificial Intelligence (Austria)
Hannes Pirker, Austrian Research Institute for Artificial Intelligence (Austria)

Page (NA) Paper number 975

Abstract:

We describe an attempt to synthesize emotional speech with a concatenative speech synthesizer using a parameter space covering not only F0, duration and amplitude, but also voice quality parameters: spectral energy distribution, harmonics-to-noise ratio, and articulatory precision. This extended parameter set offers the possibility of combining the high segmental quality of concatenative synthesis with the wider range of control settings needed for the synthesis of natural affective speech.

SL980975.PDF (From Author) SL980975.PDF (Rasterized)

TOP


A Perceptive Measure of Pure Prosody Linguistic Functions with Reiterant Sentences

Authors:

Albert Rilliard, Institut de la Communication Parlee (France)
Véronique Aubergé, Institut de la Communication Parlee (France)

Page (NA) Paper number 1086

Abstract:

We present a perceptual experiment measuring the linguistic segmentation cues carried by prosody. The chosen paradigm is a dissociation test between pairs of stimuli. The sentences contain several segmentation variants at the group and clause levels, and pairs are formed from all possible combinations of two sentences from the corpus. Twenty listeners were able to associate both the similar regions and the syntactic-level boundaries. They tolerate a shift of a single syllable in the position of the major syntactic boundary in the reiterant stimuli, and they distinguish pairs that do not share the same boundaries. The results also show that listeners are puzzled (responding at chance) when the proposed boundaries delimit complex segments.

SL981086.PDF (From Author) SL981086.PDF (Rasterized)

TOP


Prosodic Parameters in Emotional Speech

Authors:

Kazuhito Koike, Keio University (Japan)
Hirotaka Suzuki, Keio University (Japan)
Hiroaki Saito, Keio University (Japan)

Page (NA) Paper number 996

Abstract:

The importance of speech prosody is growing as spontaneous interaction between humans and machines is increasingly demanded. This paper examines how prosody contributes emotion to speech. The major elements determining emotion are pitch, tempo, and stress; the latter two correspond to the duration and power of syllables, respectively. We chose five emotions to test: anger, surprise, sorrow, hate, and joy. To verify our analysis, we implemented a speech synthesis module that can easily control the prosodic parameters of the output speech. Responses to the synthesized speech confirm the parameters for anger, sorrow and hate at over 85%. The experimental results also suggest that surprise and joy depend less on prosody.

SL980996.PDF (From Author) SL980996.PDF (Rasterized)

TOP


Automatic Detection of Prominence (as Defined by Listeners' Judgements) in Read Aloud Dutch Sentences

Authors:

Barbertje M. Streefkerk, Institute of Phonetic Sciences, Amsterdam (The Netherlands)
Louis C.W. Pols, Institute of Phonetic Sciences, Amsterdam (The Netherlands)
Louis F.M. ten Bosch, Lernout & Hauspie Speech Products N.V. (Belgium)

Page (NA) Paper number 285

Abstract:

This paper describes a first step towards the automatic classification of prominence (as defined by naive listeners). In a listening experiment, each word in 500 sentences was rated on a scale from '0' (non-prominent) to '10' (very prominent). These prominence labels are compared with the following acoustic features: the loudness of each vowel, and the F0 range and duration of each syllable. A linear relationship between the prominence ratings and these acoustic features is found, and the features are then used for a preliminary automatic classifier that predicts prominence.
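
The linear relationship reported above can be sketched as an ordinary least-squares fit; the feature values and ratings below are invented toy data, not the paper's corpus:

```python
import numpy as np

# Invented toy data: rows are words, columns are
# [vowel loudness, F0 range (Hz), syllable duration (ms)].
X = np.array([
    [12.0, 40.0, 180.0],
    [ 6.0, 10.0, 120.0],
    [15.0, 55.0, 210.0],
    [ 8.0, 20.0, 150.0],
    [10.0, 35.0, 170.0],
])
y = np.array([8.0, 2.0, 9.0, 3.0, 6.0])  # mean listener ratings (0-10)

Xb = np.hstack([X, np.ones((len(X), 1))])   # append an intercept column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # ordinary least-squares fit

def predict_prominence(loudness, f0_range, duration):
    """Predicted prominence rating from the fitted linear model."""
    return float(np.array([loudness, f0_range, duration, 1.0]) @ w)
```

Thresholding the predicted rating then gives the kind of preliminary prominent/non-prominent classifier the abstract describes.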

SL980285.PDF (From Author) SL980285.PDF (Rasterized)

TOP


A Schema for Illocutionary Act Identification With Prosodic Feature

Authors:

Masafumi Tamoto, NTT Basic Research Laboratories (Japan)
Takeshi Kawabata, NTT Basic Research Laboratories (Japan)

Page (NA) Paper number 1099

Abstract:

We propose a new discrimination schema for illocutionary acts using prosodic features, based on experimental results. Given the transcribed sentence with contextual information, subjects correctly identified the sentence type of 85% of 290 sentences; with added information about intonation contour type, they correctly identified 90% of illocutionary acts. We find evidence that illocutionary acts can be signaled by specific contour types, realized in the sentence-final boundary tone: a neutral or falling tone for assertion and request, a rising tone for question. An intonation contour is identified by an algorithm that calculates the range and slope of the upper and lower bounds of the unwarped segmental contour and matches these against predefined contour templates. With this automated intonation contour classification, nearly 90% of illocutionary acts could be correctly identified. (URL: http://www.brl.ntt.co.jp/info/dug/)
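
A much-simplified sketch of slope-based boundary-tone template matching; the slope thresholds and template names are our assumptions, and the paper's algorithm additionally uses the range and both envelope bounds:

```python
import numpy as np

# Hypothetical contour templates: (name, minimum slope, maximum slope).
TEMPLATES = [
    ("falling", -np.inf, -0.5),  # assertion / request
    ("neutral", -0.5, 0.5),      # assertion / request
    ("rising", 0.5, np.inf),     # question
]

def boundary_tone(f0_final, templates=TEMPLATES):
    """Classify the sentence-final F0 stretch by its least-squares line slope."""
    t = np.arange(len(f0_final), dtype=float)
    slope = np.polyfit(t, np.asarray(f0_final, dtype=float), 1)[0]
    for name, lo, hi in templates:
        if lo <= slope < hi:
            return name
    return "unknown"
```

A rising final stretch maps to "question", while flat or falling stretches map to the assertion/request tones, mirroring the contour-to-act mapping in the abstract.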

SL981099.PDF (From Author) SL981099.PDF (Rasterized)

TOP


An Algorithm for Choosing Japanese Acknowledgments using Prosodic Cues and Context

Authors:

Wataru Tsukahara, Mech-Info Engineering, University of Tokyo (Japan)

Page (NA) Paper number 955

Abstract:

In human dialog a wide variety of acknowledgments are used. One function of this seems to be indicating attention, interest, and involvement to the other speaker, and we believe this is an important factor in encouraging him and keeping up his interest. Thus, in this paper we focus on the problem of choosing appropriate acknowledgments at each point. Based on study of Japanese memory game dialogs, we propose an algorithm for choosing among acknowledgment responses, including `hai' (yes), `so' (right), and `un' (mm). The primary factors involved are aspects of the speaker's internal state, including confidence and liveliness, as inferred from the context and the speaker's prosody. Evaluation of naturalness and helpfulness of dialog generated by this algorithm suggests that judges prefer rule-based responses to randomly chosen responses, confirming our hypothesis that `sensitive' and subtle choice of response may improve helpfulness and naturalness of man-machine spoken language interaction.

SL980955.PDF (From Author) SL980955.PDF (Rasterized)

TOP


A Study of Tones and Tempo in Continuous Mandarin Digit Strings and their Application in Telephone Quality Speech Recognition

Authors:

Chao Wang, MIT Laboratory for Computer Science (USA)
Stephanie Seneff, MIT Laboratory for Computer Science (USA)

Page (NA) Paper number 535

Abstract:

Prosodic cues (namely, fundamental frequency, energy and duration) provide important information for speech. For a tonal language such as Chinese, fundamental frequency (F0) plays a critical role in characterizing tone as well, which is an essential phonemic feature. In this paper, we describe our work on duration and tone modeling for telephone-quality continuous Mandarin digits, and the application of these models to improve recognition. The duration modeling includes a speaking-rate normalization scheme. A novel F0 extraction algorithm is developed, and parameters based on orthonormal decomposition of the F0 contour are extracted for tone recognition. Context dependency is expressed by ``tri-tone'' models clustered into broad classes. A 20.0% error rate is achieved for four-tone classification. Over a baseline recognition performance of 5.1% word error rate, we achieve 31.4% error reduction with duration models, 23.5% error reduction with tone models, and 39.2% error reduction with duration and tone models combined.
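Orthonormal decomposition of an F0 contour can be sketched as projection onto an orthonormal polynomial basis; the particular basis (QR-orthonormalized powers of normalized time) and the order are our assumptions, not necessarily the paper's choice:

```python
import numpy as np

def tone_features(f0_frames, order=3):
    """Project an F0 contour onto an orthonormal polynomial basis.
    Returns order+1 coefficients roughly describing mean, slope, curvature, ..."""
    n = len(f0_frames)
    t = np.linspace(-1.0, 1.0, n)
    V = np.vander(t, order + 1, increasing=True)  # columns 1, t, t^2, t^3
    Q, _ = np.linalg.qr(V)                        # orthonormalize the columns
    return Q.T @ np.asarray(f0_frames, dtype=float)
```

Because the basis is orthonormal, a flat contour loads only on the first coefficient, so the coefficients separate tone shape from tone height in a compact feature vector.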

SL980535.PDF (From Author) SL980535.PDF (Rasterized)

TOP


Simulated Emotions: an Acoustic Study of Voice and Perturbation Measures

Authors:

Sandra P. Whiteside, University of Sheffield (U.K.)

Page (NA) Paper number 153

Abstract:

This brief study presents a set of acoustic correlates for a number of vocal emotions simulated by two actors. Five short sentences were used in the simulations. The emotions simulated were neutral, cold anger, hot anger, happiness, sadness, interest and elation. The seventy sentences were digitised and a number of acoustic analyses, including several perturbation measures, were carried out. The acoustic parameters investigated were: i) mean overall fundamental frequency (Hz); ii) overall mean energy (dB); iii) overall mean standard deviation of energy (dB); iv) mean overall jitter (%); and v) mean overall shimmer (dB). These parameters were used to profile the vocal emotions. Results showed that the actors displayed similarities in their acoustic profiles for some emotions, such as anger and sadness. The results are presented and discussed in brief.
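
The perturbation measures above are standard; a minimal sketch of local jitter (%) and shimmer (dB) computed from extracted pitch periods and peak amplitudes (straightforward textbook definitions, which may differ in detail from the analysis software used in the paper):

```python
import math

def jitter_percent(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, as a percentage of the mean period."""
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))

def shimmer_db(amplitudes):
    """Local shimmer: mean absolute dB ratio between consecutive peak amplitudes."""
    ratios = [abs(20.0 * math.log10(b / a)) for a, b in zip(amplitudes, amplitudes[1:])]
    return sum(ratios) / len(ratios)
```

A perfectly periodic, constant-amplitude voice yields zero on both measures; cycle-to-cycle irregularity raises them.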

SL980153.PDF (From Author) SL980153.PDF (Rasterized)

TOP


A Robust Tone Recognition Method of Chinese Based on Sub-syllabic F0 Contours

Authors:

Jin-song Zhang, University of Tokyo (Japan)
Keikichi Hirose, University of Tokyo (Japan)

Page (NA) Paper number 674

Abstract:

This paper proposes a scheme that uses the F0 contours of vowel nuclei to discriminate Chinese lexical tones. The authors suggest that the F0 contour fragment of a syllable's vowel nucleus contributes most to the perception of its tone. To correlate the F0 contour with the phonemic components of a syllable, a tone-based syllabic structure is also proposed. Tone recognition experiments on a speaker-independent disyllabic-word task proved the effectiveness of the proposed method. Its better performance than approaches observing full syllabic F0 contours indicates that the proposed method is more robust.

SL980674.PDF (From Author) SL980674.PDF (Rasterized)

TOP


The Microprosodics of Tone Sandhi in Shanghai Disyllabic Compounds

Authors:

Xiaonong Sean Zhu, ANU (Australia)

Page (NA) Paper number 423

Abstract:

This paper examines the F0 variations during tone sandhi due to various prosodic factors such as phonation type, length, stress and pitch height. It will be shown that the F0 height and shape of the second syllable (S2) in disyllabic words are determined by the interaction of four conditions: the intervocalic consonant (C2) voicing, the S2 Truncation, the F0 height of S1, and stress assignment.

SL980423.PDF (From Author) SL980423.PDF (Rasterized)

TOP


Jitter And Shimmer Differences Between Pathological Voices Of School Children

Authors:

Natalija Bolfan-Stosic, Acoustic Laboratory for Speech and Hearing, University of Zagreb (Croatia)
Tatjana Prizl, Acoustic Laboratory for Speech and Hearing, University of Zagreb (Croatia)

Page (NA) Paper number 103

Abstract:

A study was undertaken to determine differences in jitter and shimmer between the voices of children with different syndromes. The voices of 60 children of both sexes, aged 7-12 years, were analysed with the EZ Voice Analysis software (a program for measuring jitter and shimmer). The main purpose of the study is diagnostic: the results show which acoustic indicators of pathological voice occur in each group of children, and in what form they appear. In this way, we try to find the simplest way to characterize the acoustic properties of different voice pathologies as an aid to diagnosis. The results indicate that children with stuttering and dysarthric symptoms have higher values on almost all measured variables than the averages of the other groups. Children with Down syndrome and hearing losses exhibited the most disordered voice quality. Finally, the mixed group (stuttering with dysphonia) and the group of children with dysphonia exhibited the least pathological voice characteristics. Analysis of variance showed statistically significant differences among the groups on all measured variables.

SL980103.PDF (From Author) SL980103.PDF (Rasterized)

TOP