Prosody and Emotion 5

The Tilt Intonation Model

Authors:

Paul Taylor, University of Edinburgh (U.K.)

Page (NA) Paper number 827

Abstract:

The tilt intonation model facilitates automatic analysis and synthesis of intonation. The analysis algorithm detects intonational events in F0 contours and parameterises them in terms of continuously varying parameters. We describe the analysis system and give results for speaker-independent spontaneous dialogue speech. We then describe a synthesis algorithm which can generate F0 contours given a tilt parameterisation of an utterance. We give results showing how well the automatically produced contours match natural ones. The paper concludes with a discussion of the linguistic relevance of the tilt parameters and shows that they are both a useful and a natural way of representing intonation.
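
As a rough illustration of the parameterisation described in the abstract, the sketch below computes the standard tilt parameters (event amplitude, event duration and tilt) from the rise and fall components of a single intonational event; the function name and the zero-division guards are ours, and the paper should be consulted for the definitive formulation.

def tilt_parameters(rise_amp, fall_amp, rise_dur, fall_dur):
    """Tilt parameters for one intonational event (sketch only).

    rise_amp, fall_amp: F0 excursions of the rise and fall parts (Hz or semitones).
    rise_dur, fall_dur: durations of the rise and fall parts (seconds).
    """
    amplitude = abs(rise_amp) + abs(fall_amp)            # overall event amplitude
    duration = rise_dur + fall_dur                       # overall event duration
    tilt_amp = (abs(rise_amp) - abs(fall_amp)) / amplitude if amplitude else 0.0
    tilt_dur = (rise_dur - fall_dur) / duration if duration else 0.0
    tilt = 0.5 * (tilt_amp + tilt_dur)                   # +1 = pure rise, -1 = pure fall
    return amplitude, duration, tilt

Under this formulation a pure rise gets tilt +1, a pure fall -1, and a symmetrical rise-fall 0.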

SL980827.PDF (From Author) SL980827.PDF (Rasterized)



Analysis of Occurrence of Pauses and Their Durations in Japanese Text Reading

Authors:

Hiroya Fujisaki, Department of Applied Electronics, Science University of Tokyo (Japan)
Sumio Ohno, Department of Applied Electronics, Science University of Tokyo (Japan)
Seiji Yamada, Department of Applied Electronics, Science University of Tokyo (Japan)

Page (NA) Paper number 831

Abstract:

Pauses play important roles in both the intelligibility and the naturalness of speech. Their occurrence and duration in text reading are influenced by the syntactic structure of the text as well as by the physiological constraints of respiration on the part of the speaker. The present paper describes some preliminary findings on Japanese text reading, especially on the effects of the syntactic role of the preceding phrase on the rate of occurrence and the duration of a pause at a syntactic boundary.
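
The abstract does not detail the authors' analysis procedure, but the kind of tabulation it describes, pause occurrence rate and mean pause duration broken down by the syntactic role of the preceding phrase, can be sketched as follows; the data layout and field names are hypothetical, not the authors' annotation scheme.

from collections import defaultdict

def pause_statistics(boundaries):
    """Tabulate pause occurrence rate and mean pause duration per
    syntactic role of the preceding phrase.

    `boundaries`: hypothetical list of (preceding_role, pause_dur) tuples,
    with pause_dur = 0.0 where no pause was produced at the boundary.
    """
    stats = defaultdict(lambda: {"n": 0, "paused": 0, "total_dur": 0.0})
    for role, dur in boundaries:
        s = stats[role]
        s["n"] += 1
        if dur > 0.0:
            s["paused"] += 1
            s["total_dur"] += dur
    return {
        role: {
            "occurrence_rate": s["paused"] / s["n"],
            "mean_duration": s["total_dur"] / s["paused"] if s["paused"] else 0.0,
        }
        for role, s in stats.items()
    }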

SL980831.PDF (From Author) SL980831.PDF (Rasterized)



A Statistical Study of Pitch Target Points in Five Languages

Authors:

Estelle Campione, Université de Provence (France)
Jean Véronis, Université de Provence (France)

Page (NA) Paper number 845

Abstract:

We present the results of a large-scale statistical study of pitch target points in five languages, on a corpus comprising 4 hours 20 minutes of speech and involving 50 different speakers. The entire corpus was stylized automatically by a technique that reduces the F0 contour to a series of target points representing the significant pitch changes. It was then entirely verified by experts using a resynthesis method, in order to ensure that there was no audible difference from the original. The set of ca. 50000 pitch target points thus obtained was then analyzed from a statistical point of view. In this paper we describe the main results of this study, in terms of the frequency distribution of target points, pitch movements, and the relation of pitch movements to time intervals. Our study reveals interesting differences across languages and between sexes.
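
The stylization technique itself is not specified in the abstract; purely as an illustration of the general idea, the sketch below reduces an F0 track to target points wherever the pitch has moved by more than a threshold since the last retained point. The greedy rule and the 1-semitone default are assumptions, not the procedure used in the study.

def stylize_f0(times, f0_semitones, threshold=1.0):
    """Keep a target point whenever F0 has moved by more than `threshold`
    semitones since the last retained point (illustrative greedy rule,
    not the stylization algorithm used in the paper)."""
    targets = [(times[0], f0_semitones[0])]
    for t, f in zip(times[1:], f0_semitones[1:]):
        if abs(f - targets[-1][1]) >= threshold:
            targets.append((t, f))
    if targets[-1][0] != times[-1]:
        targets.append((times[-1], f0_semitones[-1]))    # always keep the endpoint
    return targets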

SL980845.PDF (From Author) SL980845.PDF (Rasterized)



Fully Automatic Prosody Generator For Text-to-Speech

Authors:

Fabrice Malfrère, Faculté Polytechnique de Mons (Belgium)
Thierry Dutoit, Faculté Polytechnique de Mons (Belgium)
Piet Mertens, K.U. Leuven - Département de Linguistique (Belgium)

Page (NA) Paper number 355

Abstract:

Text-to-prosody systems based on prosodic databases extracted from natural speech will be a key component in the further development of new text-to-speech systems. This paper describes a system that uses such speech databases to generate the rhythm and the intonation of a written French text. The system is based on a very crude chinks 'n chunks prosodic phrasing algorithm and on a prosodic analysis of a natural speech database. The rhythm of the synthetic speech is generated with a CART tree trained on a large single-speaker speech corpus. The acoustic aspect of the intonation is derived from a set of prosodic patterns automatically extracted from the same corpus. The system has been tested on single sentences and news paragraphs. Informal listening tests have shown that the resulting prosody is convincing most of the time.
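
For readers unfamiliar with chinks 'n chunks phrasing, the sketch below shows the basic rule: a prosodic phrase is a run of function words ("chinks") followed by a run of content words ("chunks"), so a phrase break is inserted whenever a chink follows a chunk. The part-of-speech tag set is an assumption and is far cruder than what the paper's French front-end would use.

CHINK_TAGS = {"DET", "PREP", "CONJ", "PRON", "AUX"}      # illustrative function-word tags

def chinks_n_chunks(tagged_words):
    """Split a POS-tagged sentence into prosodic phrases: start a new
    phrase whenever a chink (function word) follows a chunk (content word)."""
    phrases, current, prev_was_chunk = [], [], False
    for word, tag in tagged_words:
        is_chink = tag in CHINK_TAGS
        if is_chink and prev_was_chunk and current:
            phrases.append(current)
            current = []
        current.append(word)
        prev_was_chunk = not is_chink
    if current:
        phrases.append(current)
    return phrases

On the example sentence from the sound file below, 'Le petit canard apprend à nager', this rule yields the two phrases 'Le petit canard apprend' and 'à nager'.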

SL980355.PDF (From Author) SL980355.PDF (Rasterized)

0355_01.WAV
(was: sound355_01.wav)
Synthesized French sentence: 'Le petit canard apprend à nager' (The little duck learns to swim).
File type: Sound File
Format: WAV
Tech. description: 16000 Hz, 16 bits/sample, mono, PCM, PC Windows WAV format
Creating Application: Unknown
Creating OS: Unknown



Automatic Prosodic Labeling of 6 Languages

Authors:

Halewijn Vereecken, ELIS, University of Ghent (Belgium)
Jean-Pierre Martens, ELIS, University of Ghent (Belgium)
Cynthia Grover, Lernout & Hauspie Speech Products NV (Belgium)
Justin Fackrell, Lernout & Hauspie Speech Products NV (Belgium)
Bert Van Coile, Lernout & Hauspie Speech Products NV (Belgium)

Page (NA) Paper number 45

Abstract:

This contribution describes a method for the automatic prosodic labeling of multilingual speech data. The automatic labeler assigns a boundary strength between 0 and 3 to each word boundary, and a word prominence between 0 and 9 to each word. The speech signal and its orthographic representation are first transformed into feature vectors comprising acoustic and linguistic features such as pitch, duration, energy, part-of-speech, punctuation, word frequency and stress. Next, the feature vectors are mapped to prosodic labels by a cascade of multi-layer perceptrons. Experiments on 6 different languages demonstrate that combining acoustic and linguistic features yields better performance than is obtainable from acoustic features alone. We also present experiments assessing the influence of the quality of the underlying phonetic segmentation and labeling on the prosodic labeling performance.
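
The exact cascade architecture is not given in the abstract; the following sketch shows one plausible two-stage arrangement, in which a first multi-layer perceptron predicts boundary strength from the combined acoustic and linguistic features and a second predicts word prominence from the same features plus the first network's output. The layout, layer sizes and use of scikit-learn are assumptions for illustration only.

import numpy as np
from sklearn.neural_network import MLPClassifier

def train_labeler_cascade(features, boundary_strength, prominence):
    """Two-stage MLP cascade (illustrative layout, not the paper's):
    stage 1 maps feature vectors to boundary strength (0-3),
    stage 2 maps features + stage-1 output to word prominence (0-9)."""
    boundary_net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
    boundary_net.fit(features, boundary_strength)

    stage2_input = np.hstack([features, boundary_net.predict(features).reshape(-1, 1)])
    prominence_net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
    prominence_net.fit(stage2_input, prominence)
    return boundary_net, prominence_net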

SL980045.PDF (From Author) SL980045.PDF (Rasterized)



Automatic Utterance Type Detection Using Suprasegmental Features

Authors:

Helen Wright, Centre for Speech Technology Research, University of Edinburgh (U.K.)

Page (NA) Paper number 575

Abstract:

The goal of the work presented here is to automatically predict the type of an utterance in spoken dialogue by using automatically extracted suprasegmental information. For this task we present and compare three stochastic algorithms: hidden Markov models, artificial neural nets, and classification and regression trees. These models are easily trainable, reasonably robust and fit into the probabilistic framework required for speech recognition. Utterance type detection is dependent on the assumption that different types of utterances have different suprasegmental characteristics. The categorisation of these utterance types is based on the theory of conversation games and consists of 12 move types (e.g. reply to a question, wh-question, acknowledgement). The system is speaker independent and is trained on spontaneous goal-directed dialogue collected from Canadian speakers. This utterance type detector is used in an automatic speech recognition system to reduce word error rate.
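
As a concrete example of the classification-tree variant mentioned in the abstract, the sketch below fits a CART-style tree over per-utterance suprasegmental feature vectors. The move-type labels follow the conversational-games literature but are listed here only for illustration, and the feature choice and tree settings are assumptions rather than the paper's configuration.

from sklearn.tree import DecisionTreeClassifier

MOVE_TYPES = ["instruct", "explain", "check", "align", "query-yn", "query-w",
              "acknowledge", "reply-y", "reply-n", "reply-w", "clarify", "ready"]
# 12 move types in the conversational-games tradition (illustrative list).

def train_move_classifier(supraseg_features, move_labels):
    """Fit a classification tree mapping per-utterance suprasegmental
    features (e.g. F0 mean and range, duration, energy, speaking rate)
    to one of the move types. Feature set and tree settings are
    illustrative assumptions."""
    cart = DecisionTreeClassifier(max_depth=8, min_samples_leaf=20)
    cart.fit(supraseg_features, move_labels)
    return cart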

SL980575.PDF (From Author) SL980575.PDF (Rasterized)
