Text-To-Speech Synthesis 3


The IBM Trainable Speech Synthesis System

Authors:

Robert E. Donovan, IBM TJ Watson Research Center (USA)
Ellen M. Eide, IBM TJ Watson Research Center (USA)

Page (NA) Paper number 166

Abstract:

The speech synthesis system described in this paper uses a set of speaker-dependent decision-tree state-clustered hidden Markov models to automatically generate a leaf level segmentation of a large single-speaker continuous-read-speech database. During synthesis, the phone sequence to be synthesised is converted to an acoustic leaf sequence by descending the HMM decision trees. Duration, energy and pitch values are predicted using separate trainable models. To determine the segment sequence to concatenate, a dynamic programming (d.p.) search is performed over all the waveform segments aligned to each leaf in training. The d.p. attempts to ensure that the selected segments join each other spectrally, and have durations, energies and pitches such that the amount of degradation introduced by the subsequent use of TD-PSOLA is minimised; the selected segments are concatenated and modified to have the required prosodic values using TD-PSOLA. The d.p. results in the system effectively selecting variable length units.
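The segment search described above can be sketched as a standard dynamic program. The sketch below is ours, not the authors' code: each leaf in the acoustic leaf sequence has several candidate waveform segments from training, and the search picks the sequence minimising a spectral join cost plus a target cost standing in for the degradation TD-PSOLA would introduce when forcing the predicted duration, energy and pitch. The function names and cost interfaces are assumptions for illustration.

```python
# Illustrative dynamic-programming segment selection (a sketch of the
# approach in the abstract, not the IBM implementation).

def select_segments(candidates, join_cost, target_cost):
    """candidates: list over leaves; candidates[i] is the list of
    waveform segments aligned to leaf i in training.
    join_cost(a, b): spectral mismatch of concatenating b after a.
    target_cost(seg, i): penalty for the prosodic modification needed
    to reach the values predicted for position i (large TD-PSOLA
    modifications degrade quality, so they cost more).
    Returns a minimum-cost segment sequence."""
    # best holds (cumulative cost, path) for each candidate of the
    # current leaf.
    best = [(target_cost(s, 0), [s]) for s in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for s in candidates[i]:
            # Cheapest way to reach segment s from any previous segment.
            cost, path = min((c + join_cost(p[-1], s), p) for c, p in best)
            new_best.append((cost + target_cost(s, i), path + [s]))
        best = new_best
    return min(best)[1]
```

Because adjacent segments from the same training utterance typically have near-zero join cost, a search of this shape naturally strings consecutive segments together, which is how the system ends up effectively selecting variable-length units.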

SL980166.PDF (From Author) SL980166.PDF (Rasterized)

0166_01.WAV
(was: 0166.WAV)
0166.WAV is the synthetic sentence "When a sailor in a small craft faces the might of the vast Atlantic Ocean today, he takes the same risks as generations took before him."
File type: Sound File
Format: Sound File: WAV
Tech. description: 16 kHz, 16 bits per sample, mono, PCM.
Creating Application: Unknown
Creating OS: Unknown



ProSynth: An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis

Authors:

Sarah Hawkins, University of Cambridge (U.K.)
Jill House, University College London (U.K.)
Mark Huckvale, University College London (U.K.)
John Local, University of York (U.K.)
Richard Ogden, University of York (U.K.)

Page (NA) Paper number 538

Abstract:

This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoustic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by structural richness produces a perceptually robust signal intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech, segmentally, temporally and intonationally.

SL980538.PDF (From Author) SL980538.PDF (Rasterized)



Total Quality Evaluation of Speech Synthesis Systems

Authors:

Jialu Zhang, Institute of Acoustics, Academia Sinica (China)
Shiwei Dong, Institute of Acoustics, Academia Sinica (China)
Ge Yu, Institute of Acoustics, Academia Sinica (China)

Page (NA) Paper number 60

Abstract:

Based on the performance assessment of speech synthesis systems for Chinese, total quality evaluation has been carried out regularly since 1994. The total quality evaluation includes speech intelligibility tests at different levels (syllable, word and sentence), a speech naturalness test and an anti-interference ability test for the phonetic module, and a text processing ability test for the linguistic module. The design principles of the testing materials and the testing methods are briefly described, and the test results of four text-to-speech (TTS) systems for Chinese are presented in this paper. It is shown that: (1) the professional technicians who joined the testing crew for the speech naturalness test overestimated the overall quality of the four tested systems; (2) word intelligibility and Semantically Unpredictable Sentence (SUS) scores are good measures for evaluating speech synthesis systems; (3) the anti-interference ability of synthetic speech is rather weak, with syllable intelligibility about 20 per cent lower than that of natural speech at S/N = 5 dB.

SL980060.PDF (From Author) SL980060.PDF (Rasterized)



Comparative Evaluation of Synthetic Prosody with the PURR Method

Authors:

Gerit P. Sonntag, Institut für Kommunikationsforschung und Phonetik (IKP), Universität Bonn (Germany)
Thomas Portele, Institut für Kommunikationsforschung und Phonetik (IKP), Universität Bonn (Germany)

Page (NA) Paper number 18

Abstract:

In order to evaluate the prosodic output of a speech synthesis system independently of its segmental quality, we have developed a special way to delexicalize speech stimuli which we call PURR (Prosody Unveiling through Restricted Representation). We compared the use of PURR stimuli for the evaluation of prosodic naturalness in three different test designs: magnitude estimation (ME), categorical estimation (CE), and ranking order (RO). Sentences of different types were synthesized by six German synthesis systems. The synthetic utterances and one human voice were comparatively judged by experienced listeners. On the whole, the results of all three methods are in good agreement. The choice of stimuli seems to be more important than the choice of method.

SL980018.PDF (From Author) SL980018.PDF (Rasterized)



SABLE: A Standard For TTS Markup

Authors:

Richard Sproat, Bell Labs, Lucent Technologies (USA)
Andrew Hunt, Sun Microsystems, Inc (USA)
Mari Ostendorf, Boston University (USA)
Paul Taylor, CSTR, University of Edinburgh (U.K.)
Alan W. Black, CSTR, University of Edinburgh (U.K.)
Kevin Lenzo, Carnegie Mellon University (USA)
Mike Eddington, BT Labs (U.K.)

Page (NA) Paper number 40

Abstract:

Currently, speech synthesizers are controlled by a multitude of proprietary tag sets. These tag sets vary substantially across synthesizers and are an inhibitor to the adoption of speech synthesis technology by developers. SABLE is an XML/SGML-based markup scheme for text-to-speech synthesis, developed to address the need for a common TTS control paradigm. This paper presents an overview of the SABLE v0.2 specification, and provides links to websites with further information on SABLE.
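Since SABLE is XML/SGML-based, a SABLE document can be produced and checked with any XML toolchain. The fragment below is an illustrative sketch only: the tag names (EMPH, BREAK, PITCH) follow the SABLE specification as commonly cited, but the exact v0.2 element and attribute syntax should be taken from the websites the paper links to.

```python
# Build and well-formedness-check a small SABLE-style document using the
# Python standard library. Tag names are illustrative, per the SABLE spec
# as commonly cited; consult the v0.2 specification for exact syntax.
import xml.etree.ElementTree as ET

sable_text = """<SABLE>
The ship departs at
<EMPH>nine</EMPH> tomorrow.
<BREAK/>
<PITCH BASE="high">Please arrive early.</PITCH>
</SABLE>"""

root = ET.fromstring(sable_text)  # raises ParseError if not well-formed
tags = [el.tag for el in root.iter()]
print(tags)  # ['SABLE', 'EMPH', 'BREAK', 'PITCH']
```

The point of a common markup scheme is exactly this: the same annotated text can be handed to any SABLE-conformant synthesizer instead of being rewritten for each vendor's proprietary tag set.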

SL980040.PDF (From Author) SL980040.PDF (Rasterized)



Prosodic vs. Segmental Contributions to Naturalness in a Diphone Synthesizer

Authors:

H. Timothy Bunnell, The duPont Hosp. for Children & Univ. of DE (USA)
Steve R. Hoskins, University of Delaware & duPont Hosp. for Children (USA)
Debra Yarrington, University of Delaware & duPont Hosp. for Children (USA)

Page (NA) Paper number 857

Abstract:

The relative contributions of segmental versus prosodic factors to the perceived naturalness of synthetic speech were measured by transplanting prosody between natural speech and the output of a diphone synthesizer. A small corpus was created containing matched sentence pairs, wherein one member of each pair was a natural utterance and the other was a synthetic utterance generated with diphone data from the same talker. Two additional sentences were formed from each sentence pair by transplanting the prosodic structure between the natural and synthetic members of the pair. In two listening experiments subjects were asked to (a) classify each sentence as "natural" or "synthetic", or (b) rate the naturalness of each sentence. Results showed that prosodic information was more important than segmental information in both classification and ratings of naturalness.
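The design described above is a 2x2 crossing of segmental source with prosody source. The sketch below makes that explicit; the condition labels are ours, not the authors'.

```python
# The four stimulus conditions per sentence pair implied by the abstract:
# crossing where the segments come from with where the prosody comes from.
# (natural, natural) and (synthetic, synthetic) are the original pair;
# the two mixed cells are the prosody-transplanted versions.
from itertools import product

SOURCES = ("natural", "synthetic")

def stimulus_conditions():
    return [
        {"segments": seg, "prosody": pros}
        for seg, pros in product(SOURCES, SOURCES)
    ]

for cond in stimulus_conditions():
    print(cond)
```

Comparing listener judgments across the two mixed cells is what isolates the prosodic contribution from the segmental one.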

SL980857.PDF (From Author) SL980857.PDF (Rasterized)
