Text-To-Speech Synthesis 3

The speech synthesis system described in this paper uses a set of speaker-dependent, decision-tree state-clustered hidden Markov models (HMMs) to automatically generate a leaf-level segmentation of a large single-speaker continuous-read-speech database. During synthesis, the phone sequence to be synthesised is converted to an acoustic leaf sequence by descending the HMM decision trees. Duration, energy and pitch values are predicted using separate trainable models. To determine the segment sequence to concatenate, a dynamic programming (d.p.) search is performed over all the waveform segments aligned to each leaf in training. The d.p. attempts to ensure that the selected segments join each other spectrally, and have durations, energies and pitches such that the amount of degradation introduced by the subsequent use of TD-PSOLA is minimised. The selected segments are then concatenated and modified to have the required prosodic values using TD-PSOLA. The d.p. results in the system effectively selecting variable-length units.
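The dynamic programming search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` and `Target` types, the scalar `start_spec`/`end_spec` stand-ins for spectral frames, and the particular cost formulas are all assumptions made for illustration. Energy is omitted for brevity and could be added to the target cost in the same way as pitch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    """Hypothetical candidate waveform segment aligned to one leaf."""
    seg_id: str
    duration: float    # seconds
    pitch: float       # Hz
    start_spec: float  # scalar stand-in for the spectrum at the left edge
    end_spec: float    # scalar stand-in for the spectrum at the right edge


@dataclass(frozen=True)
class Target:
    """Hypothetical predicted prosodic values for one leaf."""
    duration: float
    pitch: float


def target_cost(seg: Segment, tgt: Target) -> float:
    # Assumed proxy for TD-PSOLA degradation: relative prosodic mismatch.
    return (abs(seg.duration - tgt.duration) / tgt.duration
            + abs(seg.pitch - tgt.pitch) / tgt.pitch)


def join_cost(prev: Segment, nxt: Segment) -> float:
    # Assumed proxy for spectral discontinuity at the concatenation point.
    return abs(prev.end_spec - nxt.start_spec)


def select_segments(candidates: Sequence[Sequence[Segment]],
                    targets: Sequence[Target]) -> List[Segment]:
    """Return the minimum-cost segment sequence, one segment per leaf."""
    if not candidates or len(candidates) != len(targets):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every leaf needs at least one candidate")

    # best[i][j]: cost of the cheapest path ending in candidates[i][j].
    best = [[target_cost(s, targets[0]) for s in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for seg in candidates[i]:
            costs = [best[i - 1][k] + join_cost(prev, seg)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(seg, targets[i]))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final state.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path
```

In a sketch like this, segments that were adjacent in the original recording would naturally have a near-zero join cost, so the search tends to pick runs of contiguous segments. That is one plausible reading of how the d.p. ends up selecting variable-length units.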

SL980166.PDF (From Author) SL980166.PDF (Rasterized)

0166_01.WAV

(was: 0166.WAV)

0166.WAV is the synthetic sentence ``When a sailor in a small craft faces the might of the vast Atlantic Ocean today, he takes the same risks as generations took before him.''.
File type: Sound File
Format: Sound File: WAV
Tech. description: 16kHz, 16 bits per sample, mono, PCM.
Creating Application: Unknown
Creating OS: Unknown

ProSynth: An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis

Authors:

Sarah Hawkins, University of Cambridge (U.K.)
Jill House, University College London (U.K.)
Mark Huckvale, University College London (U.K.)
John Local, University of York (U.K.)
Richard Ogden, University of York (U.K.)

Page (NA) Paper number 538

Abstract:

This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoustic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by structural richness produces a perceptually robust signal intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech, segmentally, temporally and intonationally.

SL980538.PDF (From Author) SL980538.PDF (Rasterized)

Total Quality Evaluation of Speech Synthesis Systems

Authors:

Jialu Zhang, Institute of Acoustics, Academia Sinica (China)
Shiwei Dong, Institute of Acoustics, Academia Sinica (China)
Ge Yu, Institute of Acoustics, Academia Sinica (China)

Page (NA) Paper number 60

Abstract:

Building on performance assessments of speech synthesis systems for Chinese, total quality evaluations of such systems have been carried out regularly since 1994. The total quality evaluation comprises speech intelligibility tests at different levels (syllable, word and sentence), a speech naturalness test and an anti-interference ability test for the phonetic module, and a text processing ability test for the linguistic module. The design principles of the testing materials and the testing methods are briefly described, and the test results for four Chinese text-to-speech (TTS) systems are presented in this paper. It is shown that: 1. The professional technicians who took part in the speech naturalness test alongside the testing crew all overestimated the overall quality of the four tested systems; 2. The word intelligibility and Semantically Unpredictable Sentence (SUS) scores are good measures for evaluating speech synthesis systems; 3. The anti-interference ability of synthetic speech is rather weak: its syllable intelligibility is about 20 per cent lower than that of natural speech at S/N = 5 dB.

SL980060.PDF (From Author) SL980060.PDF (Rasterized)

Comparative Evaluation of Synthetic Prosody with the PURR Method

Authors:

Gerit P. Sonntag, Institut für Kommunikationsforschung und Phonetik (IKP), Universität Bonn (Germany)
Thomas Portele, Institut für Kommunikationsforschung und Phonetik (IKP), Universität Bonn (Germany)

Page (NA) Paper number 18

Abstract:

In order to evaluate the prosodic output of a speech synthesis system independently from its segmental quality, we have developed a special way to delexicalize speech stimuli which we call PURR (Prosody Unveiling through Restricted Representation). We compared the use of PURR stimuli for the evaluation of prosodic naturalness in three different test designs: magnitude estimation (ME), categorical estimation (CE), and ranking order (RO). Sentences of different types were synthesized by six German synthesis systems. The synthetic utterances and one human voice were comparatively judged by experienced listeners. On the whole the results of all three methods are in good agreement. Choice of stimuli seems to be more important than the choice of method.

SL980018.PDF (From Author) SL980018.PDF (Rasterized)

SABLE: A Standard For TTS Markup

Authors:

Richard Sproat, Bell Labs, Lucent Technologies (USA)
Andrew Hunt, Sun Microsystems, Inc (USA)
Mari Ostendorf, Boston University (USA)
Paul Taylor, CSTR, University of Edinburgh (U.K.)
Alan W. Black, CSTR, University of Edinburgh (U.K.)
Kevin Lenzo, Carnegie Mellon University (USA)
Mike Eddington, BT Labs (U.K.)

Page (NA) Paper number 40

Abstract:

Currently, speech synthesizers are controlled by a multitude of proprietary tag sets. These tag sets vary substantially across synthesizers and are an inhibitor to the adoption of speech synthesis technology by developers. SABLE is an XML/SGML-based markup scheme for text-to-speech synthesis, developed to address the need for a common TTS control paradigm. This paper presents an overview of the SABLE v0.2 specification, and provides links to websites with further information on SABLE.

SL980040.PDF (From Author) SL980040.PDF (Rasterized)

Prosodic vs. Segmental Contributions to Naturalness in a Diphone Synthesizer

Authors:

H. Timothy Bunnell, The duPont Hosp. for Children & Univ. of DE (USA)
Steve R. Hoskins, University of Delaware & duPont Hosp. for Children (USA)
Debra Yarrington, University of Delaware & duPont Hosp. for Children (USA)

Page (NA) Paper number 857

Abstract:

The relative contributions of segmental versus prosodic factors to the perceived naturalness of synthetic speech were measured by transplanting prosody between natural speech and the output of a diphone synthesizer. A small corpus was created containing matched sentence pairs wherein one member of the pair was a natural utterance and the other was a synthetic utterance generated with diphone data from the same talker. Two additional sentences were formed from each sentence pair by transplanting the prosodic structure between the natural and synthetic members of each pair. In two listening experiments, subjects were asked to (a) classify each sentence as "natural" or "synthetic", or (b) rate the naturalness of each sentence. Results showed that the prosodic information was more important than segmental information in both classification and ratings of naturalness.