ICSLP'98 Proceedings
The IBM Trainable Speech Synthesis System
Authors:
Robert E. Donovan, IBM TJ Watson Research Center (USA)
Page (NA) Paper number 166
Abstract:
The speech synthesis system described in this paper uses a set of speaker-dependent decision-tree state-clustered hidden Markov models to automatically generate a leaf level segmentation of a large single-speaker continuous-read-speech database. During synthesis, the phone sequence to be synthesised is converted to an acoustic leaf sequence by descending the HMM decision trees. Duration, energy and pitch values are predicted using separate trainable models. To determine the segment sequence to concatenate, a dynamic programming (d.p.) search is performed over all the waveform segments aligned to each leaf in training. The d.p. attempts to ensure that the selected segments join each other spectrally, and have durations, energies and pitches such that the amount of degradation introduced by the subsequent use of TD-PSOLA is minimised; the selected segments are concatenated and modified to have the required prosodic values using TD-PSOLA. The d.p. results in the system effectively selecting variable length units.
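The dynamic programming search described in this abstract can be illustrated with a minimal sketch. It is not the IBM system's implementation: the `Segment` and `Target` classes, the scalar `start_spec`/`end_spec` stand-ins for spectral edges, and both cost functions are assumptions made for illustration, since the abstract does not give the actual cost definitions. The sketch only shows the general shape of a search that picks one training segment per leaf so that joins are smooth and prosodic modification is small.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    """A waveform segment aligned to one leaf in training (hypothetical fields)."""
    seg_id: str
    duration: float    # seconds
    energy: float      # arbitrary units
    pitch: float       # Hz
    start_spec: float  # scalar stand-in for the spectrum at the left edge
    end_spec: float    # scalar stand-in for the spectrum at the right edge


@dataclass(frozen=True)
class Target:
    """Predicted prosodic values for one leaf (hypothetical container)."""
    duration: float
    energy: float
    pitch: float


def target_cost(seg: Segment, tgt: Target) -> float:
    # Assumed proxy for TD-PSOLA degradation: summed relative deviation
    # of the segment's prosody from the predicted values.
    return (abs(seg.duration - tgt.duration) / tgt.duration
            + abs(seg.energy - tgt.energy) / tgt.energy
            + abs(seg.pitch - tgt.pitch) / tgt.pitch)


def join_cost(left: Segment, right: Segment) -> float:
    # Assumed proxy for spectral mismatch at the concatenation point.
    return abs(left.end_spec - right.start_spec)


def select_segments(candidates: Sequence[Sequence[Segment]],
                    targets: Sequence[Target]) -> List[Segment]:
    """Return the minimum-cost sequence with one segment per leaf."""
    if not candidates or len(candidates) != len(targets):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every leaf needs at least one candidate segment")

    # best[i][j]: cheapest total cost of any path ending in candidates[i][j].
    best = [[target_cost(s, targets[0]) for s in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for seg in candidates[i]:
            costs = [best[i - 1][k] + join_cost(prev, seg)
                     for k, prev in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(seg, targets[i]))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final segment.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path
```

In a sketch like this, segments that were adjacent in the training recording would join at near-zero cost, so runs of consecutive segments tend to be chosen together. That is one plausible reading of why such a search ends up selecting variable length units.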
0166_01.WAV (was: 0166.WAV): 0166.WAV is the synthetic sentence "When a sailor in a small craft faces the might of the vast Atlantic Ocean today, he takes the same risks as generations took before him." File type: Sound File. Format: Sound File: WAV. Tech. description: 16kHz, 16 bits per sample, mono, PCM. Creating Application: Unknown. Creating OS: Unknown.
Sarah Hawkins, University of Cambridge (U.K.)
Jill House, University College London (U.K.)
Mark Huckvale, University College London (U.K.)
John Local, University of York (U.K.)
Richard Ogden, University of York (U.K.)
This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoustic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by structural richness produces a perceptually robust signal intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech, segmentally, temporally and intonationally.
Jialu Zhang, Institute of Acoustics, Academia Sinica (China)
Shiwei Dong, Institute of Acoustics, Academia Sinica (China)
Ge Yu, Institute of Acoustics, Academia Sinica (China)
Based on the performance assessment of speech synthesis systems for Chinese, a total quality evaluation of these systems has been carried out regularly since 1994. The total quality evaluation includes speech intelligibility tests at different levels (syllable, word and sentence), a speech naturalness test and an anti-interference ability test for the phonetic module, and a text processing ability test for the linguistic module. The design principles of the testing materials and the testing methods are briefly described, and the test results of four text-to-speech (TTS) systems for Chinese are presented in this paper. It is shown that: 1. All the professional technicians who joined the testing crew in the speech naturalness test overestimated the overall quality of the four tested systems; 2. The word intelligibility and Semantically Unpredicted Sentence (SUS) scores are good for evaluating speech synthesis systems; 3. The anti-interference ability of synthetic speech is rather weak: its syllable intelligibility is about 20 per cent lower than that of natural speech at S/N = 5 dB.
Gerit P. Sonntag, Institut für Kommunikationsforschung und Phonetik (IKP), Universität Bonn (Germany)
Thomas Portele, Institut für Kommunikationsforschung und Phonetik (IKP), Universität Bonn (Germany)
In order to evaluate the prosodic output of a speech synthesis system independently from its segmental quality, we have developed a special way to delexicalize speech stimuli which we call PURR (Prosody Unveiling through Restricted Representation). We compared the use of PURR stimuli for the evaluation of prosodic naturalness in three different test designs: magnitude estimation (ME), categorical estimation (CE), and ranking order (RO). Sentences of different types were synthesized by six German synthesis systems. The synthetic utterances and one human voice were comparatively judged by experienced listeners. On the whole the results of all three methods are in good agreement. Choice of stimuli seems to be more important than the choice of method.
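Agreement between test designs such as ME, CE and RO is often summarised with a rank correlation over the per-system results. The sketch below computes Spearman's rank correlation in plain Python. It is an illustration only: the abstract does not say which agreement measure the authors used, and the function names `ranks` and `spearman` are assumptions introduced here.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5
```

When one method scores higher-is-better (as in magnitude estimation) and another reports rank positions where lower is better, good agreement shows up as a value near -1 rather than +1, so the sign has to be interpreted against each scale's direction.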
Richard Sproat, Bell Labs, Lucent Technologies (USA)
Andrew Hunt, Sun Microsystems, Inc (USA)
Mari Ostendorf, Boston University (USA)
Paul Taylor, CSTR, University of Edinburgh (U.K.)
Alan W. Black, CSTR, University of Edinburgh (U.K.)
Kevin Lenzo, Carnegie Mellon University (USA)
Mike Eddington, BT Labs (U.K.)
Currently, speech synthesizers are controlled by a multitude of proprietary tag sets. These tag sets vary substantially across synthesizers and are an inhibitor to the adoption of speech synthesis technology by developers. SABLE is an XML/SGML-based markup scheme for text-to-speech synthesis, developed to address the need for a common TTS control paradigm. This paper presents an overview of the SABLE v0.2 specification, and provides links to websites with further information on SABLE.
H. Timothy Bunnell, The duPont Hosp. for Children & Univ. of DE (USA)
Steve R. Hoskins, University of Delaware & duPont Hosp. for Children (USA)
Debra Yarrington, University of Delaware & duPont Hosp. for Children (USA)
The relative contributions of segmental versus prosodic factors to the perceived naturalness of synthetic speech were measured by transplanting prosody between natural speech and the output of a diphone synthesizer. A small corpus was created containing matched sentence pairs wherein one member of the pair was a natural utterance and the other was a synthetic utterance generated with diphone data from the same talker. Two additional sentences were formed from each sentence pair by transplanting the prosodic structure between the natural and synthetic members of each pair. In two listening experiments, subjects were asked to (a) classify each sentence as "natural" or "synthetic", or (b) rate the naturalness of each sentence. Results showed that the prosodic information was more important than segmental information in both classification and ratings of naturalness.