ICSLP'98 Proceedings
A Phonologically Motivated Method of Selecting Non-Uniform Units
Authors:
Andrew P. Breen, BT Labs (U.K.)
Page (NA), Paper number 389
Abstract: This paper describes a method for selecting units from a database of recorded speech, for use in a concatenative speech synthesiser. The simplest approach is to store one example of every possible unit. A more powerful method is to have multiple examples of each unit. The challenge for such a method is to provide an efficient means of selecting units from a practical inventory, to give the best approximation to the desired sequence in some clearly specified way. The method used in BT's Laureate system uses mixed N-phone units. In theory such units could be of arbitrary size, but in practice they are constrained to a maximum of three phones. The unit sequence is generated dynamically, based on a global cost. Units are selected using purely phonologically motivated criteria, without reference to acoustic features, either desired or available within the inventory.
0648_01.WAV | audio file demonstrating the quality of the synthesis method. File type: sound. Format: WAV. Technical description: sample rate 11025 Hz, mono, linear encoding, 16 bits per sample. Creating application: unknown. Creating OS: Windows 95.
0648_02.WAV | audio file demonstrating the quality of the synthesis method. File type: sound. Format: WAV. Technical description: sample rate 11025 Hz, mono, linear encoding, 16 bits per sample. Creating application: unknown. Creating OS: Windows 95.
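The abstract's dynamic, global-cost selection can be illustrated with a Viterbi-style dynamic programme over candidate units. This is a hypothetical sketch, not Laureate's implementation: the numeric `target_cost` and `join_cost` functions are invented placeholders, whereas Laureate's actual criteria are phonological rather than acoustic.

```python
# Hypothetical sketch: Viterbi-style unit selection under a global cost.
# `target_cost` and `join_cost` are placeholder callables supplied by the user.

def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate unit per target so that the summed target
    and join costs over the whole sequence are minimal."""
    n = len(targets)
    # best[i][c] = (cheapest cumulative cost ending in candidate c, backpointer)
    best = [{c: (target_cost(targets[0], c), None) for c in candidates[0]}]
    for i in range(1, n):
        column = {}
        for c in candidates[i]:
            # cheapest predecessor, accounting for the join cost into c
            prev_c, prev_cost = min(
                ((p, cost + join_cost(p, c))
                 for p, (cost, _) in best[i - 1].items()),
                key=lambda pc: pc[1])
            column[c] = (prev_cost + target_cost(targets[i], c), prev_c)
        best.append(column)
    # trace back the globally cheapest path
    last = min(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

Because every column keeps the best cumulative cost per candidate, the chosen sequence is optimal under the global cost, not merely locally greedy at each join.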
Authors:
Ann K. Syrdal, AT&T Labs -- Research (USA)
Alistair Conkie, AT&T Labs -- Research (USA)
Yannis Stylianou, AT&T Labs -- Research (USA)
Abstract: It is often difficult to determine the suitability of a speaker to serve as a model for concatenative text-to-speech synthesis. The perceived quality of a speaker's natural voice is not necessarily predictive of its synthetic quality. The selection of female and male speakers on whom to base two synthetic voices for the new AT&T text-to-speech system was made empirically. Brief readings of identical text materials were recorded from professional speakers. Small-scale TTS systems were constructed with a minimal diphone inventory, suitable for synthesizing a limited number of test sentences. Synthesized sentences and their naturally spoken references were presented to listeners in a formal listening evaluation. In addition, a variety of acoustic measurements of the speakers were made in order to determine which acoustic characteristics correlated with subjective synthesis quality. The results have implications both for speaker selection and for improving concatenative synthesis methods.
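The abstract does not name the statistic used to relate acoustic measurements to subjective quality; a common choice for this kind of analysis is the Pearson correlation coefficient, sketched below as an illustration (the function name and any data fed to it are hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples,
    e.g. an acoustic measurement per speaker vs. a mean listener rating."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

A value near +1 or -1 would flag an acoustic characteristic that tracks the listening-test scores; a value near 0 would not.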
Authors:
Johan Wouters, Center for Spoken Language Understanding (USA)
Michael W. Macon, Center for Spoken Language Understanding (USA)
Abstract: In concatenative synthesis, new utterances are created by concatenating segments (units) of recorded speech. When the segments are extracted from a large speech corpus, a key issue is to select segments that will sound natural in a given phonetic context. Distance measures are often used for this task. However, little is known about the perceptual relevance of these measures. More insight into the relationship between computed distances and perceptual differences is needed to develop accurate unit selection algorithms, and to improve the quality of the resulting computer speech. In this paper, we develop a perceptual test to measure subtle phonetic differences between speech units. We use the perceptual data to evaluate several popular distance measures. The results show that distance measures that use frequency warping perform better than those that do not, and minimal extra advantage is gained by using weighted distances or delta features.
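To illustrate why frequency warping matters for such distance measures, the toy sketch below (not one of the measures evaluated in the paper) weights each spectral bin by the local slope of the standard mel curve, so that a difference at a low, perceptually finer frequency counts more than the same difference at a high frequency:

```python
import math

def hz_to_mel(f_hz):
    """Standard mel warping curve."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def spectral_distance(spec_a, spec_b, freqs_hz, mel_warp=False):
    """Weighted Euclidean distance between two log-magnitude spectra.
    With mel_warp=True each bin is weighted by the derivative of the mel
    curve at its frequency, d(mel)/df = (2595/ln 10) / (700 + f), which
    is largest at low frequencies."""
    if mel_warp:
        w = [2595.0 / math.log(10.0) / (700.0 + f) for f in freqs_hz]
    else:
        w = [1.0] * len(freqs_hz)
    return math.sqrt(sum(wi * (a - b) ** 2
                         for wi, a, b in zip(w, spec_a, spec_b)))
```

Without warping, equal-sized differences at 100 Hz and 4000 Hz yield identical distances; with warping, the low-frequency difference dominates, changing which candidate unit ranks as closest.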
Authors:
Mike Plumpe, Microsoft Research (USA)
Alex Acero, Microsoft Research (USA)
Hsiao-Wuen Hon, Microsoft Research (USA)
Xuedong Huang, Microsoft Research (USA)
Abstract: This paper focuses on our recent efforts to further improve the acoustic quality of the Whistler Text-to-Speech engine. We have developed an advanced smoothing system which, a small pilot study indicates, significantly improves quality. We represent speech as a sequence of frames, where each frame can be synthesized from a parameter vector. Each frame is represented by a state in an HMM, where the output distribution of each state is a Gaussian random vector over x and Δx (static and delta parameters). The set of vectors that maximizes the HMM probability is the representation of the smoothed speech output. This technique follows our traditional goal of developing methods whose parameters are automatically learned from data with minimal human intervention. The general framework is demonstrated to be robust by maintaining improved quality with a significant reduction in data.
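The idea of maximizing an HMM probability over static and delta parameters can be sketched in one dimension. Assuming unit variances and Gaussian state outputs (a simplification; this is not Whistler's implementation), maximizing the likelihood reduces to a least-squares trade-off between per-frame static targets and frame-to-frame delta targets, whose normal equations form a tridiagonal system:

```python
def smooth_trajectory(static_means, delta_means, w_static=1.0, w_delta=1.0):
    """Find the frame sequence x minimising
        sum_t w_static * (x[t] - static_means[t])**2
      + sum_t w_delta  * ((x[t] - x[t-1]) - delta_means[t-1])**2
    i.e. a maximum-likelihood compromise between static (x) and delta (Dx)
    targets.  Setting the gradient to zero gives a tridiagonal linear
    system, solved here with the Thomas algorithm."""
    n = len(static_means)
    a = [0.0] * n; b = [0.0] * n; c = [0.0] * n; r = [0.0] * n
    for t in range(n):
        b[t] = w_static
        r[t] = w_static * static_means[t]
        if t >= 1:                  # delta term linking x[t-1] -> x[t]
            b[t] += w_delta
            a[t] -= w_delta
            r[t] += w_delta * delta_means[t - 1]
        if t <= n - 2:              # delta term linking x[t] -> x[t+1]
            b[t] += w_delta
            c[t] -= w_delta
            r[t] -= w_delta * delta_means[t]
    # Thomas algorithm: forward elimination, then back-substitution
    for t in range(1, n):
        m = a[t] / b[t - 1]
        b[t] -= m * c[t - 1]
        r[t] -= m * r[t - 1]
    x = [0.0] * n
    x[-1] = r[-1] / b[-1]
    for t in range(n - 2, -1, -1):
        x[t] = (r[t] - c[t] * x[t + 1]) / b[t]
    return x
```

When the delta targets agree with the static targets the solution reproduces the static means exactly; when they conflict, the output is a smooth compromise, which is the smoothing effect described in the abstract.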
Authors:
Martin Holzapfel, SIEMENS AG (Germany)
Nick Campbell, ATR ITL (Japan)
Abstract: This paper describes an improved algorithm, motivated by fuzzy logic theory, for selecting speech segments for concatenative synthesis from a very large database. Triphone HMM clustering is employed as an adaptive measure of articulatory similarity within a given database. Stress level contours are evaluated in the context of their surrounding vocalic peaks. The algorithm uses a beam search technique to optimise both the suitability of each candidate unit for the desired target and the continuity of the concatenation.
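The beam-search step can be sketched as follows. This is a generic illustration, not the paper's algorithm: at each target position only the `beam_width` cheapest partial unit sequences are kept, with placeholder `target_cost` and `join_cost` functions standing in for the paper's suitability and continuity measures:

```python
# Hypothetical beam-search sketch for unit selection.
# Cost functions are placeholders supplied by the caller.

def beam_search_units(candidates, target_cost, join_cost, beam_width=3):
    """candidates[i] lists the units available for target position i.
    Returns the cheapest surviving (sequence, cumulative cost) pair."""
    beam = [([], 0.0)]              # partial sequences with cumulative costs
    for i, units in enumerate(candidates):
        extended = []
        for seq, cost in beam:
            for u in units:
                c = cost + target_cost(i, u)
                if seq:             # continuity cost at the concatenation join
                    c += join_cost(seq[-1], u)
                extended.append((seq + [u], c))
        extended.sort(key=lambda sc: sc[1])
        beam = extended[:beam_width]   # prune to the best hypotheses
    return beam[0]
```

Unlike full dynamic programming, pruning makes the search approximate but keeps the cost per position bounded, which matters when each position has many candidates drawn from a very large database.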