ICSLP'98 Proceedings
A Phonologically Motivated Method of Selecting Non-Uniform Units
Authors:
Andrew P. Breen, BT Labs (U.K.)
Page (NA), Paper number 389
Abstract: This paper describes a method for selecting units from a database of recorded speech, for use in a concatenative speech synthesiser. The simplest approach is to store one example of every possible unit. A more powerful method is to have multiple examples of each unit. The challenge for such a method is to provide an efficient means of selecting units from a practical inventory, to give the best approximation to the desired sequence in some clearly specified way. The method used in BT's Laureate system uses mixed N-phone units. In theory such units could be of arbitrary size, but in practice they are constrained to a maximum of three phones. The unit sequence is generated dynamically, based on a global cost. Units are selected using purely phonologically motivated criteria, without reference to acoustic features, either desired or available within the inventory.
0648_01.WAV | audio file demonstrating the quality of the synthesis method. File type: sound. Format: WAV. Technical description: sample rate 11025 Hz, mono, linear encoding, 16 bits per sample. Creating application: unknown. Creating OS: Windows 95.
0648_02.WAV | audio file demonstrating the quality of the synthesis method. File type: sound. Format: WAV. Technical description: sample rate 11025 Hz, mono, linear encoding, 16 bits per sample. Creating application: unknown. Creating OS: Windows 95.
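The abstract's dynamic, global-cost selection can be illustrated with a Viterbi-style dynamic programme over candidate units. This is a hypothetical sketch, not Laureate's implementation: the numeric `target_cost` and `join_cost` functions are invented placeholders, whereas Laureate's actual criteria are phonological rather than acoustic.

```python
# Hypothetical sketch: Viterbi-style unit selection under a global cost.
# `target_cost` and `join_cost` are placeholder callables supplied by the user.

def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate unit per target so that the summed target
    and join costs over the whole sequence are minimal."""
    n = len(targets)
    # best[i][c] = (cheapest cumulative cost ending in candidate c, backpointer)
    best = [{c: (target_cost(targets[0], c), None) for c in candidates[0]}]
    for i in range(1, n):
        column = {}
        for c in candidates[i]:
            # cheapest predecessor, accounting for the join cost into c
            prev_c, prev_cost = min(
                ((p, cost + join_cost(p, c))
                 for p, (cost, _) in best[i - 1].items()),
                key=lambda pc: pc[1])
            column[c] = (prev_cost + target_cost(targets[i], c), prev_c)
        best.append(column)
    # trace back the globally cheapest path
    last = min(best[-1], key=lambda c: best[-1][c][0])
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

Because every column keeps the best cumulative cost per candidate, the chosen sequence is optimal under the global cost, not merely locally greedy at each join.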
Authors:
Ann K. Syrdal, AT&T Labs -- Research (USA)
Alistair Conkie, AT&T Labs -- Research (USA)
Yannis Stylianou, AT&T Labs -- Research (USA)
Abstract: It is often difficult to determine the suitability of a speaker to serve as a model for concatenative text-to-speech synthesis. The perceived quality of a speaker's natural voice is not necessarily predictive of its synthetic quality. The selection of female and male speakers on whom to base two synthetic voices for the new AT&T text-to-speech system was made empirically. Brief readings of identical text materials were recorded from professional speakers. Small-scale TTS systems were constructed with a minimal diphone inventory, suitable for synthesizing a limited number of test sentences. Synthesized sentences and their naturally spoken references were presented to listeners in a formal listening evaluation. In addition, a variety of acoustic measurements of the speakers were made in order to determine which acoustic characteristics correlated with subjective synthesis quality. The results have implications both for speaker selection and for improving concatenative synthesis methods.
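The abstract does not name the statistic used to relate acoustic measurements to subjective quality; a common choice for this kind of analysis is the Pearson correlation coefficient, sketched below as an illustration (the function name and any data fed to it are hypothetical):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples,
    e.g. an acoustic measurement per speaker vs. a mean listener rating."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)
```

A value near +1 or -1 would flag an acoustic characteristic that tracks the listening-test scores; a value near 0 would not.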
Authors:
Johan Wouters, Center for Spoken Language Understanding (USA)
Michael W. Macon, Center for Spoken Language Understanding (USA)
Abstract: In concatenative synthesis, new utterances are created by concatenating segments (units) of recorded speech. When the segments are extracted from a large speech corpus, a key issue is to select segments that will sound natural in a given phonetic context. Distance measures are often used for this task. However, little is known about the perceptual relevance of these measures. More insight into the relationship between computed distances and perceptual differences is needed to develop accurate unit selection algorithms, and to improve the quality of the resulting computer speech. In this paper, we develop a perceptual test to measure subtle phonetic differences between speech units. We use the perceptual data to evaluate several popular distance measures. The results show that distance measures that use frequency warping perform better than those that do not, and minimal extra advantage is gained by using weighted distances or delta features.
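To illustrate why frequency warping matters for such distance measures, the toy sketch below (not one of the measures evaluated in the paper) weights each spectral bin by the local slope of the standard mel curve, so that a difference at a low, perceptually finer frequency counts more than the same difference at a high frequency:

```python
import math

def hz_to_mel(f_hz):
    """Standard mel warping curve."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def spectral_distance(spec_a, spec_b, freqs_hz, mel_warp=False):
    """Weighted Euclidean distance between two log-magnitude spectra.
    With mel_warp=True each bin is weighted by the derivative of the mel
    curve at its frequency, d(mel)/df = (2595/ln 10) / (700 + f), which
    is largest at low frequencies."""
    if mel_warp:
        w = [2595.0 / math.log(10.0) / (700.0 + f) for f in freqs_hz]
    else:
        w = [1.0] * len(freqs_hz)
    return math.sqrt(sum(wi * (a - b) ** 2
                         for wi, a, b in zip(w, spec_a, spec_b)))
```

Without warping, equal-sized differences at 100 Hz and 4000 Hz yield identical distances; with warping, the low-frequency difference dominates, changing which candidate unit ranks as closest.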
Authors:
Mike Plumpe, Microsoft Research (USA)
Alex Acero, Microsoft Research (USA)
Hsiao-Wuen Hon, Microsoft Research (USA)
Xuedong Huang, Microsoft Research (USA)
Abstract: This paper focuses on our recent efforts to further improve the acoustic quality of the Whistler Text-to-Speech engine. We have developed an advanced smoothing system which, a small pilot study indicates, significantly improves quality. We represent speech as a sequence of frames, where each frame can be synthesized from a parameter vector. Each frame is represented by a state in an HMM, where the output distribution of each state is a Gaussian random vector over x and Δx (static and delta parameters). The set of vectors that maximizes the HMM probability is the representation of the smoothed speech output. This technique follows our traditional goal of developing methods whose parameters are automatically learned from data with minimal human intervention. The general framework is demonstrated to be robust by maintaining improved quality with a significant reduction in data.
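The idea of maximizing an HMM probability over static and delta parameters can be sketched in one dimension. Assuming unit variances and Gaussian state outputs (a simplification; this is not Whistler's implementation), maximizing the likelihood reduces to a least-squares trade-off between per-frame static targets and frame-to-frame delta targets, whose normal equations form a tridiagonal system:

```python
def smooth_trajectory(static_means, delta_means, w_static=1.0, w_delta=1.0):
    """Find the frame sequence x minimising
        sum_t w_static * (x[t] - static_means[t])**2
      + sum_t w_delta  * ((x[t] - x[t-1]) - delta_means[t-1])**2
    i.e. a maximum-likelihood compromise between static (x) and delta (Dx)
    targets.  Setting the gradient to zero gives a tridiagonal linear
    system, solved here with the Thomas algorithm."""
    n = len(static_means)
    a = [0.0] * n; b = [0.0] * n; c = [0.0] * n; r = [0.0] * n
    for t in range(n):
        b[t] = w_static
        r[t] = w_static * static_means[t]
        if t >= 1:                  # delta term linking x[t-1] -> x[t]
            b[t] += w_delta
            a[t] -= w_delta
            r[t] += w_delta * delta_means[t - 1]
        if t <= n - 2:              # delta term linking x[t] -> x[t+1]
            b[t] += w_delta
            c[t] -= w_delta
            r[t] -= w_delta * delta_means[t]
    # Thomas algorithm: forward elimination, then back-substitution
    for t in range(1, n):
        m = a[t] / b[t - 1]
        b[t] -= m * c[t - 1]
        r[t] -= m * r[t - 1]
    x = [0.0] * n
    x[-1] = r[-1] / b[-1]
    for t in range(n - 2, -1, -1):
        x[t] = (r[t] - c[t] * x[t + 1]) / b[t]
    return x
```

When the delta targets agree with the static targets the solution reproduces the static means exactly; when they conflict, the output is a smooth compromise, which is the smoothing effect described in the abstract.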
Authors:
Martin Holzapfel, SIEMENS AG (Germany)
Nick Campbell, ATR ITL (Japan)
Abstract: This paper describes an improved algorithm, motivated by fuzzy logic theory, for selecting speech segments for concatenative synthesis from a very large database. Triphone HMM clustering is employed as an adaptive measure of articulatory similarity within a given database. Stress level contours are evaluated in the context of their surrounding vocalic peaks. The algorithm uses a beam search technique to optimise both the suitability of each candidate unit for the desired target and the continuity of the concatenation.
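The beam-search step can be sketched as follows. This is a generic illustration, not the paper's algorithm: at each target position only the `beam_width` cheapest partial unit sequences are kept, with placeholder `target_cost` and `join_cost` functions standing in for the paper's suitability and continuity measures:

```python
# Hypothetical beam-search sketch for unit selection.
# Cost functions are placeholders supplied by the caller.

def beam_search_units(candidates, target_cost, join_cost, beam_width=3):
    """candidates[i] lists the units available for target position i.
    Returns the cheapest surviving (sequence, cumulative cost) pair."""
    beam = [([], 0.0)]              # partial sequences with cumulative costs
    for i, units in enumerate(candidates):
        extended = []
        for seq, cost in beam:
            for u in units:
                c = cost + target_cost(i, u)
                if seq:             # continuity cost at the concatenation join
                    c += join_cost(seq[-1], u)
                extended.append((seq + [u], c))
        extended.sort(key=lambda sc: sc[1])
        beam = extended[:beam_width]   # prune to the best hypotheses
    return beam[0]
```

Unlike full dynamic programming, pruning makes the search approximate but keeps the cost per position bounded, which matters when each position has many candidates drawn from a very large database.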