Text-To-Speech Synthesis 5

ICSLP'98 Proceedings

A Phonologically Motivated Method of Selecting Non-Uniform Units

Authors:

Andrew P. Breen, BT Labs (U.K.)
Peter Jackson, BT Labs (U.K.)

Page (NA) Paper number 389

Abstract:

This paper describes a method for selecting units from a database of recorded speech for use in a concatenative speech synthesiser. The simplest approach is to store one example of every possible unit; a more powerful approach is to store multiple examples of each. The challenge is then to select units efficiently from a practical inventory so as to give the best approximation, in some clearly specified sense, to the desired sequence. BT's Laureate system uses mixed N-phone units; in theory such units could be of arbitrary size, but in practice they are constrained to a maximum of three phones. The unit sequence is generated dynamically on the basis of a global cost, and units are selected using purely phonologically motivated criteria, without reference to acoustic features, either desired or available within the inventory.
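Global-cost selection of this kind is commonly implemented as a dynamic-programming (Viterbi) search over candidate units. The sketch below illustrates that general scheme only; it is not BT's Laureate code, and the cost interfaces (`target_cost`, `join_cost`) are assumptions for illustration:

```python
def select_units(targets, inventory, target_cost, join_cost):
    """Select a unit sequence minimising a global cost.

    targets: list of desired unit specifications (e.g. phone contexts).
    inventory: maps each specification to its candidate recorded units.
    target_cost(spec, unit): mismatch between a spec and a candidate.
    join_cost(prev_unit, unit): cost of concatenating two candidates.
    """
    # Viterbi search: best[i][u] = (cheapest cost of a path ending in
    # unit u at step i, backpointer to the previous unit).
    candidates = [inventory[t] for t in targets]
    best = [{u: (target_cost(targets[0], u), None) for u in candidates[0]}]
    for i in range(1, len(targets)):
        layer = {}
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            prev, cost = min(
                ((p, best[i - 1][p][0] + join_cost(p, u))
                 for p in candidates[i - 1]),
                key=lambda pc: pc[1])
            layer[u] = (cost + tc, prev)
        best.append(layer)
    # Back-trace from the cheapest final unit.
    u = min(best[-1], key=lambda x: best[-1][x][0])
    path = [u]
    for i in range(len(targets) - 1, 0, -1):
        u = best[i][u][1]
        path.append(u)
    return list(reversed(path))
```

With a purely phonological `target_cost` and `join_cost`, as in the paper, no acoustic features need to be consulted at selection time.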

SL980389.PDF (From Author) SL980389.PDF (Rasterized)

A Synthesis Method Based on Concatenation of Demisyllables and a Residual Excited Vocal Tract Model

Authors:

Steve Pearson, Panasonic Technologies, Inc./Speech Technology Lab (USA)
Nick Kibre, Panasonic Technologies, Inc./Speech Technology Lab (USA)
Nancy Niedzielski, Panasonic Technologies, Inc./Speech Technology Lab (USA)

Page (NA) Paper number 648

Abstract:

This paper describes the back-end of a new, flexible, high-quality TTS system. Preliminary results have demonstrated highly natural and intelligible output. Although the system follows some standard methodologies, such as concatenation, we have introduced a number of novel features and a combination of techniques that make our system unique. We describe many of the design decisions in detail and compare them with other known systems. A demonstration of the speech quality with implanted prosody is available in the waveform files stltts1.wav and stltts2.wav on the conference CD.

SL980648.PDF (From Author) SL980648.PDF (Rasterized)

0648_01.WAV (was: 0648_01.wav)
Audio file demonstrating the quality of the synthesis method.
File type: Sound File
Format: WAV
Tech. description: sample rate = 11025 Hz, mono, linear encoding, 16 bits per sample
Creating application: Unknown
Creating OS: Windows 95

0648_02.WAV (was: 0648_02.wav)
Audio file demonstrating the quality of the synthesis method.
File type: Sound File
Format: WAV
Tech. description: sample rate = 11025 Hz, mono, linear encoding, 16 bits per sample
Creating application: Unknown
Creating OS: Windows 95

Exploration of Acoustic Correlates in Speaker Selection for Concatenative Synthesis

Authors:

Ann K. Syrdal, AT&T Labs -- Research (USA)
Alistair Conkie, AT&T Labs -- Research (USA)
Yannis Stylianou, AT&T Labs -- Research (USA)

Page (NA) Paper number 882

Abstract:

It is often difficult to determine the suitability of a speaker to serve as a model for concatenative text-to-speech synthesis. The perceived quality of a speaker's natural voice is not necessarily predictive of its synthetic quality. The selection of female and male speakers on whom to base two synthetic voices for the new AT&T text-to-speech system was made empirically. Brief readings of identical text materials were recorded from professional speakers. Small-scale TTS systems were constructed with a minimal diphone inventory, suitable for synthesizing a limited number of test sentences. Synthesized sentences and their naturally spoken references were presented to listeners in a formal listening evaluation. In addition, a variety of acoustic measurements of the speakers were made in order to determine which acoustic characteristics correlated with subjective synthesis quality. The results have implications both for speaker selection and for improving concatenative synthesis methods.
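The correlation analysis the abstract mentions can be illustrated with a plain Pearson correlation between one per-speaker acoustic measure and mean listener ratings. The measure name and all numbers below are invented for illustration and are not the paper's data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-speaker data (invented): one acoustic measure,
# e.g. a harmonics-to-noise ratio in dB, against the mean opinion
# score each speaker's synthetic voice received in the evaluation.
hnr = [12.1, 15.3, 9.8, 14.0]
scores = [3.2, 4.1, 2.9, 3.8]
r = pearson(hnr, scores)
```

A measure whose correlation with the subjective scores is strong would be a candidate predictor for screening speakers before building a full voice.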

SL980882.PDF (From Author) SL980882.PDF (Rasterized)

A Perceptual Evaluation of Distance Measures for Concatenative Speech Synthesis

Authors:

Johan Wouters, Center for Spoken Language Understanding (USA)
Michael W. Macon, Center for Spoken Language Understanding (USA)

Page (NA) Paper number 905

Abstract:

In concatenative synthesis, new utterances are created by concatenating segments (units) of recorded speech. When the segments are extracted from a large speech corpus, a key issue is to select segments that will sound natural in a given phonetic context. Distance measures are often used for this task. However, little is known about the perceptual relevance of these measures. More insight into the relationship between computed distances and perceptual differences is needed to develop accurate unit selection algorithms, and to improve the quality of the resulting computer speech. In this paper, we develop a perceptual test to measure subtle phonetic differences between speech units. We use the perceptual data to evaluate several popular distance measures. The results show that distance measures that use frequency warping perform better than those that do not, and minimal extra advantage is gained by using weighted distances or delta features.
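To illustrate why frequency warping can help, the sketch below computes a Euclidean distance between two magnitude spectra after pooling bins into mel-spaced bands, so fine high-frequency detail counts for less, mimicking the ear's decreasing resolution. The band count and pooling scheme are assumptions for illustration, not the paper's definitions:

```python
import math

def hz_to_mel(f):
    # Standard mel-scale warping of a frequency in Hz.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_distance(spec_a, spec_b, sample_rate=16000, n_bands=12):
    """Euclidean distance between two magnitude spectra after mel warping.

    spec_a, spec_b: magnitude spectra of equal length on a linear
    frequency axis from 0 Hz to the Nyquist frequency.
    """
    n = len(spec_a)
    top = hz_to_mel(sample_rate / 2.0)
    bands_a, bands_b = [0.0] * n_bands, [0.0] * n_bands
    counts = [0] * n_bands
    for i in range(n):
        f = i * (sample_rate / 2.0) / (n - 1)
        b = min(int(hz_to_mel(f) / top * n_bands), n_bands - 1)
        bands_a[b] += spec_a[i]
        bands_b[b] += spec_b[i]
        counts[b] += 1
    d = 0.0
    for b in range(n_bands):
        if counts[b]:
            d += (bands_a[b] / counts[b] - bands_b[b] / counts[b]) ** 2
    return math.sqrt(d)
```

A distance computed directly on the linear-frequency bins would weight all regions of the spectrum equally, which is the behaviour the paper's results argue against.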

SL980905.PDF (From Author) SL980905.PDF (Rasterized)

HMM-Based Smoothing For Concatenative Speech Synthesis

Authors:

Mike Plumpe, Microsoft Research (USA)
Alex Acero, Microsoft Research (USA)
Hsiao-Wuen Hon, Microsoft Research (USA)
Xuedong Huang, Microsoft Research (USA)

Page (NA) Paper number 908

Abstract:

This paper focuses on our recent efforts to further improve the acoustic quality of the Whistler Text-to-Speech engine. We have developed an advanced smoothing system which, a small pilot study indicates, significantly improves quality. We represent speech as a sequence of frames, each of which can be synthesized from a parameter vector. Each frame is represented by a state in an HMM, where the output distribution of each state is a Gaussian random vector consisting of x and Δx. The set of vectors that maximizes the HMM probability is taken as the smoothed speech output. This technique follows our traditional goal of developing methods whose parameters are learned automatically from data with minimal human intervention. The general framework is shown to be robust: the improved quality is maintained under a significant reduction in data.
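Choosing the frame vectors that maximize a likelihood with Gaussians over both static and delta features leads, in the simplest one-dimensional case with first-order deltas, to a tridiagonal linear system. The sketch below is a minimal illustration of that idea under those simplifying assumptions; it is not the Whistler implementation:

```python
def smooth_trajectory(mu, var, dmu, dvar):
    """Maximum-likelihood trajectory from per-frame Gaussians.

    mu[t], var[t]: mean and variance of the static feature at frame t.
    dmu[t], dvar[t]: mean and variance of the delta x[t] - x[t-1],
    used for t >= 1 (index 0 is ignored).
    Setting the likelihood gradient to zero gives a symmetric
    tridiagonal system, solved here with the Thomas algorithm.
    """
    T = len(mu)
    # Build the tridiagonal system A x = b.
    diag = [1.0 / var[t] for t in range(T)]
    off = [0.0] * T            # off[t] couples x[t-1] and x[t]
    b = [mu[t] / var[t] for t in range(T)]
    for t in range(1, T):
        w = 1.0 / dvar[t]
        diag[t] += w
        diag[t - 1] += w
        off[t] = -w
        b[t] += w * dmu[t]
        b[t - 1] -= w * dmu[t]
    # Thomas algorithm: forward elimination, then back substitution.
    for t in range(1, T):
        m = off[t] / diag[t - 1]
        diag[t] -= m * off[t]
        b[t] -= m * b[t - 1]
    x = [0.0] * T
    x[-1] = b[-1] / diag[-1]
    for t in range(T - 2, -1, -1):
        x[t] = (b[t] - off[t + 1] * x[t + 1]) / diag[t]
    return x
```

Tight delta variances pull the trajectory toward smooth transitions, while tight static variances pull it toward the per-frame means, which is the trade-off a smoothing system of this kind exploits.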

SL980908.PDF (From Author) SL980908.PDF (Rasterized)

A Nonlinear Unit Selection Strategy for Concatenative Speech Synthesis Based on Syllable Level Features

Authors:

Martin Holzapfel, SIEMENS AG (Germany)
Nick Campbell, ATR ITL (Japan)

Page (NA) Paper number 521

Abstract:

This paper describes an improved algorithm, motivated by fuzzy logic theory, for the selection of speech segments for concatenative synthesis from a huge database. Triphone HMM clustering is employed as an adaptive measure for articulatory similarity within a given database. Stress level contours are evaluated in the context of their surrounding vocalic peaks. The algorithm uses a beam search technique to optimise the suitability of each candidate unit to realise the desired target as well as continuity in concatenation.
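The beam-search step of such a selection algorithm can be sketched generically: at each target position, expand every surviving hypothesis with every candidate unit and keep only the cheapest few. The paper's fuzzy-logic weighting and HMM-cluster similarity measures are not modelled here; the cost interfaces are assumptions for illustration:

```python
import heapq

def beam_select(targets, inventory, target_cost, join_cost, beam=3):
    """Beam-search unit selection.

    Keeps only the `beam` cheapest partial paths at each step instead
    of the full dynamic-programming table, trading optimality for speed
    on a large database. Each hypothesis is (total_cost, unit_sequence).
    """
    hyps = [(target_cost(targets[0], u), (u,)) for u in inventory[targets[0]]]
    hyps = heapq.nsmallest(beam, hyps)
    for spec in targets[1:]:
        expanded = []
        for cost, seq in hyps:
            for u in inventory[spec]:
                c = cost + target_cost(spec, u) + join_cost(seq[-1], u)
                expanded.append((c, seq + (u,)))
        # Prune to the `beam` cheapest hypotheses before the next step.
        hyps = heapq.nsmallest(beam, expanded)
    return list(min(hyps)[1])
```

Narrowing the beam makes selection faster but risks pruning the globally optimal sequence; widening it approaches the exhaustive search.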

SL980521.PDF (From Author) SL980521.PDF (Rasterized)
