Text-To-Speech Synthesis 1

ICSLP'98 Proceedings
Unsupervised Training of Phone Duration and Energy Models for Text-to-Speech Synthesis

Authors:

Paul C. Bagshaw, France Telecom, CNET. (France)

Paper number 132

Abstract:

A new model of phone duration and energy is presented. These parameters are modelled in two stages. The first stage builds a statistics tree whose nodes hold mean and standard deviation values for phone duration and energy. The branches of the tree are characterised by a set of factors related to phonetic context. The second stage considers phone duration and energy to be modified by two syllable-level prosodic coefficients. The duration and energy of the phones of a syllable are influenced to differing degrees by these coefficients, so weights are associated with the different phone positions in a syllable. A simulated annealing technique finds the set of weights that allows the prosodic coefficients to be calculated for all syllables and, in turn, minimises the error in predicting phone duration and energy during synthesis. Duration and energy are predicted with mean squared errors of 15.4 ms and 6.8 dB respectively. During synthesis, the syllable-level prosodic coefficients are predicted from linguistic information by regression trees. Manual prosodic labelling is not required at any stage.
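The weight search in the second stage can be sketched with a toy simulated-annealing loop. The data layout, cost function, and all numeric parameters below are invented for illustration; this is not the paper's actual formulation, only the general shape of such a search:

```python
import math
import random

def anneal_weights(syllables, n_positions=3, steps=2000, t0=1.0, seed=0):
    """Toy simulated-annealing search for per-position phone weights.

    `syllables` is a list of (durations, target_coeff) pairs: normalised
    phone durations by syllable position, and the prosodic coefficient the
    weighted sum should reproduce.  All shapes and constants are
    illustrative, not taken from the paper.
    """
    rng = random.Random(seed)

    def cost(w):
        err = 0.0
        for durs, target in syllables:
            pred = sum(wi * d for wi, d in zip(w, durs))
            err += (pred - target) ** 2
        return err

    weights = [1.0] * n_positions
    current = cost(weights)
    best_w, best_c = list(weights), current
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-6  # linear cooling schedule
        cand = [wi + rng.gauss(0.0, 0.1) for wi in weights]
        c = cost(cand)
        # Accept improvements always; worse moves with Boltzmann probability.
        if c < current or rng.random() < math.exp((current - c) / temp):
            weights, current = cand, c
            if c < best_c:
                best_w, best_c = list(cand), c
    return best_w, best_c
```

Tracking the best-ever candidate (rather than the last accepted one) is a common safeguard, since late worse moves may otherwise be returned.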

SL980132.PDF (From Author) SL980132.PDF (Rasterized)

Improved Duration Modeling of English Phonemes Using a Root Sinusoidal Transformation

Authors:

Jerome R. Bellegarda, Apple Computer, Inc. (USA)
Kim E.A. Silverman, Apple Computer, Inc. (USA)

Paper number 135

Abstract:

Accurate duration modeling is necessary for synthetic speech to sound natural. Over the past few years, the sums-of-products framework has emerged as an effective way to account for contextual influences on phoneme duration. This approach is generally applied after log-transforming the durations. This paper presents empirical and theoretical evidence suggesting that this transformation is not optimal, and proposes a promising alternative based on a root sinusoidal function. Preliminary experimental results were obtained on over 50,000 phonemes in varied prosodic contexts. Compared to the log transformation, the new transformation reduced the unexplained proportion of the standard deviation by approximately 30%. Alternatively, for a given level of performance, the root sinusoidal transformation roughly halved the number of regression parameters required.
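As a reference point for the transformation being questioned, a log-domain sums-of-products predictor might look as follows. The factor names and parameter values are hypothetical, and the paper's root sinusoidal transform itself is not reproduced here; only the baseline log/exp pairing is shown:

```python
import math

def sop_log_duration(factor_values, params, products):
    """Sums-of-products duration prediction in the log domain.

    Predicts log-duration as a sum over products of factor scales:
        log d = sum_i prod_{f in product_i} params[f][factor_values[f]]
    and returns d = exp(log d).  Factor names and parameter values are
    invented for illustration; the log/exp transform is exactly the step
    the paper argues can be improved upon.
    """
    log_d = 0.0
    for product in products:
        term = 1.0
        for factor in product:
            term *= params[factor][factor_values[factor]]
        log_d += term
    return math.exp(log_d)

# Hypothetical parameters: a phone-identity scale and a stress multiplier.
params = {
    "identity": {"ae": 4.6, "t": 3.9},
    "stress": {True: 1.05, False: 0.95},
}
products = [["identity", "stress"]]  # one product term for this sketch
```

Under these made-up numbers, a stressed "ae" gets duration exp(4.6 × 1.05) ≈ 125 ms; a different transform would replace the exp/log pair while keeping the sums-of-products structure intact.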

SL980135.PDF (From Author) SL980135.PDF (Scanned)

Efficient Adaptation of TTS Duration Model to New Speakers

Authors:

Chilin Shih, Bell Laboratories, Lucent Technologies (USA)
Wentao Gu, Shanghai Jiaotong University (China)
Jan P.H. van Santen, Bell Laboratories, Lucent Technologies (USA)

Paper number 177

Abstract:

This paper discusses a methodology that uses a minimal set of sentences to adapt an existing TTS duration model to capture inter-speaker variations. The assumption is that the original duration database contains both language-specific and speaker-specific duration characteristics. When training a duration model for a new speaker, only the speaker-specific information needs to be modeled, so the size of the training data can be reduced drastically. Results from several experiments are compared and discussed.
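The economics of adaptation can be illustrated with a deliberately minimal stand-in: re-estimate only a per-speaker affine map on top of the baseline model's predictions. The paper's actual method is richer than this; the sketch just shows why few sentences can suffice when only speaker-specific parameters are learned:

```python
def fit_affine_adaptation(base_durs, speaker_durs):
    """Fit d_new ~ a * d_base + b by ordinary least squares.

    `base_durs` are durations predicted by the existing (language-general)
    model; `speaker_durs` are observed durations from the new speaker on
    the same segments.  Two parameters absorb speaker-specific rate and
    offset, so a handful of adaptation sentences can estimate them.
    This is an illustrative stand-in, not the paper's procedure.
    """
    n = len(base_durs)
    mx = sum(base_durs) / n
    my = sum(speaker_durs) / n
    sxx = sum((x - mx) ** 2 for x in base_durs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(base_durs, speaker_durs))
    a = sxy / sxx          # speaker-specific rate scale
    b = my - a * mx        # speaker-specific offset
    return a, b
```

With only two free parameters, even a single sentence of aligned durations pins them down, whereas retraining the full contextual model would need the original database's coverage.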

SL980177.PDF (From Author) SL980177.PDF (Rasterized)

Duration Modeling for HMM-Based Speech Synthesis

Authors:

Takayoshi Yoshimura, Nagoya Institute of Technology (Japan)
Keiichi Tokuda, Nagoya Institute of Technology (Japan)
Takashi Masuko, Tokyo Institute of Technology (Japan)
Takao Kobayashi, Tokyo Institute of Technology (Japan)
Tadashi Kitamura, Nagoya Institute of Technology (Japan)

Paper number 939

Abstract:

This paper proposes a new approach to state duration modeling for HMM-based speech synthesis. The set of state durations of each phoneme HMM is modeled by a multi-dimensional Gaussian distribution, and the duration models are clustered using a decision-tree-based context clustering technique. In the synthesis stage, state durations are determined using these duration models. In addition to phone identity factors, we take into account contextual factors such as stress-related and locational factors. Experimental results show that good-quality speech can be synthesized with natural timing, and that the speaking rate can be varied easily.
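One common way to determine state durations from per-state Gaussian models, given a target total duration, is to shift each mean in proportion to its variance, with a single rate factor shared across states; whether this matches the paper's exact procedure is an assumption, and the numbers below are illustrative:

```python
def state_durations(means, variances, total_frames):
    """Determine state durations from Gaussian duration models.

    Given per-state duration means m_k and variances s_k^2, choose
        d_k = m_k + rho * s_k^2,
    with rho fixed so that the durations sum to `total_frames`.  Varying
    `total_frames` (equivalently rho) scales the speaking rate while
    letting high-variance states absorb more of the change.  A sketch of
    the general technique, not necessarily the paper's exact formulation.
    """
    rho = (total_frames - sum(means)) / sum(variances)
    return [m + rho * v for m, v in zip(means, variances)]
```

For example, with state means (3, 5, 2) frames and variances (1, 2, 1), a target of 12 frames gives rho = 0.5, so the middle, most variable state stretches the most.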

SL980939.PDF (From Author) SL980939.PDF (Rasterized)
