Authors:
Paul C. Bagshaw, France Telecom, CNET. (France)
Page (NA) Paper number 132
Abstract:
A new model of phone duration and energy is presented. These parameters
are modelled in two stages. The first stage builds a statistics tree
that contains phone duration and energy mean and standard deviation
values at each node. The branches of the tree are characterised by
a set of factors related to phonetic context. The second stage considers
phone duration and energy to be modified by two syllable-level prosodic
coefficients. The duration and energy of the phones of a syllable are
influenced to differing degrees by these coefficients. Weights are
associated with the different phone positions in a syllable. A simulated
annealing technique is used to find the set of weights that allow the
prosodic coefficients to be calculated for all syllables and, in turn,
minimise the error in predicting the phone duration and energy during
synthesis. Duration and energy are predicted with a mean squared error of
15.4 ms and 6.8 dB, respectively. During synthesis, the syllable-level prosodic coefficients
are predicted by regression trees from linguistic information. Manual
prosodic labelling is not required at any stage.
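
The two stages described above can be illustrated with a minimal Python
sketch. It is one possible reading of the abstract, not the author's
implementation: the context key (phone identity only), the z-score
normalisation, the hand-set position weights, and the linear rule
`mean + weight * coefficient * std` are all assumptions made for
illustration, and the simulated annealing search for the weights is omitted.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical training tokens:
# (phone, syllable_id, position_in_syllable, duration_ms)
TOKENS = [
    ("a", 0, 1, 90.0), ("a", 1, 1, 110.0), ("a", 2, 1, 100.0),
    ("t", 0, 0, 50.0), ("t", 1, 0, 70.0), ("t", 2, 0, 60.0),
]

# Assumed weights per phone position in the syllable. The abstract finds
# these by simulated annealing; here they are simply fixed by hand.
POSITION_WEIGHTS = {0: 0.5, 1: 1.0}


def build_stats(tokens):
    """Stage 1 (simplified): mean and standard deviation per context.

    The context here is just the phone identity; the abstract describes a
    tree over several phonetic-context factors.
    """
    groups = defaultdict(list)
    for phone, _, _, dur in tokens:
        groups[phone].append(dur)
    return {p: (mean(v), pstdev(v)) for p, v in groups.items()}


def syllable_coefficient(tokens, stats, weights, syllable_id):
    """Stage 2 (assumed form): least-squares coefficient k for one syllable
    under the model  duration = mean + weight * k * std."""
    num = den = 0.0
    for phone, syl, pos, dur in tokens:
        if syl != syllable_id:
            continue
        mu, sigma = stats[phone]
        z = (dur - mu) / sigma      # normalised deviation from the context mean
        w = weights[pos]
        num += w * z
        den += w * w
    return num / den


def predict_duration(phone, pos, k, stats, weights):
    """Apply the assumed linear rule to recover a duration in ms."""
    mu, sigma = stats[phone]
    return mu + weights[pos] * k * sigma


stats = build_stats(TOKENS)
k_short = syllable_coefficient(TOKENS, stats, POSITION_WEIGHTS, 0)
k_neutral = syllable_coefficient(TOKENS, stats, POSITION_WEIGHTS, 2)
```

In this toy data, syllable 0 contains shorter-than-average phones and gets a
negative coefficient, while syllable 2 sits exactly on the context means and
gets a coefficient of zero. Energy could be handled in the same way with a
second coefficient.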
Authors:
Jerome R. Bellegarda, Apple Computer, Inc. (USA)
Kim E.A. Silverman, Apple Computer, Inc. (USA)
Page (NA) Paper number 135
Abstract:
Accurate duration modeling is necessary for synthetic speech to sound
natural. Over the past few years, the sums-of-products framework has
emerged as an effective way to account for contextual influences on
phoneme duration. This approach is generally applied after log-transforming
the durations. This paper presents empirical and theoretical evidence
which suggests that this transformation is not optimal. A promising
alternative solution is proposed, based on a root sinusoidal function.
Preliminary experimental results were obtained on over 50,000 phonemes
in varied prosodic contexts. Compared to the log transformation, this
new transformation reduced the proportion of standard deviation unexplained
by approximately 30%. Alternatively, for a given level of performance,
the root sinusoidal transformation roughly halved the number of regression
parameters required.
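
As a point of reference for the log-transform baseline mentioned above, the
following sketch fits a purely additive model to log durations, which
corresponds to a multiplicative model of the raw durations. It is a
simplified illustration, not the sums-of-products model or the root
sinusoidal transformation of the paper: the two factors (phone and stress),
the toy data, and the helper names are assumptions. The transform is passed
in as a pair of functions so that an alternative transformation could be
substituted.

```python
import numpy as np


def one_hot_design(factors, n_levels):
    """Design matrix with an intercept and one-hot columns per factor.

    The first level of each factor is dropped so the columns stay
    linearly independent.
    """
    rows = []
    for levels in factors:
        row = [1.0]
        for level, n in zip(levels, n_levels):
            row.extend(1.0 if level == j else 0.0 for j in range(1, n))
        rows.append(row)
    return np.array(rows)


def fit_additive(durations, factors, n_levels, forward=np.log, inverse=np.exp):
    """Least-squares fit of an additive model to transformed durations.

    Returns a function mapping factor tuples back to predicted durations.
    """
    X = one_hot_design(factors, n_levels)
    y = forward(np.asarray(durations, dtype=float))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda f: inverse(one_hot_design(f, n_levels) @ coef)


# Toy data (hypothetical): duration = base[phone] * scale[stress]
base = [60.0, 100.0, 80.0]
scale = [1.0, 1.3]
factors = [(p, s) for p in range(3) for s in range(2)]
durations = [base[p] * scale[s] for p, s in factors]

predict = fit_additive(durations, factors, n_levels=(3, 2))
predicted = predict(factors)
```

Because the toy durations are exactly multiplicative, the log-domain fit
reproduces them; real durations would leave a residual whose size depends on
the transformation chosen.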
Authors:
Chilin Shih, Bell Laboratories, Lucent Technologies (USA)
Wentao Gu, Shanghai Jiaotong University (China)
Jan P.H. van Santen, Bell Laboratories, Lucent Technologies (USA)
Page (NA) Paper number 177
Abstract:
This paper discusses a methodology using a minimal set of sentences
to adapt an existing TTS duration model to capture inter-speaker variations.
The assumption is that the original duration database contains information
on both language-specific and speaker-specific duration characteristics.
In training a duration model for a new speaker, only the speaker-specific
information needs to be modeled, so the size of the training
data can be reduced drastically. Results from several experiments are
compared and discussed.
Authors:
Takayoshi Yoshimura, Nagoya Institute of Technology (Japan)
Keiichi Tokuda, Nagoya Institute of Technology (Japan)
Takashi Masuko, Tokyo Institute of Technology (Japan)
Takao Kobayashi, Tokyo Institute of Technology (Japan)
Tadashi Kitamura, Nagoya Institute of Technology (Japan)
Page (NA) Paper number 939
Abstract:
This paper proposes a new approach to state duration modeling for HMM-based
speech synthesis. A set of state durations of each phoneme HMM is
modeled by a multi-dimensional Gaussian distribution, and duration
models are clustered using a decision-tree-based context clustering
technique. In the synthesis stage, state durations are determined
by using the state duration models. In this paper, we take into account
contextual factors such as stress-related factors and locational
factors in addition to phone identity factors. Experimental results
show that we can synthesize good quality speech with natural timing,
and the speaking rate can be varied easily.
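
One way to determine state durations from Gaussian state duration models
while controlling the speaking rate is to shift each state's mean in
proportion to its variance so that the durations sum to a target length.
The abstract does not spell out the rule used, so the formula below, the
function name, and the example numbers are assumptions for illustration.

```python
def state_durations(means, variances, total_frames=None):
    """Return one duration per HMM state.

    With no target length, each duration is the Gaussian mean. With a target
    length T, use d_k = m_k + rho * var_k, where
    rho = (T - sum(means)) / sum(variances), so that states with larger
    variance absorb more of the change and the durations sum to T.
    """
    if total_frames is None:
        return list(means)
    rho = (total_frames - sum(means)) / sum(variances)
    return [m + rho * v for m, v in zip(means, variances)]


# Hypothetical 3-state phoneme model (durations in frames).
means = [4.0, 10.0, 6.0]
variances = [1.0, 4.0, 1.0]

normal = state_durations(means, variances)
slower = state_durations(means, variances, total_frames=26.0)
```

Here the slower rendition stretches the high-variance middle state the most.
A practical system would also round the results to whole frames and keep each
duration at least one frame long; that step is left out of this sketch.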