ICASSP '98 Main Page
Abstract - SP9

SP9.1
TD-PSOLA versus Harmonic Plus Noise Model in Diphone Based Speech Synthesis
A. Syrdal,
Y. Stylianou,
L. Garrison,
A. Conkie,
J. Schroeter (AT&T Labs, USA)
In an effort to select a speech representation for our next-generation concatenative text-to-speech synthesizer, two candidates are investigated: TD-PSOLA and the Harmonic plus Noise Model (HNM). A formal listening test was conducted in which the two candidates were rated for intelligibility, naturalness, and pleasantness. Suitability for database compression and computational load are also discussed. The results show that HNM consistently outperforms TD-PSOLA on all of the above criteria except computational load. HNM allows high-quality speech synthesis without smoothing problems at segmental boundaries and without the buzziness or other artifacts observed with TD-PSOLA.
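For readers unfamiliar with the representation, HNM models each voiced frame as a sum of harmonics of the fundamental plus a stochastic noise component. A minimal sketch (all parameter values and the fixed frame length are assumptions for illustration, not taken from the paper, which additionally limits harmonics to a time-varying maximum voiced frequency):

```python
import numpy as np

def hnm_frame(f0, harmonic_amps, noise_std, sr=16000, frame_len=320):
    """Synthesize one voiced frame as a sum of harmonics of f0 plus
    a crude stochastic noise part (illustrative only)."""
    t = np.arange(frame_len) / sr
    harmonic = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t)
                   for k, a in enumerate(harmonic_amps))
    noise = noise_std * np.random.randn(frame_len)  # noise component
    return harmonic + noise
```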
SP9.2
A Hybrid Approach to Synthesize High Quality Cantonese Speech
M. Chu,
P. Ching (Chinese University of Hong Kong, P R China)
Synthesizing high-quality speech requires an intelligent modification algorithm that adjusts the important prosodic features of pre-stored speech units to meet requirements such as smoothness, naturalness, and pleasantness. TD-PSOLA is a simple but effective method for varying the pitch and time scale of speech, and it can produce high-quality synthetic output. However, when the prosodic pattern requires a drastic modification of the spectral content, TD-PSOLA often generates speech with reverberation. This paper develops a hybrid synthesis method based on TD-PSOLA and a shape-invariant sinusoidal technique to alleviate the reverberation problem. It is particularly useful for generating Cantonese speech, since it can cope with the rapidly changing pitch profile of Cantonese, a monosyllabic and tonal language. The proposed method has been applied to construct a Cantonese synthesizer, which is shown to be capable of producing high-quality Cantonese speech without reverberation.
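TD-PSOLA's pitch modification can be sketched as extracting a two-period windowed grain around each analysis pitch mark and overlap-adding the grains at rescaled spacing. A minimal sketch, assuming integer pitch-mark positions and voiced speech only (the paper's hybrid adds a shape-invariant sinusoidal stage on top of such a scheme; real systems also handle unvoiced segments and time-scaling):

```python
import numpy as np

def td_psola_pitch_shift(x, marks, factor):
    """Pitch-shift voiced speech by `factor` via TD-PSOLA-style
    overlap-add of two-period Hann grains (illustrative sketch)."""
    out = np.zeros(len(x))
    marks = np.asarray(marks)
    syn_pos = int(marks[0])
    while syn_pos < marks[-1]:
        # pick the analysis mark nearest the current synthesis position
        i = int(np.argmin(np.abs(marks[:-1] - syn_pos)))
        period = int(marks[i + 1] - marks[i])
        lo_a, hi_a = marks[i] - period, marks[i] + period
        lo_s, hi_s = syn_pos - period, syn_pos + period
        if lo_a >= 0 and hi_a <= len(x) and lo_s >= 0 and hi_s <= len(out):
            out[lo_s:hi_s] += x[lo_a:hi_a] * np.hanning(2 * period)
        syn_pos += max(1, round(period / factor))  # smaller spacing -> higher pitch
    return out
```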
SP9.3
A System for Voice Conversion Based on Probabilistic Classification and a Harmonic Plus Noise Model
Y. Stylianou (AT&T Labs - Research, USA);
O. Cappé (ENST, France)
Voice conversion is defined as modifying the speech signal of one speaker (the source speaker) so that it sounds as if it had been pronounced by a different speaker (the target speaker). This paper describes a system for efficient voice conversion. A novel mapping function is presented which associates the acoustic space of the source speaker with the acoustic space of the target speaker. The proposed system uses a Gaussian Mixture Model (GMM) to model a speaker's acoustic space and a pitch-synchronous harmonic plus noise representation of the speech signal for prosodic modifications. The mapping function is a continuous parametric function which takes into account the probabilistic classification provided by the GMM. Objective evaluation showed that the proposed system reduced the perceptual distance between the source and target speakers by 70%. Formal listening tests also showed that 97% of the converted speech was judged to have been spoken by the target speaker while maintaining high speech quality.
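A GMM-based mapping function of this kind is typically a posterior-weighted sum of per-component linear regressions from source to target space. A simplified sketch with diagonal covariances (shapes and variable names are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def gmm_map(x, weights, mu_x, mu_y, var_x, cross_var):
    """Probability-weighted locally linear source-to-target mapping.
    x: source vector (d,); weights: (m,) mixture weights;
    mu_x, mu_y: (m, d) means; var_x, cross_var: (m, d) diagonal terms."""
    # posterior p(i | x) under diagonal Gaussians
    log_lik = -0.5 * (((x - mu_x) ** 2) / var_x
                      + np.log(2 * np.pi * var_x)).sum(axis=1)
    post = weights * np.exp(log_lik - log_lik.max())
    post /= post.sum()
    # per-component regression toward the target space
    y_i = mu_y + cross_var / var_x * (x - mu_x)
    return post @ y_i
```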
SP9.4
Spectral Voice Conversion for Text-to-Speech Synthesis
A. Kain,
M. Macon (Oregon Graduate Institute, USA)
A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study the effect of the amount of training data on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones with a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. Perceptual tests showed that nearly optimal spectral conversion performance was achieved even with a small amount of training data. However, speech quality improved as the training set size increased.
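Joint density estimation means fitting the mixture model to stacked source/target vectors z = [x; y], from which each component yields a linear transform. The single-component case can be sketched as follows (illustrative, not the authors' exact procedure):

```python
import numpy as np

def joint_linear_transform(X, Y):
    """Fit one joint Gaussian to paired source/target frames and return
    the conditional-mean map y_hat = A @ x + b (the single-component
    case of joint-density GMM training; shapes are illustrative)."""
    Z = np.hstack([X, Y])               # paired frames: (n, dx + dy)
    mu = Z.mean(axis=0)
    C = np.cov(Z, rowvar=False)
    dx = X.shape[1]
    A = C[dx:, :dx] @ np.linalg.inv(C[:dx, :dx])  # Sigma_yx Sigma_xx^-1
    b = mu[dx:] - A @ mu[:dx]
    return A, b
```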
SP9.5
Speaker Transformation using Sentence HMM Based Alignments and Detailed Prosody Modification
L. Arslan,
D. Talkin (Entropic Research Lab, USA)
This paper presents several improvements to our voice conversion system, which we refer to as the Speaker Transformation Algorithm using Segmental Codebooks (STASC). First, a new concept, the sentence HMM, is introduced for aligning speech waveforms that share the same text. This alignment technique allows reliable, high-resolution mapping between two speech waveforms. In addition, it is observed that energy and speaking-rate differences between two speakers are not constant across all phonemes; therefore, a codebook-based duration and energy scaling algorithm is proposed. Finally, a more detailed pitch modification is introduced that takes into account pitch-range differences between source and target speakers in addition to mean pitch-level differences. The proposed changes improve the quality of the transformed speech. Subjective listening tests showed that, for a male-to-male transformation, intelligibility is maintained at the same level as natural speech after speaker transformation.
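Matching both mean pitch level and pitch range amounts to a linear normalization of F0 statistics. A common form of such a mapping (a sketch; the paper's exact formulation may differ):

```python
def convert_pitch(f0_src, mean_s, std_s, mean_t, std_t):
    """Map a source F0 value into the target speaker's pitch range by
    matching both the mean and the standard deviation of F0."""
    return mean_t + (std_t / std_s) * (f0_src - mean_s)
```

With this form, a source F0 one standard deviation above the source mean lands one target standard deviation above the target mean, rather than merely being offset by the mean difference.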
SP9.6
Automatic Generation of Synthesis Units for Trainable Text-to-Speech Systems
H. Hon,
A. Acero,
X. Huang,
J. Liu,
M. Plumpe (Microsoft Research, USA)
The Whistler text-to-speech engine was designed so that its model parameters can be constructed automatically from training data. This paper describes in detail the design issues involved in constructing the synthesis-unit inventory automatically from recorded databases. The automatic process includes (1) determining scalable synthesis units that reflect the spectral variations of different allophones; (2) segmenting the recorded sentences into phonetic segments; and (3) selecting good instances of each synthesis unit so as to generate the best synthetic sentence at run time. These steps are all derived through probabilistic learning methods aimed at the same optimization criteria. Through automatic unit generation, Whistler can produce synthetic speech that sounds very natural and resembles the acoustic characteristics of the original speaker.
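Selecting good unit instances at run time is commonly cast as a Viterbi search minimizing a target cost per instance plus a concatenation cost between adjacent instances. An illustrative sketch (the cost functions and data layout are assumptions, not Whistler's actual criteria):

```python
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """Viterbi search: pick one instance per unit position minimizing
    total target + concatenation cost.
    candidates: list (per position) of lists of candidate instances."""
    n = len(candidates)
    # best[i][j] = min total cost of a path ending in candidate j at i
    best = [np.array([target_cost(c) for c in candidates[0]])]
    back = []
    for i in range(1, n):
        costs = np.array([target_cost(c) for c in candidates[i]])
        trans = np.array([[concat_cost(p, c) for p in candidates[i - 1]]
                          for c in candidates[i]])
        total = trans + best[-1]            # shape (cur, prev)
        back.append(total.argmin(axis=1))
        best.append(costs + total.min(axis=1))
    # trace back the cheapest path of instance indices
    path = [int(best[-1].argmin())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```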
SP9.7
Optimizing a Neural Net for Speaker and Task Dependent Pitch Contour Generation
H. Ralf,
H. Martin (Siemens AG, Germany)
The generation of a pleasant pitch contour is important for the naturalness of any TTS system. To date, the results are far from satisfactory. In this paper we present a speaker- and task-specific approach realized by a neural network. Personal and task-specific characteristics are maintained, and the demand for generalization decreases, so the results in application can be significantly improved. Using an optimized network structure, global as well as well-localized patterns can be covered and trained simultaneously within one network. Correlation analysis of the database versus the sensitivity of the trained network validates the importance of distinctive parameters in training. Based on this comparison, we discuss the generalization properties of the neural network with respect to speaker and task dependency. Finally, varying the context range helps to find an optimized tuning of the input parameter set.
SP9.8
Practical High-Quality Speech and Voice Synthesis Using Fixed Frame Rate ABS/OLA Sinusoidal Modeling
E. George (Texas Instruments, USA)
This paper describes algorithms developed to apply the Analysis-by-Synthesis/Overlap-Add (ABS/OLA) sinusoidal modeling system to real-time speech and singing voice synthesis. As originally proposed, the ABS/OLA system is limited to unidirectional time-scaling, and relies on variable frame length to accomplish time-scale modification. For speech and voice synthesis applications, unidirectional time scaling makes effective looping to produce sustained vocal sounds difficult, and variable frame length makes real-time polyphonic synthesis problematic. This paper presents a reformulation of the basic ABS/OLA system to deal with these issues, which is termed Fixed-Rate ABS/OLA (ABS/OLA-FR).
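Fixed-frame-rate time-scale modification can be illustrated with a plain overlap-add loop: synthesis frames advance at a fixed hop while analysis frames are read at a scaled position. A minimal sketch (ABS/OLA additionally fits a sinusoidal model to each frame; the frame and hop sizes here are assumptions):

```python
import numpy as np

def ola_timescale(x, rate, frame_len=512, hop=128):
    """Overlap-add time-scale modification with fixed synthesis frame
    length and hop: analysis frames are read at `rate` times the
    synthesis hop, so rate < 1 stretches and rate > 1 compresses."""
    win = np.hanning(frame_len)
    n_frames = int((len(x) - frame_len) / (hop * rate))
    out = np.zeros(n_frames * hop + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        a = int(i * hop * rate)        # analysis read position
        s = i * hop                    # fixed-rate synthesis position
        out[s:s + frame_len] += win * x[a:a + frame_len]
        norm[s:s + frame_len] += win
    return out / np.maximum(norm, 1e-8)  # compensate window overlap
```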