Speech Synthesis


Shape-Invariant Pitch and Time-Scale Modification of Speech by Variable Order Phase Interpolation

Authors:

Mat P. Pollard, University of Liverpool (U.K.)
Barry M.G. Cheetham, University of Liverpool (U.K.)
Colin C. Goodyear, University of Liverpool (U.K.)
Mike D. Edgington, B.T. Laboratories (U.K.)

Volume 2, Page 919

Abstract:

To preserve the waveform shape and perceived quality of pitch and time-scale modified sinusoidally modelled voiced speech, the phases of the sinusoids used to model the glottal excitation are made to add coherently at estimated pitch pulse locations. The glottal excitation is therefore made to resemble a pseudo-periodic impulse train, a quality essential for shape-invariance. Conventional methods attempt to maintain the coherence once per synthesis frame by interpolating the phase through a single modified pitch pulse location, a time where all excitation phases are assumed to be integer multiples of 2π. Whilst this is adequate for small degrees of modification, the coherence is lost when the required amount of modification is increased. This paper presents a technique which is capable of better preserving the impulse-like nature of the glottal excitation whilst allowing its phases to evolve slowly through time.
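The coherence idea above can be illustrated with a minimal sketch (this is not the paper's variable-order interpolation method, only the underlying principle): if each harmonic's phase is an integer multiple of 2π at an estimated pitch pulse time, the sinusoids add coherently there and the excitation resembles an impulse train.

```python
import numpy as np

# Illustrative sketch only: sum of harmonics whose phases are forced to be
# integer multiples of 2*pi at a pitch pulse location t0, so they add
# coherently there. All parameter values below are assumptions.
def coherent_excitation(f0, n_harmonics, t, t0):
    """Sum sinusoids phase-locked to a pitch pulse at time t0."""
    signal = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        # phase of harmonic k is zero (mod 2*pi) exactly at t = t0
        signal += np.cos(2 * np.pi * k * f0 * (t - t0))
    return signal

fs = 8000.0
t = np.arange(0, 0.02, 1 / fs)   # one 20 ms frame
x = coherent_excitation(f0=100.0, n_harmonics=20, t=t, t0=0.01)
# peaks recur at pitch-pulse-aligned times (every 80 samples at 100 Hz)
assert np.argmax(x) % 80 == 0
```

Pitch or time-scale modification then amounts to moving the pulse locations t0 while keeping this phase relationship, which is what breaks down in single-pulse-per-frame interpolation at large modification factors.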

ic970919.pdf




A Chinese Text-to-Speech System Based on Part-of-Speech Analysis, Prosodic Modeling and Non-Uniform Units

Authors:

Fu-Chiang Chou, NTU (Taiwan)
Chiu-Yu Tseng, Academia Sinica (Taiwan)
Keh-Jiann Chen, Academia Sinica (Taiwan)
Lin-Shan Lee, Academia Sinica (Taiwan)

Volume 2, Page 923

Abstract:

This paper presents a new Chinese text-to-speech system that produces very natural and intelligible synthetic Mandarin speech based on part-of-speech analysis, prosodic modeling and non-uniform units. The distinguishing features and key technologies of the system can be summarized as follows: (1) A text analysis module for word identification and tagging was developed based on part-of-speech modeling, using heuristic rules to achieve very high accuracy. (2) The prosodic parameters required for the synthetic speech are derived by a two-stage procedure: the prosodic structures of the input texts are first derived from a statistical model trained on a large speech database, and the prosodic parameters are then determined according to those structures. (3) A specially designed speech segment inventory constructed with non-uniform, pitch-dependent units is used to improve the fluency and intelligibility of the system.

ic970923.pdf




Automatic Prosodic Modeling for Speaker and Task Adaptation in Text-to-Speech

Authors:

Eduardo Lopez-Gonzalo, ETSI Telecomunicación, UP Madrid (Spain)
Jose M. Rodriguez-Garcia, ETSI Telecomunicación, UP Madrid (Spain)
Luis Hernandez-Gomez, ETSI Telecomunicación, UP Madrid (Spain)
Juan M. Villar, ETSI Telecomunicación, UP Madrid (Spain)

Volume 2, Page 927

Abstract:

One of the most important demands on future TTS systems is the ability to improve naturalness when embedded in a particular task or application that requires a particular speaking style for a particular speaker. In this paper, we present a new prosodic modeling procedure for improving naturalness by adapting a TTS system to a new speaker and a new speaking style. The proposed procedure is an extension of the automatic data-driven methodology presented in [1] to model both fundamental frequency and segmental duration. Automatic linguistic and acoustic analyses are performed on both a task-dependent text corpus and the recorded material from the selected speaker.

ic970927.pdf




Prosody Generation with a Neural Network: Weighing the Importance of Input Parameters

Authors:

Gerit P. Sonntag, University of Bonn (Germany)
Thomas Portele, University of Bonn (Germany)
Barbara Heuft, Lernout and Hauspie (Belgium)

Volume 2, Page 931

Abstract:

As an alternative to synthesis-by-rule, neural networks have been successfully applied to prosody generation in speech synthesis, yet it is not known precisely which input parameters are responsible for good results. The approach presented here tries to quantify the contribution of each input parameter. This is done first by comparing the mean errors of networks each trained with only one parameter, and by examining the performance of a group of networks each of which lacks one parameter. In a second approach, different networks were perceptually evaluated in a pair-comparison test with synthesized stimuli.
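The leave-one-parameter-out idea can be sketched as follows; a least-squares linear model stands in for the paper's neural network, and the synthetic data and parameter names are purely illustrative assumptions.

```python
import numpy as np

# Sketch: fit a predictor with all inputs, then refit with each input
# withheld; the error increase when a parameter is missing indicates its
# contribution. Synthetic data: y depends strongly on parameter 0, weakly
# on parameter 1, and not at all on parameter 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 3 candidate input parameters
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

def fit_error(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
    return np.sqrt(np.mean((X @ w - y) ** 2))  # RMS prediction error

base = fit_error(X, y)
for i in range(X.shape[1]):
    reduced = np.delete(X, i, axis=1)          # network lacking parameter i
    print(f"without parameter {i}: error rises by {fit_error(reduced, y) - base:.3f}")
```

Removing the strongly weighted parameter raises the error far more than removing the irrelevant one, which is the ranking the paper extracts from its one-parameter and leave-one-out network families.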

ic970931.pdf




Evaluation of a speech synthesis method for nonlinear modeling of vocal folds vibration effect

Authors:

Hiroshi Ohmura, ETL (Japan)
Kazuyo Tanaka, ETL (Japan)

Volume 2, Page 935

Abstract:

In this paper, we present a new speech synthesis method for improving voice quality in parametric rule-based speech synthesis systems. We also describe the results of a preference test on speech wave reconstruction to confirm the performance of the proposed method. The method is based on a functional approximation of the vocal tract resonance produced by nonlinear interaction between the glottis and the vocal tract. In the performance test, evaluators listen to two kinds of reconstructed speech samples: one synthesized by the proposed method and the other by an ordinary LPC (linear predictive coding) based method. The speech sample set used in this test contains 60 sentences uttered by four speakers. Results show that the proposed method is superior in quality.

ic970935.pdf




Generation of F0 Contour using Stochastic Mapping and Vector Quantization Control Parameters

Authors:

Byeon Heo-Jin, KAIST (Korea)
Kim Yeon-Jun, KAIST (Korea)
Oh Yung-Hwan, KAIST (Korea)

Volume 2, Page 939

Abstract:

This paper introduces an F0 contour generation method for text-to-speech synthesis using stochastic mapping and vector quantization control parameters. This model uses a new F0 contour labelling scheme based on the RFC (Rise/Fall/Connection) model, which describes F0 contour patterns with seven F0 labels and three pause labels. This paper also suggests an efficient selection method for control parameters instead of using the mean values of the control parameters. We achieved 78.06% accuracy in F0 label prediction and 95.87% accuracy in pause label prediction using this model. The experimental results show that speech synthesized using vector quantization control parameters is more natural than speech synthesized using the mean values of the feature parameters.
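An RFC-style contour can be sketched by concatenating rise, fall and connection segments, each described by an amplitude and a duration. The cosine segment shape and all numeric values below are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

# Illustrative RFC (Rise/Fall/Connection) contour: each segment is
# (type, amplitude in Hz, duration in s). Rises and falls are rendered
# with a smooth cosine shape; connections drift linearly.
def rfc_contour(segments, f0_start=120.0, step=0.01):
    f0, contour = f0_start, []
    for kind, amp, dur in segments:
        n = max(int(round(dur / step)), 1)
        x = np.linspace(0.0, 1.0, n)
        if kind in ("rise", "fall"):
            shape = (1 - np.cos(np.pi * x)) / 2      # smooth 0 -> 1
            delta = amp if kind == "rise" else -amp
        else:                                        # "connection"
            shape, delta = x, amp                    # linear drift
        contour.extend(f0 + delta * shape)
        f0 += delta                                  # segment ends here
    return np.array(contour)

c = rfc_contour([("rise", 40, 0.10), ("connection", -5, 0.15), ("fall", 30, 0.12)])
```

In the paper's setting, the segment labels are predicted stochastically from the text and the control parameters (amplitudes, durations) are drawn from vector-quantized codebooks rather than fixed as here.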

ic970939.pdf




Spectral Normalization Employing Hidden Markov Modeling of Line Spectrum Pair Frequencies

Authors:

Bryan L. Pellom, Duke University (U.S.A.)
John H.L. Hansen, Duke University (U.S.A.)

Volume 2, Page 943

Abstract:

This paper proposes a spectral normalization approach in which the acoustical qualities of an input speech waveform are mapped onto those of a desired neutral voice. Such a method can be effective in reducing the impact of speaker variability such as accent, stress, and emotion for speech recognition. In the proposed method, the transformation is performed by modeling the temporal characteristics of the Line Spectrum Pair (LSP) frequencies of the neutral voice using hidden Markov models. The overall approach is integrated into a pitch synchronous overlap-and-add (PSOLA) analysis/synthesis framework. The algorithm is objectively evaluated using a distance measure based on the log-likelihood of observing the input (or normalized input) speech given Gaussian mixture speaker models for both the input and desired neutral voices. Results using the Gaussian mixture model based criterion demonstrate consistent normalization on a 10-speaker database.

ic970943.pdf




Time Domain Technique For Pitch Modification And Robust Voice Transformation

Authors:

Rivarol Vergin, INRS-Telecom (Canada)
Douglas O'Shaughnessy, INRS-Telecom (Canada)
Azarshid Farhat, INRS-Telecom (Canada)

Volume 2, Page 947

Abstract:

Modification of speech is a subject of major interest today, with numerous applications including text-to-speech synthesis. The basic mechanisms behind this process often consist of pitch-scale and time-scale modifications of speech. While these generally give good results, in most cases the same speaker can still be associated with the original signal and its modified version, which limits the use of such techniques in applications where disguising voices is necessary. This paper presents an approach that increases the possibilities of speech modification while preserving most of the speech quality of the original signal.

ic970947.pdf




A New Fundamental Frequency Modification Algorithm with Transformation of Spectrum Envelope According to F0

Authors:

Kimihito Tanaka, NTT HI Labs. (Japan)
Masanobu Abe, NTT HI Labs. (Japan)

Volume 2, Page 951

Abstract:

This paper proposes a new speech modification algorithm which makes it possible to change the fundamental frequency (F0) while preserving high quality. One novel point of the algorithm is that the spectrum envelope is transformed according to the amount of F0 modification. Based on a codebook mapping formulation, transformation rules are generated using speech data uttered in a different F0 range. The rules have two purposes: one is transforming the spectrum envelope of the low frequency band, and the other is adjusting the balance between low-band power and high-band power. The proposed algorithm is applied to a text-to-speech system based on waveform concatenation, and good performance is confirmed by listening tests.
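The codebook-mapping step can be sketched as follows: an input envelope is quantized against a source codebook and replaced by the paired entry from a target codebook trained on speech in a different F0 range. The random codebooks and dimensions below are stand-in assumptions, not the paper's trained data.

```python
import numpy as np

# Sketch of codebook-mapping spectral transformation. In practice the two
# codebooks are trained on paired speech from the same speaker uttered in
# different F0 ranges; here they are random placeholders for illustration.
rng = np.random.default_rng(2)
src_codebook = rng.normal(size=(16, 12))   # envelopes at the original F0
tgt_codebook = rng.normal(size=(16, 12))   # paired envelopes at the new F0

def transform_envelope(env, src_cb, tgt_cb):
    # quantize against the source codebook, emit the paired target entry
    idx = np.argmin(np.sum((src_cb - env) ** 2, axis=1))
    return tgt_cb[idx]

env = src_codebook[5] + 0.01 * rng.normal(size=12)   # near codeword 5
out = transform_envelope(env, src_codebook, tgt_codebook)
assert np.allclose(out, tgt_codebook[5])
```

The paper applies this mapping only to the low-frequency band and adds a separate rule for rebalancing low-band and high-band power.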

ic970951.pdf




Reliability Assessment And Evaluation Of Objectively Measured Descriptors For Perceptual Speaker Characterization

Authors:

Burhan F. Necioglu, Georgia Institute of Technology (U.S.A.)
Mark A. Clements, Georgia Institute of Technology (U.S.A.)
Thomas P. Barnwell, Georgia Institute of Technology (U.S.A.)

Volume 2, Page 955

Abstract:

With the more widespread use of lower bit rate speech coders, the evaluation of speaker recognizability becomes a major issue to be addressed, as well as the evaluation of overall voice quality. Furthermore, subjective quality evaluation of speech coders may produce different results depending on the voice character of the speakers used in the evaluation process. It follows naturally that methods and procedures to characterize speakers perceptually must be devised. In this paper, we report on an enhanced set of objective descriptors of the speech waveform, assessing the reliability of their measurements as well as their merit in discriminating utterances from different speakers. Of the 45 measures presented, 35 have less than 10% RMS measurement error, and 25 of those have less than 5%.

ic970955.pdf




Recent Improvements on Microsoft's Trainable Text-To-Speech System - Whistler

Authors:

Xuedong Huang, Microsoft (U.S.A.)
Alex Acero, Microsoft (U.S.A.)
Hsiao-Wuen Hon, Microsoft (U.S.A.)
Yun-Cheng Ju, Microsoft (U.S.A.)
Jingsong Liu, Microsoft (U.S.A.)
Scott Meredith, Microsoft (U.S.A.)
Mike Plumpe, Microsoft (U.S.A.)

Volume 2, Page 959

Abstract:

The Whistler text-to-speech engine was designed so that its model parameters can be constructed automatically from training data. This paper focuses on recent improvements in prosody and acoustic modeling, which are all derived through the use of probabilistic learning methods. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style. The Whistler TTS engine supports the Microsoft Speech API and requires less than 3 MB of working memory.

ic970959.pdf




Automatic Generation of Speech Synthesis Units Based on Closed Loop Training

Authors:

Takehiko Kagoshima, Toshiba R&D Center (Japan)
Masami Akamine, Toshiba R&D Center (Japan)

Volume 2, Page 963

Abstract:

This paper proposes a new method for automatically generating speech synthesis units. A small set of synthesis units is selected from a large speech database by the proposed Closed-Loop Training (CLT) method. Because CLT is based on the evaluation and minimization of the distortion caused by the synthesis process, such as prosodic modification, the selected synthesis units are most suitable for synthesizers. In this paper, CLT is applied to a waveform-concatenation-based synthesizer whose basic unit is the CV/VC (diphone). It is shown that synthesis units can be efficiently generated by CLT from a labeled speech database with a small amount of computation. Moreover, the synthesized speech is clear and smooth even though the storage size of the waveform dictionary is small.
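The closed-loop principle can be sketched in a toy form: each candidate unit is scored by the distortion its synthesized (modified) version incurs against every database instance, and the lowest-distortion unit is kept. Vectors stand in for speech waveforms, and the "modification" is a toy time scaling; all names and values are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

# Toy closed-loop unit selection: distortion is measured *after* the
# synthesis-side modification, not on the raw units themselves.
rng = np.random.default_rng(3)
instances = rng.normal(size=(40, 16)) + np.sin(np.linspace(0, np.pi, 16))

def modify(unit, factor=2):
    # crude stand-in for prosodic modification: repeat-sample time scaling
    return np.repeat(unit, factor)

def closed_loop_cost(unit, targets):
    mod = modify(unit)
    return sum(np.sum((modify(t) - mod) ** 2) for t in targets)

costs = [closed_loop_cost(u, instances) for u in instances]
best = instances[int(np.argmin(costs))]   # unit kept for the inventory
```

Scoring through the modification step is what distinguishes this from plain clustering of the raw database: a unit that survives prosodic modification well is preferred even if it is not the most typical raw instance.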

ic970963.pdf
