Session Th4C Voice Conversion and Data Driven F0-Models

Chairperson Yoshinori Sagisaka, ATR Interpret Telecom. Res. Labs., Japan



Application-dependent prosodic models for Text-to-Speech synthesis and automatic design of learning database corpus using Genetic Algorithm

Authors: O. Boëffard and F. Emerard

France Telecom - CNET DIH/RCP, 2 avenue Pierre Marzin, 22307 Lannion - France E-mail: boeffard@lannion.cnet.fr

Volume 5 pages 2507 - 2510

ABSTRACT

Improving the quality of a Text-To-Speech synthesis system is usually treated as the arduous task of converting any text into speech. This paper reports on work carried out at CNET on building application-oriented text-to-speech systems. For the majority of vocal services, the delivered messages are strongly constrained syntactically and use a limited vocabulary. We consider that, with our system, the most promising improvements in the overall quality of the synthetic speech signal lie in the linguistic and prosodic processing. Setting aside segmental problems of the synthetic speech signal, the current prosodic patterns are judged too monotonous to support a wide variety of vocal services. The current effort therefore addresses the development of automatic systems that adapt the parameters of statistical prosodic models to a specific speaker's voice, under the constraint of a limited number of different syntactic structures. This work presents an automatic system for building "optimal" training databases used to learn the models' parameters. The problem is formulated as a set covering problem and solved using genetic algorithms. Both an objective and a subjective evaluation show the usefulness of this approach.
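As an illustration of the set covering formulation mentioned in the abstract, the following is a minimal sketch of a genetic algorithm that selects a small subset of candidate sentences whose syntactic structures together cover a target set. It is not the authors' system: the function name `ga_set_cover`, the structure labels, the cost function and all GA settings (population size, crossover and mutation scheme) are assumptions made for this example.

```python
import random

def ga_set_cover(candidates, universe, pop_size=30, generations=60, seed=0):
    """Pick a small subset of candidates whose union covers universe.

    Hypothetical sketch: needs at least 2 candidates and pop_size >= 4.
    """
    rng = random.Random(seed)
    n = len(candidates)

    def cost(mask):
        covered = set()
        for keep, units in zip(mask, candidates):
            if keep:
                covered |= units
        # Lexicographic cost: uncovered units first, then subset size.
        return (len(universe - covered), sum(mask))

    # Seed the population with the "select everything" chromosome so that
    # the best possible cover is present from the first generation.
    population = [[1] * n]
    population += [[rng.randint(0, 1) for _ in range(n)]
                   for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randint(1, n - 1)          # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)                 # single bit-flip mutation
            child[i] = 1 - child[i]
            children.append(child)
        population = survivors + children

    best = min(population, key=cost)
    return [i for i, keep in enumerate(best) if keep]

# Toy corpus: each candidate sentence is described by the (made-up)
# syntactic structures it contains.
sentences = [
    {"NP-VP", "NP-VP-PP"},
    {"NP-VP"},
    {"WH-Q", "NP-VP-PP"},
    {"IMP"},
    {"WH-Q", "IMP"},
]
structures = set().union(*sentences)
chosen = ga_set_cover(sentences, structures)
covered = set().union(*(sentences[i] for i in chosen))
```

Because the best individuals are always carried over to the next generation, the returned subset never covers fewer structures than the full corpus; how close it comes to the smallest possible subset depends on the settings and the random seed.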

A0056.pdf



COMBINATORIAL ISSUES IN TEXT-TO-SPEECH SYNTHESIS

Authors: Jan P. H. van Santen

Lucent Technologies – Bell Labs, 600 Mountain Ave., Murray Hill, NJ 07974, U.S.A. jphvs@research.bell-labs.com

Volume 5 pages 2511 - 2514

ABSTRACT

Enhanced storage capacities and new learning algorithms have increased the role of text and speech training databases in the construction of text-to-speech systems. It has become apparent, however, that learning algorithms with strong generalization capabilities (the ability to generalize from cases seen in the training database to new cases encountered during TTS operation) are not always available. This makes it important to measure and understand the degree to which a given training database covers the input domain of a text-to-speech system (usually, the entire language). The goal of this paper is to investigate the feasibility of coverage in several domains of interest for TTS. It is shown that, as a result of the combinatorics of language, coverage is typically quite disappointing. This puts a premium on the generalization capability of learning algorithms.
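To make the notion of coverage concrete, here is a small sketch that counts how many of the possible factor combinations in a toy input domain appear in a training set. The factors, their levels and the `coverage` function are hypothetical, and the ratio computed here (distinct combinations seen divided by combinations possible) is only one simple way to quantify coverage, not necessarily the measure used in the paper.

```python
from itertools import product

def coverage(training_items, factor_levels):
    """Fraction of all factor-level combinations seen in the training data."""
    domain = set(product(*factor_levels.values()))
    seen = set(training_items) & domain
    return len(seen) / len(domain)

# Hypothetical prosodic factors; the domain is their Cartesian product,
# so its size grows multiplicatively with every factor added.
factor_levels = {
    "stress": ("stressed", "unstressed"),
    "position": ("initial", "medial", "final"),
    "accent": ("accented", "deaccented"),
}
training_items = [
    ("stressed", "initial", "accented"),
    ("stressed", "initial", "accented"),    # a repeat adds no new coverage
    ("unstressed", "medial", "deaccented"),
    ("unstressed", "final", "deaccented"),
]
ratio = coverage(training_items, factor_levels)  # 3 of 12 combinations
```

Even in this tiny domain, four training items cover only a quarter of the twelve combinations, and each additional factor multiplies the size of the domain.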

A0446.pdf



AUTOMATIC CORPUS-BASED TRAINING OF RULES FOR PROSODIC GENERATION IN TEXT-TO-SPEECH

Authors: Eduardo López-Gonzalo, Jose M. Rodríguez-García, Luis Hernández-Gómez and Juan M. Villar

E.T.S.I. de Telecomunicación. Univ. Politécnica Madrid Dep. Señales, Sistemas y Radiocomunicaciones. Ciudad Universitaria. 28040-Madrid (Spain). Tel:34.1.5495700. Fax:34.1.3367350. e-mail: eduardo@gaps.ssr.upm.es

Volume 5 pages 2515 - 2518

ABSTRACT

In this paper, we discuss a methodology for automatic prosodic modeling in Text-to-Speech (TTS) systems. The proposed methodology can be seen as a data-driven strategy to train prosodic rules from the automatic analysis of a specific text and its related speech material. Accordingly, our corpus-based training procedure relies on an automatic linguistic analysis of the text and on an acoustic analysis of the speech using automatic speech recognition techniques. In addition to deriving prosodic rules automatically, our method can easily be extended to obtain specific grammar categories suitable for accurate prosodic modeling of specific tasks. Evaluation results over two different applications and speaker styles reveal that the proposed automatic prosodic generation procedure provides a noticeable increase in naturalness when adapting a TTS system to a new speaker and a new speaking style.

A0784.pdf



Hidden Markov Model Based Voice Conversion Using Dynamic Characteristics of Speaker

Authors: Eun-Kyoung Kim, Sangho Lee and Yung-Hwan Oh

Department of Computer Science Korea Advanced Institute of Science and Technology 373-1, Kusung-dong, Yusong-gu, Taejon, KOREA. E-mail: ekkim@bulsai.kaist.ac.kr

Volume 5 pages 2519 - 2522

ABSTRACT

This paper proposes a new voice conversion technique that uses hidden Markov models (HMMs) to model a speaker's dynamic characteristics. The basic idea of this technique is to use the state transition probabilities as the speaker's dynamic characteristics and to attach a conversion rule to each HMM state. Two methods are developed for creating the state-dependent conversion rules: one uses the source speaker's spectral dynamics and the other uses the target speaker's. The experimental results show that the proposed methods perform better than the conventional VQ method in both objective and subjective tests. A comparison of the two methods shows that the one using the target speaker's dynamics is superior in the listening test and produces more natural sound.

A0891.pdf



SPEAKER INTERPOLATION IN HMM-BASED SPEECH SYNTHESIS SYSTEM

Authors: Takayoshi Yoshimura (1), Takashi Masuko (2), Keiichi Tokuda (1), Takao Kobayashi (2) and Tadashi Kitamura (1)

(1) Department of Computer Science, Nagoya Institute of Technology, Nagoya 466, Japan; (2) Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama 226, Japan. E-mail: yossie@ics.nitech.ac.jp, masuko@pi.titech.ac.jp, tokuda@ics.nitech.ac.jp, tkobayas@pi.titech.ac.jp, kitamura@ics.nitech.ac.jp

Volume 5 pages 2523 - 2526

ABSTRACT

This paper describes an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system using speaker interpolation. An HMM interpolation technique is derived from a probabilistic distance measure for HMMs and used to synthesize speech with the characteristics of an untrained speaker by interpolating HMM parameters among several representative speakers' HMM sets. The results of subjective experiments show that the characteristics of synthesized speech can be changed gradually from one speaker's to another's by changing the interpolation ratio.
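The idea of interpolating HMM parameters among representative speakers can be pictured with a small sketch. The code below forms a weighted combination of the means and variances of one Gaussian output distribution per speaker. This plain linear combination is an assumption chosen for simplicity; the rule the paper derives from its probabilistic distance measure may differ. The function name and the numbers are made up for illustration.

```python
def interpolate_gaussians(means, variances, weights):
    """Linearly interpolate diagonal-Gaussian parameters across speakers.

    means, variances: one vector (list of floats) per speaker.
    weights: interpolation ratios, one per speaker, summing to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    dim = len(means[0])
    mean = [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dim)]
    var = [sum(w * v[d] for w, v in zip(weights, variances)) for d in range(dim)]
    return mean, var

# Hypothetical state output distributions for two representative speakers.
speaker_a = ([1.0, 2.0], [0.5, 0.5])   # (mean vector, variance vector)
speaker_b = ([3.0, 6.0], [1.5, 0.5])

# A ratio of 0.75 : 0.25 yields parameters closer to speaker A.
mean, var = interpolate_gaussians(
    [speaker_a[0], speaker_b[0]],
    [speaker_a[1], speaker_b[1]],
    [0.75, 0.25],
)
```

Sweeping the first weight from 1 to 0 moves the interpolated parameters smoothly from speaker A's values to speaker B's, which mirrors the gradual change in voice characteristics reported in the abstract.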

A1015.pdf

Recordings



DESIGNING A SPEAKER ADAPTABLE FORMANT-BASED TEXT-TO-SPEECH SYSTEM

Authors: V. Darsinos, D. Galanis and G. Kokkinakis

Wire Communications Laboratory University of Patras, 26500 Patras, Greece

Volume 5 pages 2527 - 2530

ABSTRACT

First results are presented from efforts to build a formant-based Text-to-Speech system capable of changing its characteristics to imitate a specific speaker's voice. The design procedure is based on the automatic analysis of phonetically labelled utterances of the speaker, from which formant values, voice source characteristics and coarticulation rules are extracted automatically. All of these parameters are needed to control the synthesizer. The results of preliminary listening tests are encouraging, indicating that the system can serve as an efficient tool for the automatic analysis of speaker voice characteristics and for speaker imitation.

A1113.pdf
