Session W1B: Speech Production Modelling

Chairperson: Michael B. Riley, AT&T Labs, USA



VOICE CONVERSION BY CODEBOOK MAPPING OF LINE SPECTRAL FREQUENCIES AND EXCITATION SPECTRUM

Authors: Levent M. Arslan and David Talkin

Entropic Research Laboratory, Washington, DC, 20003

Volume 3 pages 1347 - 1350

ABSTRACT

This paper presents a new scheme for developing a voice conversion system that modifies the utterance of a source speaker to sound like speech from a target speaker. We refer to the method as Speaker Transformation Algorithm using Segmental Codebooks (STASC). Two new methods are described to perform the transformation of vocal tract and glottal excitation characteristics across speakers. In addition, the source speaker's general prosodic characteristics are modified using time-scale and pitch-scale modification algorithms. Informal listening tests suggest that convincing voice conversion is achieved while maintaining high speech quality. The performance of the proposed system is also evaluated on a standard Gaussian mixture model based speaker identification system, and the results show that the transformed speech is assigned higher likelihood by the target speaker model than by the source speaker model.
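As a rough illustration of the weighted codebook mapping idea described in the abstract (a minimal sketch, not the authors' STASC implementation; the exponential weighting and the `gamma` sharpening factor are assumptions made here for illustration), one frame of source line spectral frequencies can be mapped towards the target speaker as follows:

```python
import numpy as np

def convert_lsf(source_lsf, source_codebook, target_codebook, gamma=2.0):
    """Map one frame of source-speaker LSFs towards the target speaker.

    source_lsf      : (P,)   line spectral frequencies of the current frame
    source_codebook : (K, P) source-speaker LSF centroids
    target_codebook : (K, P) target-speaker centroids paired with the source ones
    gamma           : weight-sharpening factor (illustrative, not from the paper)
    """
    # Distance of the frame to every source centroid.
    d = np.linalg.norm(source_codebook - source_lsf, axis=1)
    # Closer centroids get larger weights; normalise the weights to sum to one.
    w = np.exp(-gamma * d)
    w /= w.sum()
    # The converted frame is the same weighted combination of target centroids.
    return w @ target_codebook
```

In the same spirit, the weights computed from the LSF codebook could be reused on paired excitation-spectrum codebooks, which is how the abstract describes handling the glottal excitation; the details of that step are not reproduced here.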

A0030.pdf



OPTIMAL STATE DEPENDENT SPECTRAL REPRESENTATION FOR HMM MODELING: A NEW THEORETICAL FRAMEWORK

Authors: C. Mokbel*, G. Gravier** and G. Chollet**

*France Telecom - CNET - DIH/RCP, 2 av. Pierre Marzin, 22307 Lannion, France; **ENST, Dept. Signal, 46 rue Barrault, 75634 Paris cedex 13, France. *Tel.: +33 2 96 05 39 28, Fax: +33 2 96 05 35 30, E-mail: mokbel@lannion.cnet.fr

Volume 3 pages 1351 - 1354

ABSTRACT

In this paper we propose a theoretical framework that extends the classical continuous density HMM so that different spectral representations can be used depending on the state. We stress the need for a reference space and for spectral transformations between the model spectral representation spaces and the reference space. We show that this framework makes it possible to obtain more precise pdfs in the reference space. Preliminary speech recognition experiments with two spectral representations, MFCC and linear-frequency-scale cepstral coefficients, show no improvement; however, they indicate that the choice of the spectral representations is crucial and that the determination of the space transformations is a complex problem.
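As a loose sketch of what a state-dependent spectral representation might look like in practice (the data structures and names are hypothetical; the paper's framework additionally ties the state spaces together through transformations to and from the reference space), each state can carry its own feature transform and have its pdf evaluated in that space:

```python
import numpy as np

def state_log_likelihood(ref_frame, state):
    """Score one frame against a state that owns its spectral representation.

    ref_frame : observation expressed in a common reference space
                (e.g. a log power spectrum) from which any representation
                can be derived
    state     : hypothetical dict holding
                  'to_state_space' : callable, reference space -> state space
                                     (e.g. MFCC or linear-frequency cepstra)
                  'mean', 'inv_cov', 'log_det' : Gaussian parameters
                                                 estimated in that space
    """
    x = state['to_state_space'](ref_frame)          # per-state representation
    d = x - state['mean']
    return -0.5 * (len(x) * np.log(2.0 * np.pi)
                   + state['log_det']
                   + d @ state['inv_cov'] @ d)
```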

A0130.pdf



SPEECH ANALYSIS AND SYNTHESIS USING AN AM-FM MODULATION MODEL

Authors: Alexandros Potamianos (1) and Petros Maragos (2)

(1) AT&T Labs-Research, 180 Park Ave, P.O. Box 971, Florham Park, NJ 07932-0971, U.S.A. (2) Institute for Language & Speech Processing, Margari 22, Athens 11525, Greece and School of E.C.E, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Volume 3 pages 1355 - 1358

ABSTRACT

In this paper, the AM-FM modulation model is applied to speech analysis, synthesis and coding. A multiband demodulation pitch tracking algorithm is proposed that produces smooth and accurate fundamental frequency contours. The AM-FM modulation vocoder represents speech as the sum of resonance signals modeled by their amplitude envelope and instantaneous frequency signals. Efficient modeling and coding (at 4.8-9.6 kbits/sec) algorithms are proposed for the amplitude envelope and instantaneous frequency signals. Amplitude and frequency modulations of the speech resonances are shown to be perceptually important for natural speech synthesis.
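A minimal sketch of the demodulation step underlying this kind of AM-FM analysis is given below, using the discrete energy separation algorithm (in its DESA-2 form) on a signal assumed to be already band-pass filtered around one resonance; the paper's multiband pitch tracker and vocoder involve considerably more than this.

```python
import numpy as np

def teager(x):
    # Discrete Teager-Kaiser energy operator: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, fs):
    """Estimate instantaneous amplitude and frequency of a band-passed
    speech resonance with the energy separation algorithm (DESA-2 sketch)."""
    y = x[2:] - x[:-2]                 # symmetric difference, aligned with x[1:-1]
    psi_x = np.maximum(teager(x)[1:-1], 1e-12)   # Psi[x](n), n = 2 .. N-3
    psi_y = np.maximum(teager(y), 1e-12)         # Psi[y](n) for the same n
    ratio = np.clip(psi_y / (2.0 * psi_x), 0.0, 2.0)
    inst_freq = 0.5 * np.arccos(1.0 - ratio) * fs / (2.0 * np.pi)   # Hz
    inst_amp = 2.0 * psi_x / np.sqrt(psi_y)
    return inst_amp, inst_freq
```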

A0184.pdf



SYNTHESIS OF FRICATIVE CONSONANTS BY AUDIOVISUAL-TO-ARTICULATORY INVERSION

Authors: Khaled Mawass, Pierre Badin & Gérard Bailly

Institut de la Communication Parlée, 46 Av. Félix Viallet, F-38031 Grenoble cedex 01, France. Tel.: +33 (0)4 76.57.48.26, Fax: +33 (0)4 76.57.47.10, E-mail: mawass@icp.grenet.fr

Volume 3 pages 1359 - 1362

ABSTRACT

We present results of audiovisual-to-articulatory inversion for French fricatives embedded in VCVs. The inversion technique is evaluated using both experimental and synthetic data. The final synthesis is assessed by a perceptual categorisation test, in which the synthetic stimuli obtain scores similar to those of the natural ones.

A0287.pdf

Recordings



NEW TRANSFORMATIONS OF CEPSTRAL PARAMETERS FOR AUTOMATIC VOCAL TRACT LENGTH NORMALIZATION IN SPEECH RECOGNITION

Authors: Tom Claes , Ioannis Dologlou , Louis ten Bosch , Dirk Van Compernolle

K.U.Leuven - E.S.A.T., Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium; Lernout & Hauspie Speech Products, Koning Albert I laan 64, B-1780 Wemmel, Belgium

Volume 3 pages 1363 - 1366

ABSTRACT

This paper proposes a method to transform acoustic models (HMM Gaussian mixtures) that have been trained on a certain group of speakers for use on speech from a different group of speakers. Cepstral features are transformed on the basis of assumptions regarding the difference in vocal tract length (VTL) between the groups of speakers (VTL normalisation, VTLN). Firstly, the VTL of these groups is estimated from the average third formant F3. Secondly, the linear acoustic theory of speech production is applied to warp the spectral characteristics of the existing models so as to match the incoming speech. The mapping is composed of successive non-linear submappings. By linearizing it locally, a linear approximation is obtained that is accurate as long as the warping is reasonably small. The method has been tested on the TI digits database, which contains adult and child speech consisting of isolated digits and digit strings of different lengths. When models trained on adults are transformed and used to recognise child speech, the word error rate is reduced by more than a factor of 2 compared to the non-transformed case.
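To make the idea of a linear approximation to the warping concrete, the sketch below builds a matrix that maps cepstral coefficients under a simple frequency warp; the warp shape, the cosine-expansion convention and the function name are illustrative assumptions, not the paper's exact derivation.

```python
import numpy as np

def vtln_cepstral_matrix(n_cep=13, alpha=1.1, n_freq=512):
    """Build a matrix M so that warped cepstra are approximately M @ c.

    The log spectrum is written as a cosine expansion of the cepstra,
    S(w) = sum_k c_k * cos(k*w), which is linear in c; resampling S on a
    warped frequency axis and re-fitting cepstra is therefore also a
    linear operation. The simple warp w -> min(alpha*w, pi) stands in
    for a vocal-tract-length warp (alpha > 1 for a shorter tract).
    """
    k = np.arange(n_cep)
    omega = np.linspace(0.0, np.pi, n_freq)           # uniform analysis grid
    warped = np.minimum(alpha * omega, np.pi)         # illustrative VTL warp

    basis_uniform = np.cos(np.outer(omega, k))        # S  = basis_uniform @ c
    basis_warped = np.cos(np.outer(warped, k))        # S' = basis_warped  @ c
    # Least-squares re-fit of cepstra to the warped log spectrum.
    return np.linalg.pinv(basis_uniform) @ basis_warped
```

Applied to the Gaussian mean vectors (and as M Sigma M^T to the covariances), such a matrix adapts already-trained models instead of re-extracting features, which is the spirit of the model transformation described above.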

A0364.pdf



A MULTIRESOLUTIONALLY ORIENTED APPROACH FOR DETERMINATION OF CEPSTRAL FEATURES IN SPEECH RECOGNITION

Authors: S. Dobrisek & F. Mihelic & N. Pavesic

University of Ljubljana, Faculty of Electrical Engineering, Trzaska cesta 25, SI-1000 Ljubljana, Slovenia. E-mail: simond@fe.uni-lj.si

Volume 3 pages 1367 - 1370

ABSTRACT

This paper presents an effort to provide a more efficient speech signal representation intended to be incorporated into an automatic speech recognition system. Modified cepstral coefficients, derived from a multiresolution auditory spectrum, are proposed. The multiresolution spectrum is obtained using sliding single-point discrete Fourier transformations. It is shown that the obtained spectrum values are similar to the results of a nonuniform filtering operation. The presented cepstral features are evaluated by introducing them into a simple phone recognition system.
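A rough sketch of how cepstral features could be derived from single-point DFTs whose window length varies with the analysis frequency is shown below; the frequency grid, the window-length rule and the final DCT are assumptions made for illustration, not the authors' settings.

```python
import numpy as np
from scipy.fftpack import dct

def multires_cepstra(frame, fs, freqs, n_cep=12):
    """Cepstral features from a multiresolution spectrum (sketch).

    For each analysis frequency a single DFT point is evaluated over a
    window whose length shrinks with frequency, so low frequencies are
    analysed with finer frequency resolution and high frequencies with
    finer time resolution.
    """
    centre = len(frame) // 2
    spectrum = []
    for f in freqs:
        # Window of roughly a few periods of f, capped by the frame length.
        n = int(min(len(frame), max(32, 8 * fs / f)))
        start = max(0, centre - n // 2)
        seg = frame[start:start + n] * np.hanning(n)
        # Single-point DFT at frequency f.
        bin_val = np.sum(seg * np.exp(-2j * np.pi * f * np.arange(n) / fs))
        spectrum.append(np.abs(bin_val) / n)
    log_spec = np.log(np.maximum(np.array(spectrum), 1e-10))
    return dct(log_spec, norm='ortho')[:n_cep]

# Example: 30 log-spaced analysis frequencies between 100 Hz and 7 kHz
# feats = multires_cepstra(frame, 16000, np.geomspace(100, 7000, 30))
```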

A1354.pdf
