Takashi Masuko, P&I Lab., Tokyo Institute of Technology (Japan)
Keiichi Tokuda, Nagoya Institute of Technology (Japan)
Takao Kobayashi, P&I Lab., Tokyo Institute of Technology (Japan)
Satoshi Imai, P&I Lab., Tokyo Institute of Technology (Japan)
In this paper, we describe an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system. Since our speech synthesis system uses phoneme HMMs as speech units, voice characteristics conversion is achieved by changing the HMM parameters appropriately. To transform the voice characteristics of the synthesized speech to those of a target speaker, we applied the MAP/VFS algorithm to the phoneme HMMs. Using 5 or 8 sentences as adaptation data, speech samples synthesized from a set of adapted tied triphone HMMs, which have approximately 2,000 distributions, were judged to be closer to the target speaker in 79.7% or 90.6% of trials, respectively, in an ABX listening test.
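As an illustration of the adaptation step at the core of this approach, the following is a minimal sketch of the standard MAP update of Gaussian mean vectors, with a simplified stand-in for the vector field smoothing (VFS) step that transfers mean shifts to unseen distributions; function names, the prior weight tau, and the neighbour-averaging rule are illustrative assumptions, not the paper's exact algorithm.

# Minimal sketch of MAP adaptation of Gaussian means, plus a simplified
# VFS-like smoothing step.  The MAP update is the standard formula; the
# smoothing is an illustrative stand-in for vector field smoothing.
import numpy as np

def map_adapt_mean(mu0, frames, gammas, tau=10.0):
    """MAP estimate of a Gaussian mean from speaker-specific frames.

    mu0    : prior (speaker-independent) mean, shape (D,)
    frames : adaptation observations, shape (T, D)
    gammas : occupation probabilities of this Gaussian, shape (T,)
    tau    : prior weight controlling how strongly mu0 is trusted
    """
    occ = gammas.sum()
    return (tau * mu0 + gammas @ frames) / (tau + occ)

def vfs_smooth(mu0_all, mu_adapted, seen, k=3):
    """Transfer mean-shift vectors from adapted (seen) Gaussians to unseen
    ones by averaging the shifts of the k nearest seen neighbours."""
    shifts = mu_adapted[seen] - mu0_all[seen]
    out = mu_adapted.copy()
    for i in np.where(~seen)[0]:
        d = np.linalg.norm(mu0_all[seen] - mu0_all[i], axis=1)
        nearest = np.argsort(d)[:k]
        out[i] = mu0_all[i] + shifts[nearest].mean(axis=0)
    return out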
Gudrun Klasmeyer, Berlin University of Technology (Germany)
It is well known that personal voice qualities differ in speakers' use of temporal structures, F0 contours, articulation precision, vocal effort, and type of phonation. Whereas temporal structures and F0 contours can be measured directly in the acoustic signal, and conclusions about articulation precision can be drawn from the formant structure, this paper focuses on vocal effort and type of phonation. These voice quality percepts are a combination of several acoustic voice quality parameters: the glottal pulse shape in the time domain (equivalently, the damping of the harmonics in the frequency domain), the spectral distribution of turbulent signal components, and voicing irregularities. An investigation of emotionally loaded speech material showed that these acoustic parameters are useful for differentiating between the emotions happiness, sadness, anger, fear, and boredom [Klasmeyer, 1996]. The perceptual importance of selected acoustic voice quality parameters is investigated in perception experiments with synthetic speech.
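One widely used acoustic correlate of phonation type and harmonic damping is H1-H2, the level difference between the first two harmonics. The sketch below measures it on a single voiced frame; the autocorrelation F0 estimate and the parameter values are illustrative simplifications (real analyses use more robust pitch trackers).

# Illustrative H1-H2 measurement on one voiced frame.  Breathier, less
# pressed phonation tends to show a larger H1-H2.  The frame is assumed
# to span several pitch periods.
import numpy as np

def h1_h2(frame, fs, f0_min=70.0, f0_max=400.0):
    frame = frame * np.hanning(len(frame))
    # crude F0 estimate from the autocorrelation peak
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    f0 = fs / (lo + np.argmax(ac[lo:hi]))
    # harmonic amplitudes read off a zero-padded spectrum
    spec = 20 * np.log10(np.abs(np.fft.rfft(frame, 8 * len(frame))) + 1e-12)
    freqs = np.fft.rfftfreq(8 * len(frame), 1 / fs)
    h1 = spec[np.argmin(np.abs(freqs - f0))]
    h2 = spec[np.argmin(np.abs(freqs - 2 * f0))]
    return h1 - h2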
Hani Yehia, ATRI, Kyoto (Japan)
Mark Tiede, ATR HIP, Kyoto (Japan)
In this paper, 24 three-dimensional (3D) vocal-tract (VT) shapes extracted from MRI data are used to derive a parametric model of the vocal tract. The method is as follows: first, each 3D VT shape is sampled using a semi-cylindrical grid whose position is determined by reference points based on VT anatomy. Next, the VT projections onto each plane of the grid are represented by their first two principal components, obtained via principal component analysis (PCA). PCA is then applied again to parametrize the sequences of coefficients that represent the sections along the tract. It was verified that the first four components explain about 90% of the total variance of the observed shapes. Following this procedure, 3D VT shapes are approximated by linear combinations of four 3D basis functions. Finally, it is shown that the four parameters of the model can be estimated from VT midsagittal profiles.
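The two-stage PCA can be sketched as follows, assuming each VT shape has already been sampled on the grid so that every grid plane contributes a fixed-length contour vector; the array shapes and random placeholder data are illustrative, not the paper's.

# Sketch of the two-stage PCA: stage 1 reduces each grid-plane section to
# two coefficients; stage 2 reduces the per-shape coefficient sequences to
# four model parameters.
import numpy as np

def pca(X, n_components):
    """Rows of X are observations; returns mean, basis, coefficients,
    and the fraction of variance explained."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    var = (s**2)[:n_components].sum() / (s**2).sum()
    return mu, Vt[:n_components], (X - mu) @ Vt[:n_components].T, var

# stage 1: represent each section by its first two principal components
# sections: (n_shapes * n_planes, n_contour_points) -- placeholder data
sections = np.random.randn(24 * 30, 40)
_, basis1, coefs1, _ = pca(sections, 2)

# stage 2: PCA over the per-shape sequences of section coefficients
seqs = coefs1.reshape(24, -1)                  # (n_shapes, n_planes * 2)
_, basis2, coefs2, var4 = pca(seqs, 4)
print(f"variance explained by 4 components: {var4:.2f}")   # ~0.9 in the paper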
Rashid Ansari, University of Illinois, Chicago (U.S.A.)
In this paper, a new method for modifying the pitch of units of recorded female speech is described. The method was developed to overcome limitations in an otherwise promising technique called Residual-Excited Linear Prediction (RELP). In the new method, the stored speech unit is processed with a suitably shaped time-varying filter. The filtered signal is modified according to the required change in fundamental frequency and then applied to the inverse of the above-mentioned prefilter. Based on observations of the spectra of multiple recordings of the same speech unit at different pitch frequencies, the magnitude response of the inverse filter was chosen to have a significantly less peaky structure than is typically obtained with LPC. Speech modifications using this method were found to be superior in quality to those obtained by RELP, while at the same time being less sensitive than RELP to changes in pitch marking.
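The filter / modify / inverse-filter pipeline can be sketched as below. The "less peaky" prefilter is approximated here by bandwidth-expanded LPC (poles scaled toward the origin), the pitch change is a bare-bones PSOLA-style overlap-add assuming known pitch marks, and a single time-invariant filter stands in for the paper's time-varying one; all of this is an illustrative assumption, not the authors' design.

# Sketch: soft analysis filter -> epoch re-spacing -> inverse filter.
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """LPC via the autocorrelation method (Toeplitz normal equations)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:-1], r[1:])
    return np.concatenate(([1.0], -a))

def modify_pitch(x, marks, factor, order=18, bw=0.9):
    marks = np.asarray(marks)
    a = lpc(x, order)
    a_soft = a * bw ** np.arange(order + 1)   # bandwidth expansion: less peaky
    filtered = lfilter(a_soft, [1.0], x)
    # naive re-spacing of two-period windows; raising F0 (factor > 1) also
    # shortens duration here -- full PSOLA would repeat/drop segments
    T = int(np.mean(np.diff(marks)))
    new_marks = (marks / factor).astype(int)
    y = np.zeros(int(len(x) / factor) + 2 * T)
    for m_old, m_new in zip(marks[1:-1], new_marks[1:-1]):
        if m_old - T < 0 or m_old + T > len(filtered) or m_new - T < 0 \
                or m_new + T > len(y):
            continue
        y[m_new - T:m_new + T] += filtered[m_old - T:m_old + T] * np.hanning(2 * T)
    return lfilter([1.0], a_soft, y)          # inverse of the prefilter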
Helen M. Hanson, Sensimetrics Corp. (U.S.A.)
With the goal of synthesizing natural-sounding speech based on higher-level parameters, sources of vowel amplitude variation were studied for sentences having different prosodic patterns. Previous theoretical and experimental work has shown that sound pressure level (SPL) is proportional to subglottal pressure ($P_s$) on a log scale during production of sustained vowels. The current work is based on acoustic sound pressure signals and estimated $P_s$ signals recorded during the production of reiterant speech, which is closer to natural speech production and includes prosodic effects. The results show individual, and perhaps gender, differences in the relationship between SPL and $P_s$, and in the degree of vowel amplitude contrast between full and reduced vowels. However, a general trend among speakers is to use subglottal pressure to control vowel amplitude at the sentence level and at main prominences, and to use adjustments of glottal configuration to control vowel amplitude variations for reduced and non-nuclear full vowels. These results have implications not only for articulatory speech synthesis, but also for automatic speech recognition systems.
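The proportionality referred to above can be written as a per-speaker regression; this form is a sketch, not the paper's fitted values. $\mathrm{SPL}_0$, $P_{s,0}$, and the slope $k$ all vary by speaker, and values reported in the literature for sustained vowels put the slope on the order of 8-9 dB per doubling of $P_s$:

$$\mathrm{SPL} = \mathrm{SPL}_0 + k \,\log_{10}\!\left(\frac{P_s}{P_{s,0}}\right)$$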
Dong Bing Wei, University of Liverpool (U.K.)
Colin C. Goodyear, University of Liverpool (U.K.)
A parametric vocal tract model and a two-dimensional articulatory parametric subspace for a female voice are presented. The parameters of the model, which determine the vocal tract shape, can be found uniquely for VV transitions by mapping directly from F1 and F2 onto this subspace, while a modified technique involving F3 is available for voiced VC and CV diphones. The area functions of the vocal tract generated by these parameters are used to drive a time-domain synthesiser. Female speech, copied from either male or female natural speech, may thus be synthesised.
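One simple way to realise such a direct (F1, F2) to (p1, p2) mapping is to tabulate the formants produced at each point of a grid over the 2-D subspace once, then invert by nearest-neighbour lookup, as sketched below; `formants_from_params` stands in for the model plus tract acoustics and is a hypothetical placeholder.

# Sketch of inverting a 2-D articulatory subspace from measured formants.
import numpy as np

def build_table(formants_from_params, n=50):
    p1, p2 = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    grid = np.column_stack([p1.ravel(), p2.ravel()])
    # formants_from_params(p) -> (F1, F2); placeholder for the real model
    table = np.array([formants_from_params(p) for p in grid])
    return grid, table

def invert(f1, f2, grid, table):
    # normalise so F1 and F2 contribute comparably to the distance
    d = ((table - [f1, f2]) / table.std(axis=0)) ** 2
    return grid[np.argmin(d.sum(axis=1))]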
Andrew Richard Greenwood, JMU (U.K.)
Two different parametric models of the vocal tract have been developed. These have been used to obtain area functions for use in an articulatory synthesiser based on the Kelly-Lochbaum model. Random sampling of the geometric space spanned by the model has been performed to obtain a codebook for use in spectral copy synthesis. A dynamic programming search of this codebook produces intelligible synthetic speech, but the overall quality is limited by the density of codebook entries in articulatory space. To increase the coverage without significantly increasing the codebook size, a method of generating several small codebooks, each of which covers a small region of acoustic space, has been developed. By using codebooks which map the regions of acoustic space defined by voiced diphones, it has been possible to improve the quality of the synthetic speech significantly.
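A dynamic programming search of this kind can be sketched as a Viterbi-style pass in which each frame picks a codebook entry trading off spectral match against a transition cost that discourages articulatory jumps; the cost definitions and the weight below are illustrative assumptions, not the paper's.

# Sketch of a DP (Viterbi-style) codebook search for copy synthesis.
import numpy as np

def dp_codebook_search(frames, cb_spectra, cb_params, w=1.0):
    """frames: (T, D) target spectra; cb_spectra: (K, D); cb_params: (K, P)."""
    T, K = len(frames), len(cb_spectra)
    local = ((frames[:, None, :] - cb_spectra[None]) ** 2).sum(-1)        # (T, K)
    trans = w * ((cb_params[:, None, :] - cb_params[None]) ** 2).sum(-1)  # (K, K)
    cost = local[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + trans          # previous state x next state
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + local[t]
    path = [int(cost.argmin())]                # backtrack the best sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]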
David T. Chappell, Duke University (U.S.A.)
John H.L. Hansen, Duke University (U.S.A.)
This paper describes a new auditory-based distance measure intended for use in a concatenated synthesis technique wherein time- and frequency-domain characteristics are used to perform natural-sounding speaker synthesis. Whereas most concatenation systems use large databases (often more than 100,000 units), we begin from a small, limited database (approximately 400 units) and use a new spectral distortion measure to aid in the selection of phones for optimal concatenation. At the transition between speech segments, the new auditory-based distance metric assesses perceived discontinuities in the frequency domain. The distortion measure, which employs the Carney auditory model, is used to select phones which minimize the perceived distortion between concatenated segments. Moreover, time- and frequency-domain methods can shape the prosodic and spectral characteristics of each speech segment. The final results demonstrate improved performance over standard concatenation methods applied to small databases.
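The idea of a perceptual join cost can be sketched as below. The paper uses the Carney auditory model as its front end; here a simple mel filterbank stands in for it, so this is a generic illustration of a frequency-domain discontinuity measure, not the authors' metric.

# Sketch of a join cost: distance between auditory-like band energies at
# the boundary frames of two units to be concatenated.
import numpy as np

def mel_fbank_energies(frame, fs, n_bands=24):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    mel = 2595 * np.log10(1 + freqs / 700)
    edges = np.linspace(mel.min(), mel.max(), n_bands + 2)
    energies = np.array([spec[(mel >= lo) & (mel < hi)].sum() + 1e-12
                         for lo, hi in zip(edges[:-2], edges[2:])])
    return np.log(energies)

def join_cost(unit_a, unit_b, fs, frame_len=512):
    """Distance between the last frame of unit_a and first frame of unit_b."""
    ea = mel_fbank_energies(unit_a[-frame_len:], fs)
    eb = mel_fbank_energies(unit_b[:frame_len], fs)
    return np.linalg.norm(ea - eb)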
Douglas Nelson, Department of Defense (U.S.A.)
A new method for generating speech spectrograms is presented. The algorithm is based on an autocorrelation function whose parameters are chosen to provide processing gain and formant resolution while minimizing pitch artifacts in the spectrum. Crisp formants are produced, and the power ratio of the formants can be adjusted by pre-filtering the data. The process is functionally equivalent to a time-smoothed, windowed Wigner distribution, in which the cross-terms normally associated with the Wigner distribution are greatly attenuated by the smoothing operation.
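The smoothed-autocorrelation construction can be sketched as follows: averaging the local autocorrelation over neighbouring frames before the FFT suppresses pitch-synchronous fluctuation while keeping formant resolution. Window, lag, and smoothing lengths below are illustrative choices, not the paper's parameters.

# Sketch of a spectrogram from time-smoothed local autocorrelations.
import numpy as np

def acf_spectrogram(x, frame_len=256, hop=64, n_smooth=5, n_lags=128):
    w = np.hanning(frame_len)
    acs = []
    for start in range(0, len(x) - frame_len, hop):
        f = x[start:start + frame_len] * w
        ac = np.correlate(f, f, mode="full")[frame_len - 1:frame_len - 1 + n_lags]
        acs.append(ac)
    acs = np.array(acs)                        # (n_frames, n_lags)
    # time smoothing: moving average of each lag track across frames
    kernel = np.ones(n_smooth) / n_smooth
    smooth = np.apply_along_axis(lambda c: np.convolve(c, kernel, "same"), 0, acs)
    lag_win = np.hanning(2 * n_lags)[n_lags:]  # taper along the lag axis
    return 10 * np.log10(np.abs(np.fft.rfft(smooth * lag_win, axis=1)) + 1e-12)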
Steven Greenberg, ICSI / UC Berkeley (U.S.A.)
Brian E.D. Kingsbury, ICSI / UC Berkeley (U.S.A.)
Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
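A minimal version of such a representation can be sketched as below: pass the signal through a bank of critical-band-like filters, extract each band's envelope, and keep only the low-frequency modulations around the syllabic rate. The band spacing and cutoff frequencies are illustrative choices, not the authors' exact front end.

# Sketch of a modulation spectrogram: band envelopes, low-pass filtered
# to retain the slow (roughly 0-8 Hz) modulations.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def modulation_spectrogram(x, fs, n_bands=15, mod_cutoff=8.0):
    # roughly log-spaced bands between 100 Hz and 0.45 * fs
    edges = np.geomspace(100, 0.45 * fs, n_bands + 1)
    env_sos = butter(4, mod_cutoff, "low", fs=fs, output="sos")
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], "bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfilt(band_sos, x)))   # band envelope
        rows.append(sosfilt(env_sos, env))            # keep slow modulations
    return np.array(rows)                             # (n_bands, len(x))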
François Pellegrino, IRIT (France)
Régine André-Obrecht, IRIT (France)
This paper presents our work on vowel system detection as part of a project on Automatic Language Identification using phonological typologies. We have developed a vowel detection algorithm that is based on spectral analysis of the acoustic signal and requires no learning stage. It has been tested with two telephone speech corpora:
- with a French corpus provided by CNET, 7.4% of detections are false alarms, while about 25% of the vowels present in the signal are missed;
- experiments with 5 languages of the OGI_TS corpus yield 88.1% correct detection and about 15% non-detection.
We also present the LBG-Rissanen vector quantization (VQ) algorithm that we use for vowel system modeling. Preliminary experiments are reported.
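The codebook design loop at the heart of LBG can be sketched as below: start from the global centroid, split each codeword by a small perturbation, refine with k-means, and repeat. The Rissanen (MDL) criterion for choosing the final codebook size is summarised only as a comment; the splitting/refinement loop itself is the standard LBG algorithm.

# Sketch of LBG codebook design (n_codewords assumed a power of two).
import numpy as np

def lbg(X, n_codewords, n_iter=20, eps=1e-3):
    codebook = X.mean(axis=0, keepdims=True)
    while len(codebook) < n_codewords:
        # split every codeword into a perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                # k-means refinement
            d = ((X[:, None] - codebook[None]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            for k in range(len(codebook)):
                if (labels == k).any():
                    codebook[k] = X[labels == k].mean(axis=0)
    # an MDL-style (Rissanen) rule would compare distortion plus codebook
    # description length across sizes and keep the size minimising their sum
    return codebook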
David van Kuijk, Nijmegen University (The Netherlands)
Louis Boves, Nijmegen University (The Netherlands)
In this paper, we investigate acoustic differences between vowels in syllables that do or do not carry lexical stress. The speech material on which the investigation is based differs from the type of material used in previous research: we used phonetically rich sentences from the Dutch POLYPHONE corpus. We briefly discuss the definition of the linguistic feature `lexical stress' and its possible impact on the phonetic realization. We then describe the experiments that were carried out and present the results. Although most of the duration, energy, and spectral tilt features used in the investigation show statistically significant differences between the population means for stressed and unstressed vowels, the distributions overlap to such an extent that automatic detection of stressed and unstressed syllables yields accuracy scores of not much more than 65%. It is argued that this is due to the large variety of ways in which the abstract linguistic feature `lexical stress' is realized in the acoustic speech signal.
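A detector built on the three feature families named above might look like the sketch below: per-vowel duration, energy, and a crude spectral-tilt measure, fed to a two-class Gaussian classifier. The feature definitions and the classifier are illustrative assumptions, not the paper's setup.

# Sketch: duration / energy / spectral-tilt features + Gaussian classifier.
import numpy as np

def vowel_features(seg, fs):
    duration = len(seg) / fs
    energy = 10 * np.log10(np.mean(seg ** 2) + 1e-12)
    spec = np.abs(np.fft.rfft(seg * np.hanning(len(seg)))) ** 2
    freqs = np.fft.rfftfreq(len(seg), 1 / fs)
    # tilt proxy: low-band vs high-band energy ratio (split at 1 kHz)
    tilt = 10 * np.log10(spec[freqs < 1000].sum() /
                         (spec[freqs >= 1000].sum() + 1e-12))
    return np.array([duration, energy, tilt])

class GaussianStressDetector:
    def fit(self, F, y):                    # F: (N, 3) features, y: 1 = stressed
        self.mu = [F[y == c].mean(0) for c in (0, 1)]
        self.var = F.var(0) + 1e-6          # shared diagonal covariance
        return self

    def predict(self, F):
        ll = [-(((F - m) ** 2) / self.var).sum(1) for m in self.mu]
        return (ll[1] > ll[0]).astype(int)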
Minsheng Liu, University of Frankfurt (Germany)
Arild Lacroix, University of Frankfurt (Germany)
This paper presents a pole-zero model of fricative sounds based on a multi-tube acoustic model. The model consists of front and back cavities, formed by the oral tract and the pharynx, with the excitation source located at the point of constriction. The transfer function of this model, including its poles and zeros, is derived and its properties are investigated. Small losses in the vocal tract, such as viscous friction, which are important for fricative sounds, are taken into account. The results show that if the vocal tract is lossless, the numerator of the pole-zero transfer function is symmetric; including small losses removes this symmetry constraint. The method is applied, using inverse filtering and an adaptive algorithm, to the analysis of fricative sounds.
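For a noise-excited fricative, pole-zero (ARMA) estimation can be sketched via Durbin's two-step method: a high-order all-pole fit first recovers an estimate of the white excitation, and the ARMA coefficients then follow from a linear regression of the signal on its own past and the past excitation. This is one standard route; the paper's own adaptive algorithm is not spelled out in the abstract.

# Sketch of ARMA estimation for a noise-excited signal (Durbin's method).
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def ar_fit(x, order):
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:-1], r[1:])
    return np.concatenate(([1.0], -a))       # A(z) with a[0] = 1

def arma_fit(x, na, nb, n_high=50):
    e = lfilter(ar_fit(x, n_high), [1.0], x)  # estimated white excitation
    rows, rhs = [], []
    for n in range(max(na, nb), len(x)):
        past_x = -x[n - na:n][::-1]           # AR regressors
        past_e = e[n - nb:n][::-1]            # MA regressors
        rows.append(np.concatenate([past_x, past_e]))
        rhs.append(x[n] - e[n])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    a = np.concatenate(([1.0], theta[:na]))   # denominator (poles)
    b = np.concatenate(([1.0], theta[na:]))   # numerator (zeros)
    return b, a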
Thomas Wittenberg, University of Erlangen (Germany)
Patrick Mergell, University of Erlangen (Germany)
Monika Tigges, University of Erlangen (Germany)
Ulrich Eysholdt, University of Erlangen (Germany)
Semiautomatic motion-analysis software is used to extract elongation-time diagrams (trajectories) of vocal fold vibrations from digital high-speed video sequences. By combining digital image processing with biomechanical modeling, we extract characteristic parameters such as phonation onset time and pitch. A modified two-mass model of the vocal folds is employed in order to fit the main features of simulated time series to those of the extracted trajectories. By varying the model parameters, general conclusions can be drawn about laryngeal dysfunctions such as functional dysphonia. We present first results of semi-automatic motion analysis in combination with model simulations as a step towards computer-aided diagnosis of voice disorders.
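The trajectory-extraction step can be sketched as below: along one scan line crossing the glottis, threshold the dark glottal gap in each frame and record the left and right vocal-fold edge positions over time. The frame layout, the scan row, and the threshold rule are placeholder assumptions for illustration.

# Sketch: elongation-time trajectories from a high-speed video scan line.
import numpy as np

def edge_trajectories(frames, row, thresh=None):
    """frames: (T, H, W) grayscale video; row: scan-line index across glottis.
    Returns a (T, 2) array of left/right glottal edge positions in pixels."""
    lines = frames[:, row, :].astype(float)
    if thresh is None:
        thresh = lines.mean() - lines.std()   # crude darkness threshold
    traj = np.full((len(frames), 2), np.nan)
    for t, line in enumerate(lines):
        dark = np.where(line < thresh)[0]     # pixels inside the glottal gap
        if dark.size:
            traj[t] = dark[0], dark[-1]       # left and right fold edges
    return traj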