Takashi Masuko, P&I Lab., Tokyo Institute of Technology (Japan)
Keiichi Tokuda, Nagoya Institute of Technology (Japan)
Takao Kobayashi, P&I Lab., Tokyo Institute of Technology (Japan)
Satoshi Imai, P&I Lab., Tokyo Institute of Technology (Japan)
In this paper, we describe an approach to voice characteristics conversion for an HMM-based text-to-speech synthesis system. Since our speech synthesis system uses phoneme HMMs as speech units, voice characteristics conversion is achieved by changing the HMM parameters appropriately. To transform the voice characteristics of the synthesized speech to those of a target speaker, we applied the MAP/VFS algorithm to the phoneme HMMs. Using 5 or 8 sentences as adaptation data, speech samples synthesized from a set of adapted tied triphone HMMs, which have approximately 2,000 distributions, were judged to be closer to the target speaker in 79.7% or 90.6% of trials, respectively, in an ABX listening test.
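As an illustration of the adaptation step at the core of this approach, the following is a minimal sketch of the standard MAP update of Gaussian mean vectors, with a simplified stand-in for the vector field smoothing (VFS) step that transfers mean shifts to unseen distributions; function names, the prior weight tau, and the neighbour-averaging rule are illustrative assumptions, not the paper's exact algorithm.

# Minimal sketch of MAP adaptation of Gaussian means, plus a simplified
# VFS-like smoothing step.  The MAP update is the standard formula; the
# smoothing is an illustrative stand-in for vector field smoothing.
import numpy as np

def map_adapt_mean(mu0, frames, gammas, tau=10.0):
    """MAP estimate of a Gaussian mean from speaker-specific frames.

    mu0    : prior (speaker-independent) mean, shape (D,)
    frames : adaptation observations, shape (T, D)
    gammas : occupation probabilities of this Gaussian, shape (T,)
    tau    : prior weight controlling how strongly mu0 is trusted
    """
    occ = gammas.sum()
    return (tau * mu0 + gammas @ frames) / (tau + occ)

def vfs_smooth(mu0_all, mu_adapted, seen, k=3):
    """Transfer mean-shift vectors from adapted (seen) Gaussians to unseen
    ones by averaging the shifts of the k nearest seen neighbours."""
    shifts = mu_adapted[seen] - mu0_all[seen]
    out = mu_adapted.copy()
    for i in np.where(~seen)[0]:
        d = np.linalg.norm(mu0_all[seen] - mu0_all[i], axis=1)
        nearest = np.argsort(d)[:k]
        out[i] = mu0_all[i] + shifts[nearest].mean(axis=0)
    return out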
Gudrun Klasmeyer, Berlin University of Technology (Germany)
It is well known that personal voice qualities differ in speakers' use of temporal structures, F0 contours, articulation precision, vocal effort, and type of phonation. Whereas temporal structures and F0 contours can be measured directly in the acoustic signal, and conclusions about articulation precision can be drawn from the formant structure, this paper focuses on vocal effort and type of phonation. These voice quality percepts are a combination of several acoustic voice quality parameters: the glottal pulse shape in the time domain (equivalently, the damping of the harmonics in the frequency domain), the spectral distribution of turbulent signal components, and voicing irregularities. An investigation of emotionally loaded speech material showed that these acoustic parameters are useful for differentiating between the emotions happiness, sadness, anger, fear, and boredom [Klasmeyer, 1996]. The perceptual importance of selected acoustic voice quality parameters is investigated in perception experiments with synthetic speech.
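One widely used acoustic correlate of phonation type and harmonic damping is H1-H2, the level difference between the first two harmonics. The sketch below measures it on a single voiced frame; the autocorrelation F0 estimate and the parameter values are illustrative simplifications (real analyses use more robust pitch trackers).

# Illustrative H1-H2 measurement on one voiced frame.  Breathier, less
# pressed phonation tends to show a larger H1-H2.  The frame is assumed
# to span several pitch periods.
import numpy as np

def h1_h2(frame, fs, f0_min=70.0, f0_max=400.0):
    frame = frame * np.hanning(len(frame))
    # crude F0 estimate from the autocorrelation peak
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    f0 = fs / (lo + np.argmax(ac[lo:hi]))
    # harmonic amplitudes read off a zero-padded spectrum
    spec = 20 * np.log10(np.abs(np.fft.rfft(frame, 8 * len(frame))) + 1e-12)
    freqs = np.fft.rfftfreq(8 * len(frame), 1 / fs)
    h1 = spec[np.argmin(np.abs(freqs - f0))]
    h2 = spec[np.argmin(np.abs(freqs - 2 * f0))]
    return h1 - h2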
Hani Yehia, ATRI, Kyoto (Japan)
Mark Tiede, ATR HIP, Kyoto (Japan)
In this paper, 24 three-dimensional (3D) vocal-tract (VT) shapes extracted from MRI data are used to derive a parametric model of the vocal tract. The method is as follows: first, each 3D VT shape is sampled using a semi-cylindrical grid whose position is determined by reference points based on VT anatomy. Next, the VT projections onto each plane of the grid are represented by their first two principal components, obtained via principal component analysis (PCA). PCA is then applied again to parametrize the sequences of coefficients that represent the sections along the tract. It was verified that the first four components explain about 90% of the total variance of the observed shapes. Following this procedure, 3D VT shapes are approximated by linear combinations of four 3D basis functions. Finally, it is shown that the four parameters of the model can be estimated from VT midsagittal profiles.
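The two-stage PCA can be sketched as follows, assuming each VT shape has already been sampled on the grid so that every grid plane contributes a fixed-length contour vector; the array shapes and random placeholder data are illustrative, not the paper's.

# Sketch of the two-stage PCA: stage 1 reduces each grid-plane section to
# two coefficients; stage 2 reduces the per-shape coefficient sequences to
# four model parameters.
import numpy as np

def pca(X, n_components):
    """Rows of X are observations; returns mean, basis, coefficients,
    and the fraction of variance explained."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    var = (s**2)[:n_components].sum() / (s**2).sum()
    return mu, Vt[:n_components], (X - mu) @ Vt[:n_components].T, var

# stage 1: represent each section by its first two principal components
# sections: (n_shapes * n_planes, n_contour_points) -- placeholder data
sections = np.random.randn(24 * 30, 40)
_, basis1, coefs1, _ = pca(sections, 2)

# stage 2: PCA over the per-shape sequences of section coefficients
seqs = coefs1.reshape(24, -1)                  # (n_shapes, n_planes * 2)
_, basis2, coefs2, var4 = pca(seqs, 4)
print(f"variance explained by 4 components: {var4:.2f}")   # ~0.9 in the paper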
Rashid Ansari, University of Illinois, Chicago (U.S.A.)
In this paper, a new method for modifying the pitch of units of recorded female speech is described. The method was developed to overcome limitations in an otherwise promising technique called Residual-Excited Linear Prediction (RELP). In the new method, the stored speech unit is processed with a suitably shaped time-varying filter. The filtered signal is modified according to the required change in fundamental frequency and then applied to the inverse of the above-mentioned prefilter. Based on observations of the spectra of multiple recordings of the same speech unit at different pitch frequencies, the magnitude response of the inverse filter was chosen to have a significantly less peaky structure than is typically obtained with LPC. Speech modifications using this method were found to be superior in quality to those obtained by RELP, while at the same time being less sensitive than RELP to changes in pitch marking.
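The filter / modify / inverse-filter pipeline can be sketched as below. The "less peaky" prefilter is approximated here by bandwidth-expanded LPC (poles scaled toward the origin), the pitch change is a bare-bones PSOLA-style overlap-add assuming known pitch marks, and a single time-invariant filter stands in for the paper's time-varying one; all of this is an illustrative assumption, not the authors' design.

# Sketch: soft analysis filter -> epoch re-spacing -> inverse filter.
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """LPC via the autocorrelation method (Toeplitz normal equations)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:-1], r[1:])
    return np.concatenate(([1.0], -a))

def modify_pitch(x, marks, factor, order=18, bw=0.9):
    marks = np.asarray(marks)
    a = lpc(x, order)
    a_soft = a * bw ** np.arange(order + 1)   # bandwidth expansion: less peaky
    filtered = lfilter(a_soft, [1.0], x)
    # naive re-spacing of two-period windows; raising F0 (factor > 1) also
    # shortens duration here -- full PSOLA would repeat/drop segments
    T = int(np.mean(np.diff(marks)))
    new_marks = (marks / factor).astype(int)
    y = np.zeros(int(len(x) / factor) + 2 * T)
    for m_old, m_new in zip(marks[1:-1], new_marks[1:-1]):
        if m_old - T < 0 or m_old + T > len(filtered) or m_new - T < 0 \
                or m_new + T > len(y):
            continue
        y[m_new - T:m_new + T] += filtered[m_old - T:m_old + T] * np.hanning(2 * T)
    return lfilter([1.0], a_soft, y)          # inverse of the prefilter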
Helen M. Hanson, Sensimetrics Corp. (U.S.A.)
With the goal of synthesizing natural-sounding speech based on higher-level parameters, sources of vowel amplitude variation were studied for sentences having different prosodic patterns. Previous theoretical and experimental work has shown that sound pressure level (SPL) is proportional to subglottal pressure ($P_s$) on a log scale during production of sustained vowels. The current work is based on acoustic sound pressure signals and estimated $P_s$ signals recorded during the production of reiterant speech, which is closer to natural speech production and includes prosodic effects. The results show individual, and perhaps gender, differences in the relationship between SPL and $P_s$, and in the degree of vowel amplitude contrast between full and reduced vowels. However, a general trend among speakers is to use subglottal pressure to control vowel amplitude at the sentence level and at main prominences, and to use adjustments of glottal configuration to control vowel amplitude variations for reduced and non-nuclear full vowels. These results have implications not only for articulatory speech synthesis, but also for automatic speech recognition systems.
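The proportionality referred to above can be written as a per-speaker regression; this form is a sketch, not the paper's fitted values. $\mathrm{SPL}_0$, $P_{s,0}$, and the slope $k$ all vary by speaker, and values reported in the literature for sustained vowels put the slope on the order of 8-9 dB per doubling of $P_s$:

$$\mathrm{SPL} = \mathrm{SPL}_0 + k \,\log_{10}\!\left(\frac{P_s}{P_{s,0}}\right)$$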
Dong Bing Wei, University of Liverpool (U.K.)
Colin C. Goodyear, University of Liverpool (U.K.)
A parametric vocal tract model and a two-dimensional articulatory parametric subspace for a female voice are presented. The parameters of the model, which determine the vocal tract shape, can be found uniquely for VV transitions by mapping directly from F1 and F2 onto this subspace, while a modified technique involving F3 is available for voiced VC and CV diphones. The area functions of the vocal tract generated by these parameters are used to drive a time-domain synthesiser. Female speech, copied from either male or female natural speech, may thus be synthesised.
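One simple way to realise such a direct (F1, F2) to (p1, p2) mapping is to tabulate the formants produced at each point of a grid over the 2-D subspace once, then invert by nearest-neighbour lookup, as sketched below; `formants_from_params` stands in for the model plus tract acoustics and is a hypothetical placeholder.

# Sketch of inverting a 2-D articulatory subspace from measured formants.
import numpy as np

def build_table(formants_from_params, n=50):
    p1, p2 = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
    grid = np.column_stack([p1.ravel(), p2.ravel()])
    # formants_from_params(p) -> (F1, F2); placeholder for the real model
    table = np.array([formants_from_params(p) for p in grid])
    return grid, table

def invert(f1, f2, grid, table):
    # normalise so F1 and F2 contribute comparably to the distance
    d = ((table - [f1, f2]) / table.std(axis=0)) ** 2
    return grid[np.argmin(d.sum(axis=1))]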
Andrew Richard Greenwood, JMU (U.K.)
Two different parametric models of the vocal tract have been developed. These have been used to obtain area functions for use in an articulatory synthesiser based on the Kelly-Lochbaum model. Random sampling of the geometric space spanned by the model has been performed to obtain a codebook for use in spectral copy synthesis. A dynamic programming search of this codebook produces intelligible synthetic speech, but the overall quality is limited by the density of codebook entries in articulatory space. To increase the coverage without significantly increasing the codebook size, a method of generating several small codebooks, each of which covers a small region of acoustic space, has been developed. By using codebooks which map the regions of acoustic space defined by voiced diphones, it has been possible to improve the quality of the synthetic speech significantly.
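A dynamic programming search of this kind can be sketched as a Viterbi-style pass in which each frame picks a codebook entry trading off spectral match against a transition cost that discourages articulatory jumps; the cost definitions and the weight below are illustrative assumptions, not the paper's.

# Sketch of a DP (Viterbi-style) codebook search for copy synthesis.
import numpy as np

def dp_codebook_search(frames, cb_spectra, cb_params, w=1.0):
    """frames: (T, D) target spectra; cb_spectra: (K, D); cb_params: (K, P)."""
    T, K = len(frames), len(cb_spectra)
    local = ((frames[:, None, :] - cb_spectra[None]) ** 2).sum(-1)        # (T, K)
    trans = w * ((cb_params[:, None, :] - cb_params[None]) ** 2).sum(-1)  # (K, K)
    cost = local[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + trans          # previous state x next state
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + local[t]
    path = [int(cost.argmin())]                # backtrack the best sequence
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]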
David T. Chappell, Duke University (U.S.A.)
John H.L. Hansen, Duke University (U.S.A.)
This paper describes a new auditory-based distance measure intended for use in a concatenated synthesis technique wherein time- and frequency-domain characteristics are used to perform natural-sounding speaker synthesis. Whereas most concatenation systems use large databases (often more than 100,000 units), we begin from a small, limited database (approximately 400 units) and use a new spectral distortion measure to aid in the selection of phones for optimal concatenation. At the transition between speech segments, the new auditory-based distance metric assesses perceived discontinuities in the frequency domain. The distortion measure, which employs the Carney auditory model, is used to select phones which minimize the perceived distortion between concatenated segments. Moreover, time- and frequency-domain methods can shape the prosodic and spectral characteristics of each speech segment. The final results demonstrate improved performance over standard concatenation methods applied to small databases.
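The idea of a perceptual join cost can be sketched as below. The paper uses the Carney auditory model as its front end; here a simple mel filterbank stands in for it, so this is a generic illustration of a frequency-domain discontinuity measure, not the authors' metric.

# Sketch of a join cost: distance between auditory-like band energies at
# the boundary frames of two units to be concatenated.
import numpy as np

def mel_fbank_energies(frame, fs, n_bands=24):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    mel = 2595 * np.log10(1 + freqs / 700)
    edges = np.linspace(mel.min(), mel.max(), n_bands + 2)
    energies = np.array([spec[(mel >= lo) & (mel < hi)].sum() + 1e-12
                         for lo, hi in zip(edges[:-2], edges[2:])])
    return np.log(energies)

def join_cost(unit_a, unit_b, fs, frame_len=512):
    """Distance between the last frame of unit_a and first frame of unit_b."""
    ea = mel_fbank_energies(unit_a[-frame_len:], fs)
    eb = mel_fbank_energies(unit_b[:frame_len], fs)
    return np.linalg.norm(ea - eb)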
Douglas Nelson, Department of Defense (U.S.A.)
A new method for generating speech spectrograms is presented. The algorithm is based on an autocorrelation function whose parameters are chosen to provide processing gain and formant resolution while minimizing pitch artifacts in the spectrum. Crisp formants are produced, and the power ratio of the formants can be adjusted by pre-filtering the data. The process is functionally equivalent to a time-smoothed, windowed Wigner distribution, in which the cross-terms normally associated with the Wigner distribution are greatly attenuated by the smoothing operation.
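The smoothed-autocorrelation construction can be sketched as follows: averaging the local autocorrelation over neighbouring frames before the FFT suppresses pitch-synchronous fluctuation while keeping formant resolution. Window, lag, and smoothing lengths below are illustrative choices, not the paper's parameters.

# Sketch of a spectrogram from time-smoothed local autocorrelations.
import numpy as np

def acf_spectrogram(x, frame_len=256, hop=64, n_smooth=5, n_lags=128):
    w = np.hanning(frame_len)
    acs = []
    for start in range(0, len(x) - frame_len, hop):
        f = x[start:start + frame_len] * w
        ac = np.correlate(f, f, mode="full")[frame_len - 1:frame_len - 1 + n_lags]
        acs.append(ac)
    acs = np.array(acs)                        # (n_frames, n_lags)
    # time smoothing: moving average of each lag track across frames
    kernel = np.ones(n_smooth) / n_smooth
    smooth = np.apply_along_axis(lambda c: np.convolve(c, kernel, "same"), 0, acs)
    lag_win = np.hanning(2 * n_lags)[n_lags:]  # taper along the lag axis
    return 10 * np.log10(np.abs(np.fft.rfft(smooth * lag_win, axis=1)) + 1e-12)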
Steven Greenberg, ICSI / UC Berkeley (U.S.A.)
Brian E.D. Kingsbury, ICSI / UC Berkeley (U.S.A.)
Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
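A minimal version of such a representation can be sketched as below: pass the signal through a bank of critical-band-like filters, extract each band's envelope, and keep only the low-frequency modulations around the syllabic rate. The band spacing and cutoff frequencies are illustrative choices, not the authors' exact front end.

# Sketch of a modulation spectrogram: band envelopes, low-pass filtered
# to retain the slow (roughly 0-8 Hz) modulations.
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def modulation_spectrogram(x, fs, n_bands=15, mod_cutoff=8.0):
    # roughly log-spaced bands between 100 Hz and 0.45 * fs
    edges = np.geomspace(100, 0.45 * fs, n_bands + 1)
    env_sos = butter(4, mod_cutoff, "low", fs=fs, output="sos")
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, [lo, hi], "bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfilt(band_sos, x)))   # band envelope
        rows.append(sosfilt(env_sos, env))            # keep slow modulations
    return np.array(rows)                             # (n_bands, len(x))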
François Pellegrino, IRIT (France)
Régine André-Obrecht, IRIT (France)
This paper presents our work on vowel system detection as part of a project on Automatic Language Identification using phonological typologies. We have developed a vowel detection algorithm that is based on spectral analysis of the acoustic signal and requires no learning stage. It has been tested with two telephone speech corpora:
- with a French corpus provided by CNET, 7.4% of detections are false alarms, while about 25% of the vowels present in the signal are missed;
- experiments with 5 languages of the OGI_TS corpus yield 88.1% correct detection and about 15% non-detection.
We also present the LBG-Rissanen vector quantization (VQ) algorithm that we use for vowel system modeling. Preliminary experiments are reported.
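The codebook design loop at the heart of LBG can be sketched as below: start from the global centroid, split each codeword by a small perturbation, refine with k-means, and repeat. The Rissanen (MDL) criterion for choosing the final codebook size is summarised only as a comment; the splitting/refinement loop itself is the standard LBG algorithm.

# Sketch of LBG codebook design (n_codewords assumed a power of two).
import numpy as np

def lbg(X, n_codewords, n_iter=20, eps=1e-3):
    codebook = X.mean(axis=0, keepdims=True)
    while len(codebook) < n_codewords:
        # split every codeword into a perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):                # k-means refinement
            d = ((X[:, None] - codebook[None]) ** 2).sum(-1)
            labels = d.argmin(axis=1)
            for k in range(len(codebook)):
                if (labels == k).any():
                    codebook[k] = X[labels == k].mean(axis=0)
    # an MDL-style (Rissanen) rule would compare distortion plus codebook
    # description length across sizes and keep the size minimising their sum
    return codebook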
David van Kuijk, Nijmegen University (The Netherlands)
Louis Boves, Nijmegen University (The Netherlands)
In this paper, we investigate acoustic differences between vowels in syllables that do or do not carry lexical stress. The speech material on which the investigation is based differs from the type of material used in previous research: we used phonetically rich sentences from the Dutch POLYPHONE corpus. We briefly discuss the definition of the linguistic feature `lexical stress' and its possible impact on the phonetic realization. We then describe the experiments that were carried out and present the results. Although most of the duration, energy, and spectral tilt features used in the investigation show statistically significant differences between the population means for stressed and unstressed vowels, the distributions overlap to such an extent that automatic detection of stressed and unstressed syllables yields accuracy scores of not much more than 65%. It is argued that this is due to the large variety of ways in which the abstract linguistic feature `lexical stress' is realized in the acoustic speech signal.
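A detector built on the three feature families named above might look like the sketch below: per-vowel duration, energy, and a crude spectral-tilt measure, fed to a two-class Gaussian classifier. The feature definitions and the classifier are illustrative assumptions, not the paper's setup.

# Sketch: duration / energy / spectral-tilt features + Gaussian classifier.
import numpy as np

def vowel_features(seg, fs):
    duration = len(seg) / fs
    energy = 10 * np.log10(np.mean(seg ** 2) + 1e-12)
    spec = np.abs(np.fft.rfft(seg * np.hanning(len(seg)))) ** 2
    freqs = np.fft.rfftfreq(len(seg), 1 / fs)
    # tilt proxy: low-band vs high-band energy ratio (split at 1 kHz)
    tilt = 10 * np.log10(spec[freqs < 1000].sum() /
                         (spec[freqs >= 1000].sum() + 1e-12))
    return np.array([duration, energy, tilt])

class GaussianStressDetector:
    def fit(self, F, y):                    # F: (N, 3) features, y: 1 = stressed
        self.mu = [F[y == c].mean(0) for c in (0, 1)]
        self.var = F.var(0) + 1e-6          # shared diagonal covariance
        return self

    def predict(self, F):
        ll = [-(((F - m) ** 2) / self.var).sum(1) for m in self.mu]
        return (ll[1] > ll[0]).astype(int)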
Minsheng Liu, University of Frankfurt (Germany)
Arild Lacroix, University of Frankfurt (Germany)
This paper presents a pole-zero model of fricative sounds based on a multi-tube acoustic model. The model consists of front and back cavities, formed by the oral tract and the pharynx, with the excitation source located at the point of constriction. The transfer function of this model, including its poles and zeros, is derived and its properties are investigated. Small losses in the vocal tract, such as viscous friction, which are important for fricative sounds, are taken into account. The results show that if the vocal tract is lossless, the numerator of the pole-zero transfer function is symmetric; including small losses removes this symmetry constraint. The method is applied, using inverse filtering and an adaptive algorithm, to the analysis of fricative sounds.
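For a noise-excited fricative, pole-zero (ARMA) estimation can be sketched via Durbin's two-step method: a high-order all-pole fit first recovers an estimate of the white excitation, and the ARMA coefficients then follow from a linear regression of the signal on its own past and the past excitation. This is one standard route; the paper's own adaptive algorithm is not spelled out in the abstract.

# Sketch of ARMA estimation for a noise-excited signal (Durbin's method).
import numpy as np
from scipy.signal import lfilter
from scipy.linalg import solve_toeplitz

def ar_fit(x, order):
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:-1], r[1:])
    return np.concatenate(([1.0], -a))       # A(z) with a[0] = 1

def arma_fit(x, na, nb, n_high=50):
    e = lfilter(ar_fit(x, n_high), [1.0], x)  # estimated white excitation
    rows, rhs = [], []
    for n in range(max(na, nb), len(x)):
        past_x = -x[n - na:n][::-1]           # AR regressors
        past_e = e[n - nb:n][::-1]            # MA regressors
        rows.append(np.concatenate([past_x, past_e]))
        rhs.append(x[n] - e[n])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    a = np.concatenate(([1.0], theta[:na]))   # denominator (poles)
    b = np.concatenate(([1.0], theta[na:]))   # numerator (zeros)
    return b, a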
Thomas Wittenberg, University of Erlangen (Germany)
Patrick Mergell, University of Erlangen (Germany)
Monika Tigges, University of Erlangen (Germany)
Ulrich Eysholdt, University of Erlangen (Germany)
Semiautomatic motion-analysis software is used to extract elongation-time diagrams (trajectories) of vocal fold vibrations from digital high-speed video sequences. By combining digital image processing with biomechanical modeling, we extract characteristic parameters such as phonation onset time and pitch. A modified two-mass model of the vocal folds is employed in order to fit the main features of simulated time series to those of the extracted trajectories. By varying the model parameters, general conclusions can be drawn about laryngeal dysfunctions such as functional dysphonia. We present first results of semi-automatic motion analysis in combination with model simulations as a step towards computer-aided diagnosis of voice disorders.
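The trajectory-extraction step can be sketched as below: along one scan line crossing the glottis, threshold the dark glottal gap in each frame and record the left and right vocal-fold edge positions over time. The frame layout, the scan row, and the threshold rule are placeholder assumptions for illustration.

# Sketch: elongation-time trajectories from a high-speed video scan line.
import numpy as np

def edge_trajectories(frames, row, thresh=None):
    """frames: (T, H, W) grayscale video; row: scan-line index across glottis.
    Returns a (T, 2) array of left/right glottal edge positions in pixels."""
    lines = frames[:, row, :].astype(float)
    if thresh is None:
        thresh = lines.mean() - lines.std()   # crude darkness threshold
    traj = np.full((len(frames), 2), np.nan)
    for t, line in enumerate(lines):
        dark = np.where(line < thresh)[0]     # pixels inside the glottal gap
        if dark.size:
            traj[t] = dark[0], dark[-1]       # left and right fold edges
    return traj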