Chair: Gerard Chollet, ENST, France
Wen Ding, ATR-ITL (Japan)
Nick Campbell, ATR-ITL (Japan)
Speech polarity is crucial in many speech processing fields. We present a novel method to determine the polarity of speech signals from the gradient of spurious glottal waveforms. We use iterative adaptive LPC inverse filtering to cancel the effect of the vocal tract transfer function while maintaining most properties of the source excitation. We then take the first derivative (gradient component) of the spurious glottal waveforms to capture the sharp gradient near the glottal closure instant. Using these gradient components, we detect speech polarity, i.e., the polarity of the glottal waveforms, by finding whether the glottal closure instants are located above or below the zero line. Furthermore, a frame-based decision technique is applied to obtain robust results. Experimental results with a wide variety of speech utterances reveal high performance, and the computational complexity is much lower than that of a previously proposed method.
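The gradient-and-vote idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a glottal waveform estimate is already available (the inverse filtering step is omitted), uses a synthetic pulse train, and assumes the convention that a sharp negative gradient marks glottal closure in a positive-polarity signal. The function names and the frame length are hypothetical.

```python
def frame_polarity(frame):
    """Vote +1, -1 or 0 for one frame of an estimated glottal waveform."""
    # First difference approximates the gradient of the waveform.
    grad = [b - a for a, b in zip(frame, frame[1:])]
    if not grad:
        return 0
    # The sharpest gradient in the frame is taken as the closure instant.
    peak = max(grad, key=abs)
    if peak == 0:
        return 0
    # Assumed convention: closure below the zero line means positive polarity.
    return 1 if peak < 0 else -1


def detect_polarity(glottal, frame_len=200):
    """Majority vote over non-overlapping frames; 0 means undecided."""
    votes = 0
    for start in range(0, len(glottal) - frame_len + 1, frame_len):
        votes += frame_polarity(glottal[start:start + frame_len])
    return (votes > 0) - (votes < 0)


def toy_glottal(n_periods=6):
    """Synthetic pulse train: slow opening, abrupt closure, closed phase."""
    one_period = ([i / 80 for i in range(80)]       # slow rise
                  + [1 - i / 5 for i in range(5)]   # abrupt fall
                  + [0.0] * 15)                     # closed phase
    return one_period * n_periods
```

Calling `detect_polarity(toy_glottal())` votes on three frames of the synthetic signal; negating every sample flips the decision.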
Hideki Banno, Nara Institute of Science and Technology (Japan)
Jinlin Lu, Nara Institute of Science and Technology (Japan)
Satoshi Nakamura, Nara Institute of Science and Technology (Japan)
Kiyohiro Shikano, Nara Institute of Science and Technology (Japan)
Hideki Kawahara, Wakayama University (Japan)
An efficient representation of the short-time phase characteristics of speech sounds is proposed, based on recent findings which suggest the perceptual importance of phase characteristics. Subjective tests indicated that speech sounds synthesized by the proposed method are indistinguishable from the original speech sounds with moderate data compression. The proposed representation uses the lower-order coefficients of the inverse Fourier transform of the group delay of speech. It also alleviates the voiced-unvoiced decision, which is an indispensable part of conventional speech coding algorithms. These features make our method potentially very useful in many applications, such as speech morphing.
Gilles Fay, ENST (France)
Eric Moulines, ENST (France)
Olivier Cappé, ENST (France)
Frederic Bimbot, IRISA (France)
Harmonic plus noise models have been successfully applied to a broad range of speech processing applications, including, among others, low bit-rate speech coding and speech restoration and transformation. In conventional methods, the frequencies, the relative phases and the amplitudes of the pitch-harmonic components are assumed to be piecewise constant over an analysis frame. This assumption is inadequate in segments where fast variations of these parameters may occur, e.g. at phoneme-to-phoneme boundaries or speech onsets. In this contribution, a time-varying model of the pitch-harmonic parameters is presented; it is based on a basis expansion technique, which consists of representing the time-varying functions as a linear combination of fixed basis functions. An estimation procedure for the parameters of this expansion is presented. Results are provided to demonstrate the effectiveness of this approach.
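A basis expansion of this kind can be sketched as follows. The abstract does not say which basis functions or which estimator are used, so the polynomial basis and the least-squares fit below are assumptions, and all function names are hypothetical. For simplicity the sketch fits a given amplitude track within one frame rather than estimating the parameters from the speech waveform itself.

```python
def poly_basis(n_samples, order):
    """Rows of [1, t, t^2, ...] for t spread over [-1, 1] (n_samples >= 2)."""
    ts = [2 * i / (n_samples - 1) - 1 for i in range(n_samples)]
    return [[t ** k for k in range(order + 1)] for t in ts]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fit_expansion(track, order=2):
    """Least-squares coefficients c_k so that track[t] ~ sum_k c_k * phi_k(t)."""
    B = poly_basis(len(track), order)
    K = order + 1
    # Normal equations: (B^T B) c = B^T y
    BtB = [[sum(row[i] * row[j] for row in B) for j in range(K)] for i in range(K)]
    Bty = [sum(row[i] * y for row, y in zip(B, track)) for i in range(K)]
    return solve(BtB, Bty)


def evaluate(coeffs, n_samples):
    """Reconstruct the time-varying parameter from its expansion coefficients."""
    B = poly_basis(n_samples, len(coeffs) - 1)
    return [sum(c * p for c, p in zip(coeffs, row)) for row in B]
```

With `order=0` the expansion reduces to the conventional piecewise-constant assumption, so the order controls how much within-frame variation the model can follow.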
Burhan F Necioglu, Georgia Institute of Technology (U.S.A.)
Mark A. Clements, Georgia Institute of Technology (U.S.A.)
Thomas P Barnwell III, Georgia Institute of Technology (U.S.A.)
Astrid Schmidt-Nielsen, Naval Research Laboratory (U.S.A.)
Subjective testing of speaker recognizability is an intricate, time consuming and very expensive process, but using objectively measurable descriptors to augment the subjective speaker recognizability tests could result in increased efficiency and reliability. This paper describes our investigation into the relevancy of a set of objective descriptors to human perception of speaker identity through multidimensional scaling (MDS) of subjective speaker pair similarity judgments. The evaluated objective descriptors can achieve same/different detection error rates as low as 4.13% for male speaker pairs, and 8.17% for female speaker pairs, with only 3 seconds of speech. Five descriptors related to glottal, vocal tract and prosodic features were found to have significant correlations with the perceptual dimensions of the MDS solutions.
Raymond N.J. Veldhuis, IPO (The Netherlands)
The paper analyses how variations of the parameters of the Liljencrants-Fant (LF) model of glottal flow influence the speech spectrum, in order to determine the spectral relevance of these parameters. The effects of small parameter variations are described analytically. This analysis also gives an indication to what extent the LF parameters can be estimated reliably from the speech spectrum. The effects of larger parameter variations are discussed with the help of figures. Results are presented for a number of sets of estimated glottal-pulse parameters that were taken from the literature. The main conclusion is that the LF model, which, given the fundamental period, is a three-parameter model, actually operates as a one- or a two-parameter model.
Matti Karjalainen, Helsinki University of Technology (Finland)
Toomas Altosaar, Helsinki University of Technology (Finland)
Martti Vainio, University of Helsinki (Finland)
A text-to-speech synthesis technique, based on warped linear prediction (WLP) and neural networks, is presented for high-quality, individual-sounding synthetic speech. Warped linear prediction is used as a speech production model with wide audio bandwidth yet with highly compressed control parameter data. An excitation codebook, inverse filtered from a target speaker's voice, is applied to obtain individual tone quality. A set of neural networks, specialized to yield synthesis control parameters from phonemic input in specific contexts, generates the detailed parametric controls of WLP. Neural nets are also used successfully to compute the prosodic parameters. We have applied this approach in prototyping highly improved text-to-speech synthesis for the Finnish language.
Alex Acero, Microsoft Research (U.S.A.)
This paper presents two time-scale pitch-scale modification techniques to be used in speech synthesis systems. They have been applied to Microsoft's Whistler system, which is based on concatenative synthesis. Both methods are based on a source-filter model, one of them using LPC parameters and the other one using cepstral parameters. The proposed methods achieve high quality prosody modification, retain the characteristics of the donor speaker, allow for spectral manipulation (to reduce spectral discontinuities at unit boundaries), and yield compact acoustic inventories.
David T. Chappell, Duke University (U.S.A.)
John H.L. Hansen, Duke University (U.S.A.)
This paper describes new techniques for modeling and generating speaker-dependent pitch contours for sentences. Speech synthesis applications could generally benefit from such speaker-specific pitch contours. The proposed algorithms begin with an existing pitch contour for an utterance and use data from training utterances to modify the contour to be appropriate for a second speaker. One approach modifies the original pitch values to statistically match the desired speaker at each point in time. A second, novel approach uses dynamic time warping (DTW) to select a new pitch contour from a pre-determined codebook and time-align the chosen contour to the original sentence. Such contour mapping can transfer one speaker's natural pitch characteristics to another person's speech. Informal listener evaluations suggest that while shifting the frequency range of the original pitch contour yields some improvement, better results are obtained by applying DTW techniques to time-warp the contour from an existing sentence produced by the desired speaker.
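The DTW-based selection and alignment step can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's algorithm: contours are plain lists of pitch values in Hz, each contour is mean-centred before comparison so that contour shape rather than pitch range drives the match, the local distance is an absolute difference, and the function names are hypothetical.

```python
def dtw_path(a, b):
    """Return (total cost, alignment path) between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def centered(contour):
    """Remove the mean so that contour shape, not pitch range, is compared."""
    mu = sum(contour) / len(contour)
    return [x - mu for x in contour]


def select_and_align(original, codebook):
    """Pick the closest codebook contour and warp it onto the original's frames."""
    ref = centered(original)
    best = min(codebook, key=lambda c: dtw_path(ref, centered(c))[0])
    _, path = dtw_path(ref, centered(best))
    # Each original frame receives the mean of the codebook frames aligned to it.
    sums = [0.0] * len(original)
    counts = [0] * len(original)
    for i, j in path:
        sums[i] += best[j]
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]
```

The returned contour has one value per frame of the original sentence but takes its pitch values from the selected codebook entry, which is the sense in which the target speaker's contour is time-aligned to the original utterance.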
Zhenli Yu, Hangzhou University (China)
P.C. Ching, Chinese University of Hong Kong (Hong Kong)
A new approach to producing high-fidelity speech sound by applying both the inverse solution of speech production and pitch-synchronous articulatory synthesis techniques is presented. Given a formant trace target, the dynamic vocal-tract (VT) area function, together with the time-variant VT length, is estimated using an inverse solution of speech production. The improved Kelly-Lochbaum filter of the synthesizer, with multi-rate system sampling and dynamic scattering wave adjustment, is employed to deal with the variable VT length and VT area function. A distinguishing feature of this method is that artificially specified formant traces can be precisely obtained. Experimental results show that the formant targets can be precisely matched by the synthetic sound. A potential application of this method to text-to-speech conversion is discussed.
Fu-chiang Chou, National Taiwan University (Taiwan)
Chiu-yu Tseng, Institute of Linguistics, Academia Sinica (Taiwan)
This paper describes an improved concatenative synthesis module for a Chinese text-to-speech system. The concatenated segments are selected on-line from a designed speech corpus that is precisely segmented with improved HMM models. The selection criteria are the prosodic and contextual similarities between the units and the desired targets from the previous module of the TTS system. TD-PSOLA modifies the prosodic parameters of the selected units, and three methods for unit concatenation are applied according to the type of syllabic juncture. These types are classified using knowledge from phonetic observations of large amounts of speech data. The output speech is remarkably fluent and natural because the coarticulation effects across syllabic boundaries are well modeled and less prosodic modification is needed from TD-PSOLA.
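Selecting a unit by prosodic and contextual similarity to a target can be sketched as a simple cost minimisation. The abstract does not specify the features or how the two similarities are combined, so everything below is an assumption for illustration: the `Unit` fields, the relative pitch and duration distance, the neighbour-mismatch count, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    syllable: str    # syllable identity
    f0: float        # mean pitch in Hz
    duration: float  # duration in seconds
    left: str        # neighbouring syllable on the left
    right: str       # neighbouring syllable on the right


def target_cost(unit, target, w_prosody=1.0, w_context=1.0):
    """Weighted sum of prosodic distance and contextual mismatch (assumed form)."""
    prosody = (abs(unit.f0 - target.f0) / target.f0
               + abs(unit.duration - target.duration) / target.duration)
    context = (unit.left != target.left) + (unit.right != target.right)
    return w_prosody * prosody + w_context * context


def select_unit(candidates, target):
    """Return the corpus unit of the right syllable with the lowest cost."""
    same = [u for u in candidates if u.syllable == target.syllable]
    if not same:
        raise ValueError("no candidate for syllable " + target.syllable)
    return min(same, key=lambda u: target_cost(u, target))
```

A unit that is already close to the target in pitch and duration needs less prosodic modification afterwards, which is consistent with the abstract's remark that less modification is needed from TD-PSOLA.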