ICASSP '98 Main Page

Abstract -  SP26   


 
SP26.1

   
Determining Polarity of Speech Signals Based on Gradient of Spurious Glottal Waveforms
W. Ding, N. Campbell  (ATR-ITL, Japan)
Speech polarity is crucial in many speech processing fields. We present a novel method to determine the polarity of speech signals from the gradient of spurious glottal waveforms. We use iterative adaptive LPC inverse filtering to cancel the effect of the vocal tract transfer function while maintaining most properties of the source excitation. Then we take the first derivative (gradient component) of the spurious glottal waveforms to capture the sharp gradient near the glottal closure instant. Using these gradient components, we detect speech polarity, i.e., the polarity of the glottal waveforms, by finding whether the glottal closure instants are located above or below the zero line. Furthermore, a frame-based decision technique is applied to get robust results. Experimental results with a wide variety of speech utterances show high performance, and the computational complexity is much lower than that of a previously proposed method.
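The gradient-and-vote idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the input is already an inverse-filtered (glottal-like) waveform, it skips the iterative adaptive LPC step, and it assumes the convention that a sharp negative gradient at closure means positive polarity. The function names and the frame length are hypothetical.

```python
def frame_polarity_votes(glottal, frame_len):
    """Return one vote (+1 or -1) per frame from the sign of the sharpest gradient."""
    # First difference approximates the first derivative of the waveform.
    grad = [b - a for a, b in zip(glottal, glottal[1:])]
    votes = []
    for start in range(0, len(grad) - frame_len + 1, frame_len):
        frame = grad[start:start + frame_len]
        # The sharpest slope is assumed to lie near the glottal closure instant.
        extreme = max(frame, key=abs)
        if extreme != 0:
            # Assumed convention: closure below the zero line => positive polarity.
            votes.append(1 if extreme < 0 else -1)
    return votes


def detect_polarity(glottal, frame_len):
    """Majority vote over frames; +1 for positive polarity, -1 for inverted."""
    votes = frame_polarity_votes(glottal, frame_len)
    return 1 if sum(votes) >= 0 else -1


# Synthetic pulse train: slow opening phase, abrupt closure, short closed phase.
pulse = [i / 16 for i in range(16)] + [0.5, 0.0, 0.0, 0.0]
signal = pulse * 10
inverted = [-v for v in signal]
```

Calling `detect_polarity(signal, 20)` and `detect_polarity(inverted, 20)` on the synthetic pulse train gives opposite signs, which is the behaviour the frame-based decision is meant to make robust on real speech.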
 
SP26.2

   
Efficient Representation of Short-time Phase Based on Group Delay
H. Banno, J. Lu, S. Nakamura, K. Shikano  (Nara Institute of Science and Technology, Japan);   H. Kawahara  (Wakayama University, Japan)
An efficient representation of short-time phase characteristics of speech sounds is proposed, based on recent findings that suggest the perceptual importance of phase characteristics. Subjective tests indicated that speech sounds synthesized by the proposed method are indistinguishable from the original speech sounds with moderate data compression. The proposed representation uses lower-order coefficients of the inverse Fourier transform of the group delay of speech. It also alleviates the need for a voiced-unvoiced decision, which is an indispensable part of conventional speech coding algorithms. These features make our method potentially very useful in many applications like speech morphing.
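As a rough illustration of "lower-order coefficients of the inverse Fourier transform of the group delay", the sketch below computes the group delay of one frame with the standard identity tau(k) = Re(DFT(n x[n]) / DFT(x[n])) and keeps the first few inverse-DFT coefficients. The windowing, smoothing and quantization used in the paper are not known from the abstract; the function names and the `eps` guard are assumptions.

```python
import cmath


def dft(x):
    """Plain O(N^2) discrete Fourier transform, adequate for a short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def group_delay(frame, eps=1e-12):
    """Group delay in samples at each DFT bin: Re(DFT(n*x[n]) / DFT(x[n]))."""
    spectrum = dft(frame)
    ramped = dft([t * v for t, v in enumerate(frame)])
    # Bins with (near) zero magnitude have undefined delay; set them to 0 here.
    return [(y / x).real if abs(x) > eps else 0.0
            for x, y in zip(spectrum, ramped)]


def group_delay_coeffs(frame, order):
    """First `order` coefficients of the inverse DFT of the group delay."""
    tau = group_delay(frame)
    n = len(tau)
    return [sum(tau[k] * cmath.exp(2j * cmath.pi * k * m / n)
                for k in range(n)).real / n
            for m in range(order)]


# A unit impulse delayed by 3 samples has a constant group delay of 3.
coeffs = group_delay_coeffs([0, 0, 0, 1, 0, 0, 0, 0], 3)
```

For the delayed impulse, only the zeroth coefficient is non-zero, which shows how a smooth group delay curve compresses into a few low-order terms.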
 
SP26.3

   
Polynomial Quasi-Harmonic Models for Speech Analysis and Synthesis
G. Fay, E. Moulines, O. Cappé  (ENST, France);   F. Bimbot  (IRISA, France)
Harmonic plus noise models have been successfully applied to a broad range of speech processing applications, including, among others, low bit-rate speech coding, and speech restoration and transformation. In conventional methods, the frequencies, the relative phases and the amplitudes of the pitch-harmonic components are assumed to be piecewise constant over an analysis frame. This assumption is inadequate in segments where fast variations of these parameters may occur, e.g. phoneme-to-phoneme boundaries or speech onsets. In this contribution, a time-varying model of the pitch-harmonic parameters is presented; it is based on a basis expansion technique, which represents the time-varying functions as a linear combination of fixed basis functions. An estimation procedure for the parameters of this expansion is presented. Results are provided to demonstrate the effectiveness of this approach.
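A basis expansion of this kind can be sketched for a single sinusoidal component: the in-phase and quadrature amplitudes are written as polynomials in time, which makes the model linear in the unknown coefficients, and these are then fitted by least squares. This is only an illustration under strong assumptions (one component, known fixed frequency, monomial basis, normal equations); the estimation procedure of the paper is not described in the abstract, and all names here are hypothetical.

```python
import math


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_poly_harmonic(signal, omega, order):
    """Fit s[i] ~ a(t) cos(omega i) + b(t) sin(omega i), a and b polynomials in t.

    t is the sample index rescaled to [-1, 1]. Returns (a_coeffs, b_coeffs),
    lowest power first.
    """
    n = len(signal)
    times = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    rows = []
    for i, t in enumerate(times):
        row = [t ** p * math.cos(omega * i) for p in range(order + 1)]
        row += [t ** p * math.sin(omega * i) for p in range(order + 1)]
        rows.append(row)
    k = 2 * (order + 1)
    # Normal equations (A^T A) c = A^T s of the linear least-squares problem.
    ata = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    atb = [sum(r[a] * s for r, s in zip(rows, signal)) for a in range(k)]
    coeffs = solve(ata, atb)
    return coeffs[:order + 1], coeffs[order + 1:]


# Synthetic onset: amplitude grows linearly across the frame, a(t) = 1 + 0.5 t.
N, OMEGA = 200, 0.3
test_signal = [(1.0 + 0.5 * (2.0 * i / (N - 1) - 1.0)) * math.cos(OMEGA * i)
               for i in range(N)]
a_coeffs, b_coeffs = fit_poly_harmonic(test_signal, OMEGA, 1)
```

A piecewise-constant model would have to average the growing amplitude over the frame, whereas the first-order expansion recovers both the level and the slope.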
 
SP26.4

   
Perceptual Relevance of Objectively Measured Descriptors for Speaker Characterization
B. Necioglu, M. Clements, T. Barnwell III  (Georgia Institute of Technology, USA);   A. Schmidt-Nielsen  (Naval Research Lab, USA)
Subjective testing of speaker recognizability is an intricate, time-consuming and very expensive process, but using objectively measurable descriptors to augment the subjective speaker recognizability tests could result in increased efficiency and reliability. This paper describes our investigation into the relevance of a set of objective descriptors to human perception of speaker identity through multidimensional scaling (MDS) of subjective speaker pair similarity judgments. The evaluated objective descriptors can achieve same/different detection error rates as low as 4.13% for male speaker pairs, and 8.17% for female speaker pairs, with only 3 seconds of speech. Five descriptors related to glottal, vocal tract and prosodic features were found to have significant correlations with the perceptual dimensions of the MDS solutions.
 
SP26.5

   
The Spectral Relevance of Glottal-Pulse Parameters
R. Veldhuis  (IPO, The Netherlands)
The paper analyses how variations of the parameters of the Liljencrants-Fant (LF) model of glottal flow influence the speech spectrum, in order to determine the spectral relevance of these parameters. The effects of small parameter variations are described analytically. This analysis also gives an indication to what extent the LF parameters can be estimated reliably from the speech spectrum. The effects of larger parameter variations are discussed with the help of figures. Results are presented for a number of sets of estimated glottal-pulse parameters that were taken from the literature. The main conclusion is that the LF model, which, given the fundamental period, is a three-parameter model, actually operates as a one- or a two-parameter model.
 
SP26.6

   
Speech Synthesis Using Warped Linear Prediction and Neural Networks
M. Karjalainen, T. Altosaar  (Helsinki University of Technology, Finland);   M. Vainio  (University of Helsinki, Finland)
A text-to-speech synthesis technique, based on warped linear prediction (WLP) and neural networks, is presented for high-quality, individual-sounding synthetic speech. Warped linear prediction is used as a speech production model with wide audio bandwidth yet with highly compressed control parameter data. An excitation codebook, inverse filtered from a target speaker's voice, is applied to obtain individual tone quality. A set of neural networks, specialized to yield synthesis control parameters from phonemic input in specific contexts, generates the detailed parametric controls of WLP. Neural nets are also used successfully to compute the prosodic parameters. We have applied this approach in prototyping highly improved text-to-speech synthesis for the Finnish language.
 
SP26.7

   
Source-Filter Models for Time-Scale Pitch-Scale Modification of Speech
A. Acero  (Microsoft Research, USA)
This paper presents two time-scale pitch-scale modification techniques to be used in speech synthesis systems. They have been applied to Microsoft’s Whistler system, which is based on concatenative synthesis. Both methods are based on a source-filter model, one of them using LPC parameters and the other one using cepstral parameters. The proposed methods achieve high quality prosody modification, retain the characteristics of the donor speaker, allow for spectral manipulation (to reduce spectral discontinuities at unit boundaries), and yield compact acoustic inventories.
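The source-filter split underlying the LPC-based variant can be sketched as follows: an all-pole filter is estimated with the autocorrelation method and the Levinson-Durbin recursion, the signal is inverse filtered to obtain the source (residual), and the filter is driven by that residual again to resynthesize. The time-scale and pitch-scale modification itself is not shown, and nothing here reflects the actual Whistler implementation; the function names are hypothetical.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] with x[n] ~ sum_k a[k] x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:]


def inverse_filter(x, a):
    """Source estimate: e[n] = x[n] - sum_k a[k] x[n-k] (zero initial state)."""
    return [x[n] - sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]


def synthesize(e, a):
    """All-pole resynthesis: y[n] = e[n] + sum_k a[k] y[n-k]."""
    y = []
    for n in range(len(e)):
        y.append(e[n] + sum(a[k] * y[n - 1 - k]
                            for k in range(len(a)) if n - 1 - k >= 0))
    return y


# Impulse response of a one-pole filter with pole at 0.9.
x = [0.9 ** n for n in range(200)]
a = levinson_durbin(autocorr(x, 2), 2)
residual = inverse_filter(x, a)
rebuilt = synthesize(residual, a)
```

In a scheme of this kind, prosody changes would be applied to the residual (for example by repositioning pitch periods) before resynthesis, while the filter coefficients carry the spectral envelope that can be smoothed at unit boundaries.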
 
SP26.8

   
Speaker-Specific Pitch Contour Modeling and Modification
D. Chappell, J. Hansen  (Duke University, USA)
This paper describes new techniques for modeling and generating speaker-dependent pitch contours for sentences. Speech synthesis applications could generally benefit from such speaker-specific pitch contours. The proposed algorithms begin with an existing pitch contour for an utterance and use data from training utterances to modify the contour to be appropriate for a second speaker. One approach modifies the original pitch values to statistically match the desired speaker at each point in time. A second novel approach uses dynamic time warping (DTW) to select a new pitch contour from a pre-determined codebook and time-align the chosen contour to the original sentence. Such contour mapping can transfer one speaker's natural pitch characteristics to another person's speech. Informal listener evaluations suggest that while shifting the frequency range of the original pitch contour yields some improvement, better results are obtained by applying DTW techniques to time-warp the contour from an existing sentence produced by the desired speaker.
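The DTW-based selection and alignment step can be sketched as below. The sketch assumes an absolute-difference local cost on raw pitch values and the basic three-way step pattern; the abstract does not say which cost, normalization or path constraints the authors used, and the function names are hypothetical.

```python
def dtw(ref, cand):
    """Return (total cost, warping path) aligning cand to ref."""
    n, m = len(ref), len(cand)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - cand[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which frames were matched.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


def select_and_align(original, codebook):
    """Pick the codebook contour closest under DTW and warp it onto original's frames."""
    results = [dtw(original, cand) for cand in codebook]
    best = min(range(len(codebook)), key=lambda k: results[k][0])
    _, path = results[best]
    buckets = [[] for _ in original]
    for i, j in path:
        buckets[i].append(codebook[best][j])
    # Average codebook values that map to the same original frame.
    aligned = [sum(b) / len(b) for b in buckets]
    return best, aligned


original = [1, 2, 3, 3, 2, 1]
codebook = [[1, 3, 1], [5, 5, 5]]
best_index, aligned = select_and_align(original, codebook)
```

The returned contour has one value per frame of the original sentence, so it can replace the original pitch track directly; in practice the contours would likely be normalized for pitch range before comparison.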
 
SP26.9

   
Articulatory Synthesis of Formant Targeted Sounds with Parameters Derived from the Inverse Solution of Speech Production
Z. Yu  (Hanzhou University, P R China);   P. Ching  (Chinese University of Hong Kong, P R China)
A new approach to producing high-fidelity speech sound by applying both the inverse solution of speech production and pitch-synchronous articulatory synthesis techniques is presented. Given a formant trace target, the dynamic vocal-tract (VT) area function together with the time-variant VT length are estimated using an inverse solution of speech production. The improved Kelly-Lochbaum filter of the synthesizer, with multi-rate system sampling and dynamic scattering wave adjustment, is employed to deal with the variable VT length and VT area function. A distinguishing feature of this method is that artificially specified formant traces can be precisely obtained. Experimental results show that the formant targets can be precisely matched by the synthetic sound. A potential application of this method for text-to-speech conversion is discussed.
 
SP26.10

   
Corpus-Based Mandarin Speech Synthesis with Contextual Syllabic Units Based on Phonetic Properties
F. Chou  (National Taiwan University, Taiwan, ROC);   C. Tseng  (Inst. of Linguistics, Acad. Sinica, Taiwan, ROC)
This paper describes an improved concatenative synthesis module for a Chinese text-to-speech system. The concatenated segments are selected on-line from a designed speech corpus that is precisely segmented with an improved version of HMM models. The selection criteria are the prosodic and contextual similarities between the units and the desired targets from the previous module of the TTS system. The TD-PSOLA modifies the prosodic parameters of the selected units, and three methods for unit concatenation are performed according to the types of the syllabic junctures. These types are classified with the knowledge from the phonetic observations of large amounts of speech data. The output speech is remarkably fluent and natural because the coarticulation effects across syllabic boundaries are well modeled and less prosodic modification is needed for the TD-PSOLA.
 
