Speech Analysis


Recursive Linear Prediction Using OBE Identification With Automatic Bound Estimation

Authors:

John R. Deller, Michigan State University (U.S.A.)
Tsung Ming Lin, Michigan State University (U.S.A.)
Majid Nayeri, Michigan State University (U.S.A.)

Volume 2, Page 1279

Abstract:

Application of set-membership (SM) identification to real-time speech processing is made possible by the optimal bounding ellipsoid algorithm with automatic bound estimation (OBE-ABE) that blindly deduces model-input bounds. To date, lack of any tenable approach to estimating bounds in speech models has rendered these interesting new SM methods impractical. OBE-ABE is consistently convergent, offers significant computational advantages, and provides a set of feasible solutions in finite time.

ic971279.pdf




Nonlinear Long-Term Prediction of Speech Signals

Authors:

Martin Birgmeier, Vienna University of Technology (Austria)
Hans-Peter Bernhard, Vienna University of Technology (Austria)
Gernot Kubin, Vienna University of Technology (Austria)

Volume 2, Page 1283

Abstract:

We present an in-depth study of nonlinear long-term prediction of speech signals. Successful long-term prediction strongly depends on the nonlinear oscillator framework for speech modeling. This hypothesis has been confirmed in a series of experiments run on a voiced speech database. We provide results for the prediction gain as a function of the prediction delay using two methods. One is based on an extended form of radial basis function networks. The other relies on calculating the mutual information between multiple signal samples. We explain the role of this mutual information function as the upper bound on the achievable prediction gain. We show that with matching memory and dimension, the two methods yield nearly the same value for the achievable prediction gain. It turns out that the nonlinear predictor's gain is significantly higher than that for a linear predictor using the same parameters.
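The mutual-information bound can be illustrated with a minimal histogram estimate of the information shared between a sample and its delayed copy. This is a hedged sketch only: the paper's estimator uses multiple signal samples, and the binning, signal lengths, and the `mutual_information` function name below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mutual_information(x, delay, bins=32):
    """Histogram estimate (in bits) of I(x[n]; x[n-delay]).

    A crude stand-in for a multi-sample mutual information
    estimator; bin count and estimator are illustrative choices.
    """
    a, b = x[delay:], x[:-delay]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal of x[n]
    py = p.sum(axis=0, keepdims=True)   # marginal of x[n-delay]
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

t = np.arange(20000)
sine = np.sin(2 * np.pi * 0.01 * t)                      # fully predictable
noise = np.random.default_rng(0).standard_normal(20000)  # unpredictable
```

For a noiseless periodic signal the delayed sample is a deterministic function of the current one, so the estimated information stays high at any delay, while white noise gives a value near zero; this is the sense in which mutual information upper-bounds the achievable prediction gain.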

ic971283.pdf




Vocal Tract Shape Trajectory Estimation using MLP Analysis-by-Synthesis

Authors:

Hywel B. Richards, University of Wales Swansea (U.K.)
John S. Mason, University of Wales Swansea (U.K.)
John S. Bridle, Dragon Systems UK Ltd. (U.K.)
Melvyn J. Hunt, Dragon Systems UK Ltd. (U.K.)

Volume 2, Page 1287

Abstract:

The objective of this work is a computationally efficient method for inferring vocal tract shape trajectories from acoustic speech signals. We use an MLP to model the vocal tract shape-to-acoustics mapping, then in an analysis-by-synthesis approach, optimise an objective function that includes both the accuracy of the spectrum approximation and the credibility of the vocal tract dynamics. This optimisation carries out gradient descent using back-propagation of derivatives through the MLP. Employing a series of MLPs of increasing order avoids getting trapped in local optima caused by the many-to-one mapping between vocal tract shapes and acoustics. We obtain two orders of magnitude speed increase compared with our previous methods using codebooks and direct optimisation of a synthesiser.
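The descend-through-the-network step can be sketched with a toy fixed MLP and manual back-propagation of the spectral error onto the network *input*. Everything here is an illustrative stand-in: the network sizes, random weights, learning rate, and two-dimensional "shape" vector are assumptions, and the vocal-tract-dynamics penalty from the objective function is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy fixed "shape -> spectrum" MLP (2 inputs, 8 hidden, 4 outputs);
# in the paper this network would be trained on articulatory data.
W1, b1 = rng.standard_normal((8, 2)) * 0.5, np.zeros(8)
W2, b2 = rng.standard_normal((4, 8)) * 0.5, np.zeros(4)

def forward(s):
    h = np.tanh(W1 @ s + b1)
    return W2 @ h + b2, h

def fit_shape(target, steps=2000, lr=0.2):
    """Analysis-by-synthesis: gradient-descend the network input so
    the synthesised spectrum matches the target spectrum."""
    s = np.zeros(2)
    for _ in range(steps):
        y, h = forward(s)
        e = y - target                                  # spectral error
        # back-propagate d(0.5*|e|^2)/ds through the fixed MLP
        s -= lr * (W1.T @ ((W2.T @ e) * (1.0 - h ** 2)))
    return s

target, _ = forward(np.array([0.3, -0.5]))   # spectrum of a known shape
s_hat = fit_shape(target)
```

The key design point the abstract exploits is that the MLP is differentiable, so the same back-propagation machinery used for training can instead optimise the input while the weights stay fixed.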

ic971287.pdf




Fast and Robust Joint Estimation of Vocal Tract and Voice Source Parameters

Authors:

Ding Wen, ATR Interpreting Telecomm Research Lab. (Japan)
Nick Campbell, ATR Interpreting Telecomm Research Lab. (Japan)
Higuchi Norio, ATR Interpreting Telecomm Research Lab. (Japan)

Volume 2, Page 1291

Abstract:

A new pitch-synchronous method is described for jointly estimating vocal tract and voice source parameters from speech signals based on an ARX model. The method uses Kalman filtering to estimate the time-varying coefficients and simulated annealing to handle the non-linear optimization of the Rosenberg-Klatt parameters. A compact formulation is incorporated into the algorithm to reduce computational cost. Further, an automatic model order selection method is proposed to determine the proper analysis pole order of the ARX model, based on the estimated formant bandwidths. The new method has been shown to be much faster than our previous method, and the order selection technique has been shown to be effective. Finally, an ATR two-channel speech database including varying sentence-level prominence patterns is used to verify the proposed method.
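The Kalman-filtering component can be sketched in isolation: a scalar filter tracking one slowly varying AR coefficient. This is a hedged toy, not the paper's joint ARX estimator; the noise variances `q` and `r` are assumed tuning values, and the full method tracks a whole parameter vector.

```python
import numpy as np

def track_ar1(y, q=1e-4, r=1.0):
    """Scalar Kalman filter tracking a slowly varying coefficient
    a_t in y_t = a_t * y_{t-1} + v_t, modelling a_t as a random
    walk with variance q and the residual v_t with variance r."""
    a_hat, p = 0.0, 1.0
    est = np.zeros(len(y))
    for t in range(1, len(y)):
        p += q                        # predict: random-walk coefficient
        h = y[t - 1]                  # "observation matrix" is y_{t-1}
        k = p * h / (h * h * p + r)   # Kalman gain
        a_hat += k * (y[t] - h * a_hat)
        p *= (1 - k * h)
        est[t] = a_hat
    return est

rng = np.random.default_rng(0)
n = 4000
a_true = 0.6 + 0.2 * np.sin(2 * np.pi * np.arange(n) / n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = a_true[t] * y[t - 1] + rng.standard_normal()
a_est = track_ar1(y)
```

The same predict/update recursion generalises to the vector case, which is what makes it attractive for pitch-synchronous time-varying coefficient estimation.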

ic971291.pdf




Spectral correlates of glottal waveform models: an analytic study

Authors:

Boris Doval, LIMSI, Orsay (France)
Christophe d'Alessandro, LIMSI, Orsay (France)

Volume 2, Page 1295

Abstract:

This paper deals with spectral representation of the glottal flow. The LF and KLGLOTT88 models of the glottal flow are studied. In the first part, we compute analytically the spectrum of the LF model. Formulas are then given for computing spectral tilt and the amplitudes of the first harmonics as functions of the LF-model parameters. In the second part, we consider the spectrum of the KLGLOTT88 model. It is shown that this model can be represented in the spectral domain by a third-order all-pole linear filter. Moreover, the anticausal impulse response of this filter is a good approximation of the glottal flow model. Parameter estimation seems easier in the spectral domain. Therefore, our results can be used to modify the (hidden) glottal flow characteristics of natural speech signals by processing the spectrum directly, without time-domain parameter estimation.

ic971295.pdf




A time varying ARMAX speech modeling with phase compensation using glottal source model

Authors:

Keiichi Funaki, Hokkaido University (Japan)
Yoshikazu Miyanaga, Hokkaido University (Japan)
Koji Tochinai, Hokkaido University (Japan)

Volume 2, Page 1299

Abstract:

This paper presents a new speech analysis method based on a Glottal-ARMAX (Auto-Regressive Moving Average eXogenous) model with phase compensation. A Glottal-ARMAX model consists of a vocal tract ARMAX model driven by two kinds of inputs: a glottal source model excitation and a white Gaussian input. The proposed method can simultaneously estimate the glottal source model and vocal tract ARMAX model parameters pitch-synchronously. ARMAX identification using a modified MIS (Model Identification System) method is adopted to estimate the ARMAX parameters, and a hybrid of a genetic algorithm (GA) and simulated annealing (SA) is employed to efficiently solve the non-linear simultaneous optimization of both parameter sets. Furthermore, phase compensation using an all-pass filter is introduced within a generation loop of the GA in order to compensate for phase distortion. Experiments using synthetic and natural speech demonstrate the efficacy of the proposed method.
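The simulated-annealing half of the hybrid optimiser follows the standard recipe of accepting an uphill move with probability exp(-delta/T) under a decaying temperature. The sketch below shows only that mechanism on a toy multimodal objective; the cooling schedule, step size, and the objective itself are illustrative assumptions, not the paper's glottal/vocal-tract cost function.

```python
import math
import random

def anneal(f, x0, step=1.0, t0=1.0, alpha=0.995, iters=2000, seed=0):
    """Minimal simulated-annealing loop: always accept downhill
    moves, accept uphill moves with probability exp(-delta/T),
    and geometrically cool the temperature T."""
    rng = random.Random(seed)
    x, fx, t = x0, f(x0), t0
    best, fbest = x, fx
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        fc = f(cand)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
        t *= alpha
    return best, fbest

# Toy multimodal objective; its global minimum lies near x = 2.16.
best, fbest = anneal(lambda x: (x - 2) ** 2 + math.sin(8 * x), -3.0)
```

The uphill-acceptance step is what lets the search escape the local optima that make the joint source/tract estimation non-linear; the GA supplies the population-level exploration around it.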

ic971299.pdf




Speech Representation and Transformation using Adaptive Interpolation of Weighted Spectrum: VOCODER Revisited

Authors:

Hideki Kawahara, ATR-HIP (Japan)

Volume 2, Page 1303

Abstract:

A simple new procedure called STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) has been developed. STRAIGHT uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region, and an excitation source design based on phase manipulation. It preserves the bilinear surface in the time-frequency region and allows for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, without further degradation due to the parameter manipulation.

ic971303.pdf




The weft: A representation for periodic sounds

Authors:

Dan Ellis, ICSI (U.S.A.)

Volume 2, Page 1307

Abstract:

For the problem of separating sound mixtures, periodicity is a powerful cue used by both human listeners and automatic systems. Short-term autocorrelation of subband envelopes, as in the correlogram, accounts for much perceptual data. We present a discrete representation of common-period sounds, derived from the correlogram, for use in computational auditory scene analysis: The weft describes a sound in terms of a time-varying periodicity and a smoothed spectral envelope of the energy exhibiting that period. Wefts improve on several aspects of previous approaches by providing, without additional grouping, a single, invertible element for each detected signal, and also a provisional solution to detecting and dissociating energy of different periodicities in a single frequency channel (unlike systems which allocate whole frequency channels to one source). We define the weft, describe the analysis procedure we have devised, and illustrate its capacity to separate periodic sounds from other signals.
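The correlogram front end the weft builds on can be sketched as subband-envelope autocorrelation. The FFT-mask filterbank, logarithmic band edges, and boxcar envelope smoothing below are illustrative simplifications, not the paper's auditory front end.

```python
import numpy as np

def correlogram_frame(x, sr, n_bands=8, max_lag=200):
    """One correlogram frame: crude FFT bandpass filterbank,
    half-wave-rectified and smoothed envelopes, then a short-term
    autocorrelation per band."""
    X = np.fft.rfft(x)
    edges = np.logspace(np.log10(100), np.log10(sr / 2), n_bands + 1)
    freqs = np.fft.rfftfreq(len(x), 1 / sr)
    ac = np.zeros((n_bands, max_lag))
    for b in range(n_bands):
        mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
        band = np.fft.irfft(X * mask, n=len(x))
        env = np.maximum(band, 0.0)                            # rectify
        env = np.convolve(env, np.ones(16) / 16, mode="same")  # smooth
        e = env - env.mean()
        full = np.correlate(e, e, mode="full")
        ac[b] = full[len(x) - 1 : len(x) - 1 + max_lag]
    return ac

sr = 8000
t = np.arange(2048) / sr
# Harmonic signal with fundamental 125 Hz (period 64 samples)
x = np.sin(2 * np.pi * 125 * t) + 0.5 * np.sin(2 * np.pi * 375 * t)
ac = correlogram_frame(x, sr)
summary = ac.sum(axis=0)   # summary autocorrelation across bands
```

Summing the per-band autocorrelations gives a summary function whose peak lag reveals the common period, which is exactly the quantity the weft then tracks over time, together with the per-band energy at that period.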

ic971307.pdf




A computationally efficient algorithm for calculating loudness patterns of narrowband speech

Authors:

Markus Hauenstein, University of Kiel (Germany)

Volume 2, Page 1311

Abstract:

Loudness patterns are closer to the human perception of sound waves than spectrograms. This paper describes how loudness patterns can be efficiently calculated with an allpass-transformed polyphase filterbank based on a mixed radix FFT and three subsequent non-linear stages that model masking effects in the frequency and time domain as well as loudness compression.

ic971311.pdf




Two-channel blind deconvolution for non-minimum phase impulse responses

Authors:

Ken'ichi Furuya, NTT HI Labs. (Japan)
Yutaka Kaneda, NTT HI Labs. (Japan)

Volume 2, Page 1315

Abstract:

A new blind deconvolution method is proposed for recovering an unknown source signal, which is observed through two unknown channels characterized by non-minimum phase impulse response filters. Conventional methods cannot estimate the non-minimum phase parts. Our method is based on computing the eigenvector corresponding to the smallest eigenvalue of the input correlation matrix and using a cost function to determine the order of the impulse response filter model. Multi-channel inverse filtering with the estimated impulse responses is used to recover the unknown source signal. Sub-band processing is also used to reduce the complexity of dealing with long impulse responses such as room impulse responses. Computer simulations show the effectiveness of our method.
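The smallest-eigenvector idea can be sketched in its textbook noise-free form via the two-channel cross-relation x1*h2 = x2*h1 (both equal s*h1*h2), whose null vector yields both channels up to a common scale. This is a simplification for illustration: it assumes the filter order is known and omits the paper's order-selection cost function and sub-band processing.

```python
import numpy as np

def convmtx(x, L):
    """Full convolution matrix T such that T @ h == np.convolve(x, h)
    for a length-L filter h."""
    T = np.zeros((len(x) + L - 1, L))
    for j in range(L):
        T[j:j + len(x), j] = x
    return T

def cross_relation_identify(x1, x2, L):
    """Estimate both channels (up to a common scale) from the
    singular vector of the cross-relation matrix associated with
    the smallest singular value, using x1 * h2 - x2 * h1 = 0."""
    A = np.hstack([convmtx(x1, L), -convmtx(x2, L)])
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    v = Vt[-1]                       # null vector (smallest sigma)
    h2, h1 = v[:L], v[L:]
    return h1, h2

rng = np.random.default_rng(2)
s = rng.standard_normal(400)          # unknown source
h1_true = np.array([1.0, -2.5, 1.0])  # zeros at 2 and 0.5: non-minimum phase
h2_true = np.array([0.7, 0.2, -0.3])
x1 = np.convolve(s, h1_true)
x2 = np.convolve(s, h2_true)
h1, h2 = cross_relation_identify(x1, x2, 3)
```

Because the method works on second-order relations between the two observations rather than on the spectrum of a single channel, nothing forces the recovered filters to be minimum phase, which is the property the abstract emphasises.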

ic971315.pdf




Variable Time-scale Modification of Speech using Transient information

Authors:

Sungjoo Lee, Pusan National University (Korea)
Hee Dong Kim, The University of Suwon (Korea)
Hyung Soon Kim, Pusan National University (Korea)

Volume 2, Page 1319

Abstract:

Conventional time-scale modification methods have the problem that as the modification rate gets higher, the time-scale modified speech signal becomes less intelligible, because they ignore the effect of articulation rate on speech characteristics. In this paper, we propose a variable time-scale modification method based on the knowledge that the timing information of transient portions of a speech signal plays an important role in speech perception. After identifying transient and steady portions of a speech signal, the proposed method achieves the target rate by modifying the steady portions only. The results of a subjective preference test indicate that the proposed method produces performance superior to that of the conventional SOLA method.
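The core idea, time-scaling only the steady portions, can be sketched with a deliberately crude frame classifier and integer frame repetition. This is an assumption-laden toy: the energy-ratio transient test, the non-overlapping frames, and the integer repetition factor stand in for the paper's transient detector and SOLA-style overlap-add.

```python
import numpy as np

def variable_stretch(x, frame=256, rate=2.0, thresh=2.0):
    """Slow a signal down by repeating only 'steady' frames;
    frames whose energy jumps sharply relative to the previous
    frame are treated as transients and kept at original timing."""
    frames = [x[i:i + frame] for i in range(0, len(x) - frame + 1, frame)]
    energies = np.array([float(np.sum(f ** 2)) for f in frames])
    reps = int(round(rate))
    out = []
    for i, f in enumerate(frames):
        prev = energies[i - 1] if i > 0 else energies[i]
        ratio = energies[i] / (prev + 1e-12)
        transient = ratio > thresh or ratio < 1.0 / thresh
        out.append(np.tile(f, 1 if transient else reps))
    return np.concatenate(out)

sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)       # one second of a steady tone
y = variable_stretch(x, rate=2.0)     # every frame is steady: ~2x longer
```

A fully steady input is stretched by the full rate, while a signal dominated by transients would pass through almost unchanged; real speech lands in between, which is what makes the effective rate "variable".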

ic971319.pdf




Speech Enhancement with Reduction of Noise Components in the Wavelet Domain

Authors:

Jong Won Seok, Kyungpook National University (Korea)
Keun Sung Bae, Kyungpook National University (Korea)

Volume 2, Page 1323

Abstract:

This paper addresses the general problem of removing additive background noise from noisy speech in the wavelet domain. Semisoft thresholding is used to remove noise components from the wavelet coefficients of the noisy speech. To prevent quality degradation of unvoiced sounds during denoising, unvoiced regions are classified first and then thresholded in a different way. Experimental results demonstrate that the proposed speech enhancement algorithm is very promising.
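Semisoft (also called firm) thresholding interpolates between hard and soft thresholding: coefficients below a lower threshold are zeroed, those above an upper threshold are kept, and the range in between is shrunk linearly. The sketch below pairs it with a one-level Haar transform; the Haar basis and the free thresholds `t1 < t2` are illustrative assumptions (in practice the thresholds derive from a noise estimate).

```python
import numpy as np

def semisoft_threshold(w, t1, t2):
    """Semisoft/firm threshold: 0 below t1, identity above t2,
    linear shrinkage in between."""
    w = np.asarray(w, dtype=float)
    return np.where(
        np.abs(w) <= t1,
        0.0,
        np.where(
            np.abs(w) >= t2,
            w,
            np.sign(w) * t2 * (np.abs(w) - t1) / (t2 - t1),
        ),
    )

def haar_denoise(x, t1, t2):
    """One-level Haar wavelet denoising sketch: threshold the
    detail coefficients, keep the approximation, reconstruct."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation band
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail band
    d = semisoft_threshold(d, t1, t2)
    y = np.empty(len(x), dtype=float)
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y
```

The paper's voiced/unvoiced distinction would enter here by choosing different thresholds per region, so that the noise-like detail coefficients of genuine unvoiced sounds are not shrunk away.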

ic971323.pdf




Blind Separation and Restoration of Signals Mixed in Convolutive Environment

Authors:

Jiangtao Xi, McMaster University (Canada)
James P. Reilly, McMaster University (Canada)

Volume 2, Page 1327

Abstract:

This paper proposes new neural network approaches for separating and restoring signals mixed through FIR channels. Firstly, a set of maximum-entropy-based training rules is developed. Secondly, a new scheme for restoring the original signals is proposed for the 2x2 case. Computer simulation results for speech signals are presented to verify the proposed approaches.

ic971327.pdf




Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator

Authors:

Eric Scheirer, Interval Research Corp. (U.S.A.)
Malcolm Slaney, Interval Research Corp. (U.S.A.)

Volume 2, Page 1331

Abstract:

We report on the construction of a real-time computer system capable of distinguishing speech signals from music signals over a wide range of digital audio input. We have examined 13 features intended to measure conceptually distinct properties of speech and/or music signals, and combined them in several multidimensional classification frameworks. We provide extensive data on system performance and the cross-validated training/test setup used to evaluate the system. For the datasets currently in use, the best classifier classifies with 5.8% error on a frame-by-frame basis, and 1.4% error when integrating long (2.4 second) segments of sound.
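One representative feature of the kind the system combines is the variance of the frame-wise zero-crossing rate: speech alternates voiced and unvoiced stretches, so its zero-crossing rate fluctuates far more than that of a sustained musical tone. The sketch below shows this single feature on synthetic signals; the frame size, the toy "speech-like" signal, and the function name are illustrative assumptions, not one of the paper's 13 features as implemented.

```python
import numpy as np

def zcr_variance(x, frame=256):
    """Variance of the frame-wise zero-crossing rate, a classic
    speech/music cue: high for speech-like alternation, low for
    steady tones."""
    rates = []
    for i in range(0, len(x) - frame + 1, frame):
        f = x[i:i + frame]
        rates.append(np.mean(np.abs(np.diff(np.sign(f))) > 0))
    return float(np.var(rates))

rng = np.random.default_rng(3)
tone = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
# Crude speech-like signal: alternating noisy and tonal stretches
chunks = [rng.standard_normal(800) if i % 2 else
          np.sin(2 * np.pi * 150 * np.arange(800) / 8000)
          for i in range(10)]
speechish = np.concatenate(chunks)
```

Thresholding one such scalar already separates these toy signals; the paper's contribution is showing how much further error drops when many conceptually distinct features are combined and decisions are integrated over multi-second segments.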

ic971331.pdf
