Chair: Yves Laprie, LORIA, France
Yves Laprie, LORIA (France)
Bruno Mathieu, LORIA (France)
This paper presents a novel approach to recovering articulatory trajectories from the speech signal using a variational calculus method and Maeda's articulatory model. The acoustic-to-articulatory mapping is generally assessed by a double criterion: the acoustic proximity of the results to the acoustic data and the smoothness of the articulatory trajectories. Most existing methods are unable to exploit the two criteria simultaneously, or at least at the same level. Our variational calculus approach, in contrast, combines the two criteria simultaneously and ensures global acoustic and articulatory consistency without further optimization. The method gives rise to an iterative process which optimizes a starting solution given by an improved lookup algorithm. Codebooks generated with an articulatory model sample the acoustic space nonuniformly owing to the nonlinearity of the acoustic-to-articulatory mapping. We therefore designed an improved lookup algorithm that builds realistic articulatory trajectories, which are not necessarily defined throughout the speech signal.
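The abstract does not spell out the functional form of the criterion, but the double-criterion idea can be illustrated as a joint minimization of acoustic mismatch and trajectory roughness over all frames. In the toy sketch below, a linear map F stands in for Maeda's articulatory-to-acoustic model and lam weights the smoothness term; all names, the optimizer, and the parameters are illustrative assumptions, not the authors' formulation.

```python
# Minimal sketch of the double-criterion trajectory optimization:
# J(P) = sum_t ||F p_t - a_t||^2  +  lam * sum_t ||p_{t+1} - p_t||^2,
# optimized jointly over the whole trajectory. F is a toy stand-in for
# Maeda's articulatory-to-acoustic model (assumption, not the paper's model).
import numpy as np

def optimize_trajectory(F, acoustics, p0, lam=1.0, steps=500, lr=1e-2):
    """acoustics: (T, m) observed acoustic vectors.
    p0: (T, n) starting trajectory, e.g. from a codebook lookup."""
    P = p0.copy()
    for _ in range(steps):
        grad = 2.0 * (P @ F.T - acoustics) @ F   # acoustic-proximity term
        diff = np.diff(P, axis=0)                # p_{t+1} - p_t
        grad[:-1] -= 2.0 * lam * diff            # smoothness term, lower end
        grad[1:] += 2.0 * lam * diff             # smoothness term, upper end
        P -= lr * grad
    return P

rng = np.random.default_rng(0)
F = rng.standard_normal((3, 4))                  # toy map: 4 params -> 3 acoustics
true_P = np.cumsum(0.1 * rng.standard_normal((50, 4)), axis=0)
A = true_P @ F.T + 0.01 * rng.standard_normal((50, 3))
P_hat = optimize_trajectory(F, A, p0=np.zeros((50, 4)))
```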
Takayuki Arai, International Computer Science Institute (U.S.A.)
Steven Greenberg, International Computer Science Institute (U.S.A.)
The spectrum of spoken sentences was partitioned into quarter-octave channels, and the onset of each channel was shifted in time relative to the others so as to desynchronize spectral information across the frequency axis. Human listeners are remarkably tolerant of cross-channel spectral asynchrony induced in this fashion: speech intelligibility remains relatively unimpaired until the average asynchrony spans three or more phonetic segments. This perceptual robustness is correlated with the magnitude of the low-frequency (3-6 Hz) modulation spectrum and thus highlights the importance of syllabic segmentation and analysis for robust processing of spoken language. High-frequency channels (>1.5 kHz) play a particularly important role when the spectral asynchrony is large enough to significantly reduce the power in the low-frequency modulation spectrum (analogous to acoustic reverberation). This may account for the deterioration of speech intelligibility among the hearing impaired under the conditions of acoustic interference (such as background noise and reverberation) characteristic of the real world.
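A rough sketch of the desynchronization manipulation, assuming Butterworth quarter-octave bands and uniformly distributed onset delays (the authors' exact filter bank and delay distribution are not given here):

```python
# Split the signal into quarter-octave bands and delay each band's onset by a
# random amount before resumming. Band edges, filter order, and the delay
# range are illustrative choices, not the published parameters.
import numpy as np
from scipy.signal import butter, sosfilt

def desynchronize(x, fs, f_lo=100.0, f_hi=6000.0, max_shift_ms=80.0, seed=0):
    rng = np.random.default_rng(seed)
    y = np.zeros_like(x)
    f = f_lo
    while f * 2 ** 0.25 <= min(f_hi, fs / 2):
        sos = butter(4, [f, f * 2 ** 0.25], btype="band", fs=fs, output="sos")
        band = sosfilt(sos, x)
        shift = int(rng.integers(0, int(max_shift_ms * 1e-3 * fs)))
        delayed = np.zeros_like(band)
        delayed[shift:] = band[: len(band) - shift]  # delay this band's onset
        y += delayed
        f *= 2 ** 0.25                               # next quarter-octave band
    return y
```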
Matthias Fröhlich, Drittes Physikalisches Institut Göttingen (Germany)
Dirk Michaelis, Drittes Physikalisches Institut Göttingen (Germany)
Hans Werner Strube, Drittes Physikalisches Institut Göttingen (Germany)
One important perceptual attribute of voice quality is breathiness. Since breathiness is generally regarded as being caused by glottal air leakage, acoustic measures related to breathiness may be used to distinguish between different physiological phonation conditions in pathological voices. Seven ``breathiness features'' described in the literature, plus one self-developed measure (the glottal-to-noise excitation ratio, GNE), are compared with respect to how well they distinguish between different well-defined pathological phonation mechanisms. It is found that only the GNE allows a distinction between all the pathological groups and both the normal and the aphonic reference groups. Furthermore, the GNE is among the measures showing the most significant distinctions between the different pathological phonation mechanism groups. The GNE should therefore be given preference over the other features for the independent assessment of glottal air leakage, or ``breathiness'', in moderately or highly disturbed voices.
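The GNE computation itself is due to the authors; the sketch below condenses its core idea under simplifying assumptions: inverse-filter the signal, take Hilbert envelopes of several frequency bands of the excitation, and correlate envelopes across bands. Glottal pulses excite all bands synchronously (high envelope correlation), whereas turbulent noise does not. The LPC order, band layout, and zero-lag correlation here are simplifications of the published procedure.

```python
# Condensed GNE-style measure: envelope correlation across bands of the
# LPC residual. Computed here on a single short voiced frame for simplicity.
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import butter, sosfilt, hilbert, lfilter

def gne_like(x, fs, order=13, bw=1000.0, step=500.0):
    # LPC inverse filtering (autocorrelation method) -> excitation estimate
    r = np.correlate(x, x, "full")[len(x) - 1 :][: order + 1]
    a = solve_toeplitz((r[:order], r[:order]), r[1 : order + 1])
    excitation = lfilter(np.concatenate(([1.0], -a)), [1.0], x)

    centers = np.arange(bw, fs / 2 - bw / 2, step)   # band center frequencies
    envs = []
    for fc in centers:
        sos = butter(4, [fc - bw / 2, fc + bw / 2], btype="band", fs=fs,
                     output="sos")
        env = np.abs(hilbert(sosfilt(sos, excitation)))  # Hilbert envelope
        envs.append((env - env.mean()) / (env.std() + 1e-12))

    # maximum envelope correlation between sufficiently separated bands
    best = 0.0
    for i in range(len(envs)):
        for j in range(i + 1, len(envs)):
            if (j - i) * step >= bw / 2:
                best = max(best, float(np.mean(envs[i] * envs[j])))
    return best
```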
Chang-Sheng Yang, Utsunomiya University (Japan)
Hideki Kasuya, Utsunomiya University (Japan)
An automatic method is proposed to jointly estimate formant and voice source parameters from a speech signal. A Rosenberg-Klatt model is used to approximate the voicing source waveform for voiced speech, whereas a white noise signal is assumed for unvoiced speech. The vocal tract characteristic is represented by an IIR filter. The formant and anti-formant values are calculated from the IIR filter coefficients, which are estimated using a subspace-based system identification algorithm, while an exhaustive search procedure with an error criterion defined in the frequency domain is applied to obtain the optimal source parameter values. An experiment has been performed to examine the performance of the proposed method on natural speech. The results show that the source parameters estimated by the method, such as the open and closure instants, are in good agreement with those defined on electroglottograph signals, and that the estimated formant values are also accurate.
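As a point of reference, here is a minimal sketch of the Rosenberg-Klatt voicing-source pulse assumed above; the exhaustive search would sweep parameters such as these against a frequency-domain error. Parameter names and defaults are illustrative.

```python
# Rosenberg-Klatt glottal flow: u(tau) = (27/4) * av * tau^2 * (1 - tau)
# over the open phase (tau in [0, 1], duration oq/f0), zero while closed.
import numpy as np

def rosenberg_klatt_pulse(fs, f0=120.0, oq=0.6, av=1.0, n_periods=3):
    t0 = int(round(fs / f0))                 # samples per fundamental period
    open_len = int(round(oq * t0))           # samples in the open phase
    tau = np.arange(open_len) / open_len
    period = np.zeros(t0)
    period[:open_len] = 27.0 / 4.0 * av * tau ** 2 * (1.0 - tau)
    return np.tile(period, n_periods)

u = rosenberg_klatt_pulse(16000)             # glottal flow; differentiate
                                             # for the flow-derivative source
```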
Thilo Pfau, Technical University of Munich (Germany)
Guenther Ruske, Technical University of Munich (Germany)
We present a new feature-based method for estimating the speaking rate by detecting vowels in continuous speech. The features used are the modified loudness and the zero-crossing rate, both of which are calculated in the standard preprocessing unit of our speech recognition system. As vowels generally correspond to syllable nuclei, the feature-based vowel rate is comparable to an estimate of the lexically based syllable rate. The vowel detector is tested on the spontaneously spoken German Verbmobil task and is evaluated using manually transcribed data. The lowest vowel error rate (including insertions) on the defined test set is 22.72% on average over all vowels. Additionally, correlation coefficients between our estimates and reference rates are calculated. These coefficients reach up to 0.796 and are therefore comparable to those reported for lexically based measures (such as the phone rate) on other tasks. The accuracy is sufficient to use our measure for speaking-rate adaptation.
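The abstract's front end is not reproduced here, but the detector's logic can be sketched with short-time energy standing in for the modified loudness; thresholds and frame size are illustrative assumptions.

```python
# Schematic vowel-rate estimator: vowels ~ frames with high loudness (here,
# normalized short-time energy) and low zero-crossing rate; contiguous vowel
# runs are counted as syllable nuclei.
import numpy as np

def vowel_rate(x, fs, frame_ms=10.0, e_thresh=0.5, z_thresh=0.1):
    n = int(fs * frame_ms / 1000)
    frames = x[: len(x) // n * n].reshape(-1, n)
    energy = (frames ** 2).mean(axis=1)
    energy /= energy.max() + 1e-12                    # normalize to [0, 1]
    zcr = np.diff(np.signbit(frames), axis=1).mean(axis=1)
    vowel = (energy > e_thresh) & (zcr < z_thresh)
    # rising edges of the vowel mask = detected vowel nuclei
    nuclei = int(np.count_nonzero(np.diff(vowel.astype(int)) == 1)) + int(vowel[0])
    return nuclei / (len(x) / fs)                     # vowels per second
```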
James DeLucia, Center for Communications Research (U.S.A.)
Fred Kochman, Center for Communications Research (U.S.A.)
We describe a new noniterative algorithm that generates the unique area function determined by the vocal tract length, the lip radius, and the spectral pair consisting of the poles of the transfer function and the zeros of the input impedance function. Our analysis is restricted to the class of piecewise-constant area functions defined on an even number of equal-length intervals. The resulting algorithm involves fewer floating-point operations per evaluation than the analogous method of Paige and Zue [4]. A method that uses a corpus of X-ray data to set the higher-order, unobservable pole/zero frequencies is also discussed.
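The inversion itself is not spelled out in the abstract, but the forward relation it inverts is the standard lossless-tube one; a sketch of that forward map, assuming the usual reflection-coefficient convention:

```python
# Piecewise-constant areas -> reflection coefficients -> predictor polynomial
# whose roots give the pole frequencies (lossless-tube model; the sampling
# rate is tied to the section length by fs = c / (2 * dx)).
import numpy as np

def areas_to_pole_freqs(areas, fs=8000):
    """areas: tube section areas from glottis to lips, equal lengths."""
    a = np.asarray(areas, float)
    k = (a[1:] - a[:-1]) / (a[1:] + a[:-1])      # reflection coefficients
    poly = np.array([1.0])
    for ki in k:                                  # step-up recursion
        poly = np.concatenate((poly, [0.0])) + ki * np.concatenate(([0.0], poly[::-1]))
    poles = np.roots(poly)
    freqs = np.angle(poles) * fs / (2 * np.pi)
    return np.sort(freqs[freqs > 0])              # formant-like frequencies

print(areas_to_pole_freqs([2.0, 1.2, 0.8, 1.5, 2.5, 4.0]))
```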
Gaguk Zakaria, Virginia Tech (U.S.A.)
Louis A. Beex, Virginia Tech (U.S.A.)
We propose the adaptive cascade recursive least squares (CRLS-SA) algorithm for the estimation of linear prediction, or AR model, coefficients. The CRLS-SA algorithm features low computational complexity, since each section is adapted independently of the other sections. It is shown here that, for some known signals, the CRLS-SA algorithm can yield AR coefficient estimates closer to the true values than the widely used autocorrelation method. CRLS-SA also converges faster to the true values of the model, which is critically important for estimation from short data records. While the computational effort of CRLS-SA is a factor of 3 to 4 higher than that of the autocorrelation method, the improvement in performance makes it a viable alternative for a number of applications.
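CRLS-SA itself is not specified in the abstract; as a concrete reference point, here is the autocorrelation-method baseline it is compared against, implemented with the Levinson-Durbin recursion:

```python
# Autocorrelation method for AR coefficient estimation via Levinson-Durbin.
import numpy as np

def levinson_durbin(x, order):
    """Solve the Yule-Walker equations; returns a with x[n] ~ sum_k a[k] x[n-1-k]."""
    r = np.correlate(x, x, "full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order)
    err = r[0]
    for m in range(order):
        k = (r[m + 1] - a[:m] @ r[1 : m + 1][::-1]) / err
        a[: m + 1] = np.concatenate((a[:m] - k * a[:m][::-1], [k]))
        err *= 1.0 - k ** 2
    return a, err

# Short-record check against a known AR(2) process
rng = np.random.default_rng(1)
x = np.zeros(64)
for n in range(2, 64):
    x[n] = 1.5 * x[n - 1] - 0.7 * x[n - 2] + rng.standard_normal()
print(levinson_durbin(x, 2)[0])        # approaches [1.5, -0.7]
```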
Athaudage CR Nandasena, Advanced Institute of Science and Technology (Japan)
Masato Akagi, Advanced Institute of Science and Technology (Japan)
In this paper a new approach to temporal decomposition (TD) of speech, called ``Spectral Stability Based Event Localizing Temporal Decomposition'' and abbreviated SBEL-TD, is presented. The original TD method proposed by Atal is known to suffer from high computational cost and from instability in the number and locations of events [1]. In SBEL-TD, event localization is performed on the basis of a maximum spectral stability criterion, which overcomes the event instability of Atal's method. SBEL-TD also avoids the computationally costly singular value decomposition routine used in Atal's method, resulting in a computationally simpler TD algorithm. Simulation results show that an average spectral distortion of about 1.5 dB can be achieved with LSFs as the spectral parameters. We have also shown that the temporal pattern of the speech excitation parameters can be well described using the SBEL-TD technique.
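A simplified rendering of the SBEL event-localization idea, assuming events are placed at local minima of the smoothed frame-to-frame parameter change (the published criterion may differ in detail):

```python
# Place TD events where the spectral trajectory is most stable, i.e. at
# local minima of the rate of spectral change.
import numpy as np

def locate_events(params, smooth=3):
    """params: (T, d) spectral parameters per frame (e.g. LSFs).
    Returns frame indices of maximal spectral stability."""
    rate = np.linalg.norm(np.diff(params, axis=0), axis=1)   # change per frame
    rate = np.convolve(rate, np.ones(smooth) / smooth, mode="same")
    return np.array([t for t in range(1, len(rate) - 1)
                     if rate[t] < rate[t - 1] and rate[t] <= rate[t + 1]])
```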