Session TMa Feature Estimation II, Pitch and Prosody

Chairman: Egidio Giachin, CSELT, Italy

A MODIFIED ZERO-CROSSING METHOD FOR PITCH DETECTION IN PRESENCE OF INTERFERING SOURCES

Authors: Francois GAILLARD, Frederic BERTHOMMIER, Gang FENG, Jean-Luc SCHWARTZ

Institut de la Communication Parlée, UPRESA 5009 46 avenue Félix Viallet 38031 GRENOBLE cedex 01 Tel: +33 04 76 57 47 15 FAX: +33 04 76 57 47 10, E-mail: gaillard@icp.grenet.fr

Volume 1 pages 445 - 448

ABSTRACT

This paper evaluates, from a speech signal processing standpoint, a non-linear pitch detection method based on detecting the zero-crossings of the signals (ZC method) under adverse interference conditions. First, F0 identification is evaluated as a function of the relative energy level of the components in mixtures of pure tones or pairs of vowels; we then introduce, in the double-vowel paradigm, a confidence measure based on the standard deviation of inter-zero intervals. Finally, the robustness of this confidence measure is tested in two cases of interference: pure tones + noise, and vowels + noise. We show that the method makes it possible to detect periodicity without any knowledge of the nature of the interfering sources, and then to identify their fundamental frequency.
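As a rough illustration of the idea of a zero-crossing pitch estimate with an interval-based confidence measure, the following minimal Python sketch may help. It is not the authors' implementation: the function names, the use of upward crossings only, the linear interpolation, and the normalisation of the standard deviation by the mean interval are all assumptions made for this example.

```python
import math

def upward_zero_crossings(signal, sample_rate):
    """Times (in seconds) of negative-to-positive crossings, linearly interpolated."""
    times = []
    for n in range(1, len(signal)):
        prev, cur = signal[n - 1], signal[n]
        if prev < 0.0 <= cur:
            frac = -prev / (cur - prev)  # position of the crossing between the two samples
            times.append((n - 1 + frac) / sample_rate)
    return times

def zc_pitch_estimate(signal, sample_rate):
    """Return (f0_hz, relative_spread), or (None, None) if there are too few crossings.

    relative_spread is the standard deviation of the inter-crossing intervals
    divided by their mean (an assumed normalisation); small values suggest a
    strongly periodic signal.
    """
    times = upward_zero_crossings(signal, sample_rate)
    intervals = [b - a for a, b in zip(times, times[1:])]
    if len(intervals) < 2:
        return None, None
    mean = sum(intervals) / len(intervals)
    variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    return 1.0 / mean, math.sqrt(variance) / mean

# Hypothetical usage: a clean 200 Hz tone versus a two-tone mixture.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate + 0.3) for n in range(1600)]
mixture = [s + 0.8 * math.sin(2 * math.pi * 330 * n / sample_rate)
           for n, s in enumerate(tone)]
tone_f0, tone_spread = zc_pitch_estimate(tone, sample_rate)
mix_f0, mix_spread = zc_pitch_estimate(mixture, sample_rate)
```

For the clean tone the estimate should lie close to 200 Hz with a spread near zero, while the interfering tone makes the intervals irregular and the spread larger, which is the kind of behaviour a confidence measure of this type relies on.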

A0020.pdf

USING SIMULATED ANNEALING EXPECTATION MAXIMIZATION ALGORITHM FOR HIDDEN MARKOV MODEL PARAMETERS ESTIMATION

Authors: J. Simonin & C. Mokbel

France Télécom - CNET - LAA/TSS/RCP Technopole Anticipa 2, Avenue Pierre Marzin, 22307 Lannion - FRANCE e-mail: simonin@lannion.cnet.fr

Volume 1 pages 449 - 452

ABSTRACT

This paper presents the use of a simulated annealing technique during the parameter estimation of a Hidden Markov Model (HMM) in a speech recognition system. This technique makes it possible to escape the local optima that characterize the classical Expectation Maximization (EM) algorithm, and thus to achieve a better estimate with a limited amount of training data. We choose here the Simulated Annealing Expectation Maximization (SAEM) algorithm, which introduces a simulated annealing technique into the EM method. The SAEM algorithm is compared to the classical EM algorithm for both task-independent and task-dependent Viterbi training. The evaluation shows a significant improvement in recognition performance.

A0111.pdf

COVARIATION OF SUBGLOTTAL PRESSURE, F0 AND GLOTTAL PARAMETERS

Authors: Gunnar Fant, Stellan Hertegard*, Anita Kruckenberg and Johan Liljencrants

Dept. of Speech, Music and Hearing, KTH, Stockholm, S-10044, E-mail gunnar@speech.kth.se *Huddinge University Hospital Stockholm

Volume 1 pages 453 - 456

ABSTRACT

Sub- and supraglottal pressures, Psub and Psup, have been recorded during glissando phonation of sustained vowels, isolated vowels, and continuous speech including a one-minute-long reading of a novel. Studies of the covariation of Psub, F0, voice excitation amplitude Ee and overall sound pressure level reveal systematic relations, some of which can be expressed in closed form by regression equations. Systematic differences in Psub with respect to position in a breathgroup, vowel and consonant category and the degree of stress have been observed. The domain of subglottal increase with stress is of the order of one or a few words rather than a single syllable. The global contour of Psub within a breathgroup and the local fine structures of Psub and transglottal pressure, Ptr = Psub - Psup, associated with specific articulatory events are described and discussed.

A0115.pdf

THE FRACTAL BEHAVIOUR OF UNVOICED PLOSIVES: A MEANS FOR CLASSIFICATION

Authors: Anastasios Delopoulos and Maria Rangoussi

Computer Science Division, Department of Electrical Engineering, National Technical University of Athens, Athens GR-15780, GREECE Tel.: +30 1 772 24 91, Fax: +30 1 772 24 92 e-mail: {adelo, maria}@image.ntua.gr

Volume 1 pages 457 - 460

ABSTRACT

Investigation of the fractal behaviour of unvoiced plosive consonants leads to interesting observations towards their classification. Experimental evidence of the fractal nature of the speech signals themselves, as well as of their derivatives and cumulative sums, prompts the use of the associated fractal dimensions to form a discriminative feature set. The obtained feature set is compact in representation and easy to compute. At the same time, the discriminating capability of this feature set is seen to be promising even for speech signals sampled at 8 kHz.

A0137.pdf

A METHOD FOR ANALYSIS OF THE LOCAL SPEECH RATE USING AN INVENTORY OF REFERENCE UNITS

Authors: Sumio Ohno, Hiroya Fujisaki and Hideyuki Taguchi

Department of Applied Electronics, Science University of Tokyo 2641 Yamazaki, Noda, 278 Japan

Volume 1 pages 461 - 464

ABSTRACT

Speech rate is one of the important prosodic parameters essential for the naturalness of an utterance, yet comparatively little is known about the fine structure of speech rate variation in natural utterances. On the basis of the authors' definition of the relative local speech rate, the present paper describes an analysis of the changes in the local rate of speech units, each produced in isolation, when they are embedded in connected speech. The results, together with those already obtained by the authors, will lead to a complete scheme for speech rate control in speech synthesis by concatenation.

A0141.pdf

ANALYSIS AND MODELING OF FUNDAMENTAL FREQUENCY CONTOURS OF GREEK UTTERANCES

Authors: Hiroya Fujisaki, Sumio Ohno and Takashi Yagi

Department of Applied Electronics, Science University of Tokyo 2641 Yamazaki, Noda, 278 Japan

Volume 1 pages 465 - 468

ABSTRACT

A quantitative model for the process of F0 contour generation, originally developed for Japanese by Fujisaki and his co-workers, has already been shown to be valid for several other languages. The present study aims at testing its applicability to F0 contours of Greek utterances. Analysis of F0 contours of 200 utterances by two native speakers of Greek, produced by reading texts of narrations and conversations, has shown that the model is essentially valid, and suggests the model's usefulness for Text-to-Speech synthesis of Greek.

A0142.pdf

CHARACTERISTICS OF SLOW, AVERAGE AND FAST SPEECH AND THEIR EFFECTS IN LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

Authors: F. Martinez, D. Tapias, J. Álvarez, P. Leon

Speech Technology Group Telefónica Investigación y Desarrollo, S.A. C/ Emilio Vargas, 6 28043 - Madrid (Spain) Tel. +34 1 337-42-52, FAX: +34 1 337-42-02, E-mail: daniel@craso.tid.es

Volume 1 pages 469 - 472

ABSTRACT

In this paper we report the characteristics of slow, average and fast speech. The study has been done using the TRESVEL Spanish database. It is composed of 3200 sentences uttered at three different speech rates and contains speech material from 20 male and 20 female speakers. This database has been designed to study, evaluate and compensate for the effect of speech rate in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. We report a new measure for the rate of speech (ROS). The ROS is normalised using an appropriate set of constants that depends on the expected duration of each phone. Finally, we report the degradation in performance of a continuous speech recognition system when the speech rate is low and high, and the evaluation of two compensation techniques. Adaptation of the language weight, insertion penalties and HMM state-transition probabilities for slow speech provides a 21.5% reduction of the word error rate (WER).

A0164.pdf

ANALYSIS OF CHILDREN'S SPEECH: DURATION, PITCH AND FORMANTS

Authors: Sungbok Lee, Alexandros Potamianos and Shrikanth Narayanan

AT&T Labs--Research, 180 Park Ave, P.O. Box 971, Florham Park, NJ 07932-0971, U.S.A. email: {sungbok,potam,shri}@research.att.com

Volume 1 pages 473 - 476

ABSTRACT

Magnitude and variability of duration, pitch and formant frequencies are computed for speech collected from five to eighteen year-old children. The study confirmed that reduction in magnitude and variability are the primary indicators of speech development. Specifically, children below age ten exhibit wider dynamic range of vowel duration, longer suprasegmental duration, and larger temporal and spectral variations. These trends diminish around age twelve. Children's speech acoustic characteristics fully develop to adult level in both magnitude and variability around age fifteen. Change of formant frequencies in male speakers parallels the growth of the vocal tract, while for female speakers the presence of such a linear trend is not clear. We conclude that the primary factors governing the acoustic patterns during speech development are anatomical maturation of the speech apparatus and speech motor control in terms of agility and precision.

A0185.pdf

A METHOD OF MEASURING FORMANT FREQUENCIES AT HIGH FUNDAMENTAL FREQUENCIES

Authors: Hartmut Traunmuller and Anders Eriksson

Dept. of Linguistics Stockholm University hartmut@ling.su.se Dept. of Phonetics Umeå University S-901 87 Umeå, Sweden anderse@ling.umu.se

Volume 1 pages 477 - 480

ABSTRACT

Accurate measurement of formant frequencies is important in many studies of speech perception and production. Errors in formant frequency estimation by eye, using a spectrogram, or automatically, using linear prediction, have been reported to be as high as 60 Hz at F0 < 300 Hz. This exceeds the typical auditory difference limens (DLs) for formant frequencies and is also greater than some of the variation that one would like to study, e.g., the acoustic effects of varying vocal effort. The problem becomes substantially worse when F0 is as high as 500 to 600 Hz, which is not uncommon in the speech of women and children at high vocal efforts. In comparison with ordinary linear predictive analysis, the method described here drastically reduces measurement errors, given that the formant frequency is not below or only slightly above F0 (which rarely happens in speech). It thus becomes possible to study formant frequency variation in speech material that hitherto could not be analysed meaningfully since the effects of interest were no larger than the probable errors in measurement.

A0333.pdf

Analysis of Speaking Rate Variations in Stress-timed Languages

Authors: Tom Brondsted and Jens Printz Madsen*

Center for PersonKommunikation Aalborg University, Fredrik Bajers Vej 7 A2, DK-9220 Aalborg Øst, Denmark. Tel. +45 96 35 86 36, FAX: +45 98 15 15 83, E-mail: tb,@cpk.auc.dk.

Volume 1 pages 481 - 484

ABSTRACT

This paper analyses speaking rate variations in English and Danish and relates them to problems encountered in speech recognition. Intra-speaker variability in speech rate is explained with reference to time equalisation of stress groups and utterances. Further, it is shown that certain natural classes of phonemes are more affected by speaking rate variations than others. Keywords: Phoneme modeling, rate of speech, phone duration, time equalisation of stress groups and phrases.

A0384.pdf

AUTOMATIC IDENTIFICATION OF PHONEME BOUNDARIES USING A MIXED PARAMETER MODEL

Authors: Paul Micallef, Ted Chilton

Dept. of Communications and Computer Engineering, University of Malta. School of Electronic Engineering, Information Technology and Mathematics, University of Surrey, UK. E-Mail: pjmica@eng.um.edu.mt; E.Chilton@ee.surrey.ac.uk

Volume 1 pages 485 - 488

ABSTRACT

The identification of phoneme boundaries in continuous speech is an important problem in speech recognition and synthesis. The use of robust parameters that allow a trained data set obtained from one language to be used for boundary identification in another language is being investigated. In particular, the use of mixed time-frequency rate parameters, and training on the change of the rate parameters at acoustic boundaries, are reported.

A0703.pdf

PITCH DETECTION RELIABILITY ASSESSMENT FOR FORENSIC APPLICATIONS

Authors: Serguei Koval, Veronika Bekasova, Michael Khitrov, Andrey Raev

Speech Technology Center, Sankt-Petersburg, Russia Tel./Fax:+7(812)3279297a E-mail: master@stc.rus.net

Volume 1 pages 489 - 492

ABSTRACT

For some tasks (e.g., forensic applications) it is vitally important to know the real pitch. There is therefore a need to check the correctness of a given pitch contour over long speech recordings, for any pitch detection method. It would also be useful to see the degree of signal periodicity without a hard voice/noise decision. A new homomorphic method for detecting the degree of signal periodicity, with the analysis frame length proportional to the time lag, is described. A powerful working approach to the visual analysis of speech signal periodicity is proposed. This Foicograms method of representing speech periodicity ensures practically correct pitch estimation and makes it possible to find the degree of periodicity of poor-quality signals.

A0837.pdf

EFFICIENT ESTIMATION OF PERCEPTUAL FEATURES FOR SPEECH RECOGNITION

Authors: Zhihong Hu and Etienne Barnard

Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology, 20000 N.W. Walker Road, P.O. Box 91000, Portland, OR 97291-1000, USA, (zhihong@cse.ogi.edu)

Volume 1 pages 493 - 496

ABSTRACT

A number of studies have shown that a pair of perceptual effective formants can be defined to capture most of the phonetic information present in vowels. Various methods of computing the effective formant values were proposed. However, many of them depend on the accuracy of conventional formant estimation. In this work, we study methods of automatically estimating perceptual effective formants without estimating the actual formant values and compare the results with the perceptually measured effective formant values. The preliminary results show that the method is effective in estimating the perceptual effective formants. Classification experiments using perceptual effective formants as explicit features do not demonstrate any advantages. However, using the perceptual effective second formant value as input to our formant estimation algorithm can help to correct up to 44% of the formant tracking errors.

A0928.pdf

Towards decomposing the sources of variability in speech

Authors: Narendranath Malayath (1), Hynek Hermansky (1),(2) and Alexander Kain (1)

(1) Oregon Graduate Institute of Science and Technology, Portland, Oregon, USA. (2) International Computer Science Institute, Berkeley, California, USA.

Volume 1 pages 497 - 500

ABSTRACT

In this paper a method to decompose a conventional feature space (LPC-cepstrum) into sub-spaces which carry information about the linguistic and speaker variability is presented. Principal component analysis is used to study the correlation between these sub-spaces. Oriented principal component analysis (OPCA) is then used to estimate a sub-space which is relatively speaker-independent. A method to estimate the dimensionality of the speaker-independent sub-space is also presented. Original features can now be projected into the speaker-independent sub-space to make them less sensitive to speaker variations. Finally, the effectiveness of the proposed method in suppressing the speaker dependence is studied by experiments conducted on two different databases.

A0942.pdf

USE OF VECTOR-VALUED DYNAMIC WEIGHTING COEFFICIENTS FOR SPEECH RECOGNITION: MAXIMUM LIKELIHOOD APPROACH

Authors: Rathinavelu Chengalvarayan

Currently at: Speech Processing Group, Bell Labs Lucent Technologies, Naperville, IL 60566, USA Tel: (630) 224 6398, Fax: (630) 979 5915 Email: rathi@lucent.com

Volume 1 pages 501 - 504

ABSTRACT

In this paper, an integrated approach to vector dynamic feature extraction is proposed in the design of a hidden Markov model (HMM) based speech recognizer. The integrated model we developed in this study generalizes the conventional, currently widely used dynamic-parameter technique, which has been confined strictly to the preprocessing domain only, in two significant ways. First, the new model contains state-dependent, vector-valued weighting functions responsible for transforming static speech features into the dynamic ones in a slowly time-varying manner. Second, a novel maximum-likelihood based training algorithm is developed for the model that allows joint optimization of the state-dependent, vector-valued weighting functions and the remaining conventional HMM parameters. The experimental results on alphabet classification demonstrate the effectiveness of the new model relative to standard HMM using dynamic features that have not been subject to optimization during training.
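For orientation, the conventional dynamic-parameter technique that the abstract contrasts against is commonly realised as a fixed regression over neighbouring static frames. The Python sketch below shows only that fixed-weight baseline, not the state-dependent, vector-valued weighting proposed in the paper; the function name, the window of 2 frames and the edge handling by frame replication are assumptions for illustration.

```python
def delta_features(frames, window=2):
    """Fixed-weight delta (dynamic) features computed from static feature frames.

    frames: list of equal-length lists of floats, one list per time frame.
    Uses the common regression formula
        d[t] = sum_k k * (c[t + k] - c[t - k]) / (2 * sum_k k^2),  k = 1..window,
    replicating the first and last frame at the edges (an assumed convention).
    """
    denom = 2 * sum(k * k for k in range(1, window + 1))
    num_frames = len(frames)
    deltas = []
    for t in range(num_frames):
        row = []
        for dim in range(len(frames[0])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, num_frames - 1)][dim]
                behind = frames[max(t - k, 0)][dim]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas

# Hypothetical usage: a one-dimensional feature rising by 2 per frame.
ramp = [[2.0 * t] for t in range(10)]
ramp_deltas = delta_features(ramp)
```

In this baseline the weights k / (2 * sum k^2) are the same for every frame and every feature dimension and are never trained, which is the restriction the abstract describes as being lifted by making the weights state-dependent and optimising them jointly with the HMM parameters.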

A0986.pdf

AUTOMATIC SEGMENTATION: DATA-DRIVEN UNITS OF SPEECH

Authors: S. W. Beet and L. Baghai-Ravary

Aculab plc Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK. Tel. +44 1908 273961; Fax. +44 1908 273801 Steve.Beet@aculab.com; Ladan.Ravary@aculab.com

Volume 1 pages 505 - 508

ABSTRACT

An algorithm is presented which allows non-parametric representations of speech to be automatically segmented into units of comparable duration and character to manually-defined phonemes. The consistency of this segmentation across speakers, and across telephone channels, is investigated and the implications of adopting such forms of data for automatic speech recognition are discussed.

A1040.pdf

ON ROBUST TIME-VARYING AR SPEECH ANALYSIS BASED ON T-DISTRIBUTION

Authors: Dejan Bajic

Institute of Applied Mathematics and Electronics, Kneza Miloša 37, 11000 Belgrade, Yugoslavia fax: +381 11 186-105, e-mail: EBAJICD@UBBG.ETF.BG.AC.YU

Volume 1 pages 509 - 512

ABSTRACT

In this paper a new robust non-recursive algorithm for parameter estimation of the AR model of the speech signal is proposed. The proposed algorithm takes into account the quasi-periodic excitation for voiced speech and assumes a t-distribution of the excitation signal with a small number of degrees of freedom a. The method is based on covariance linear prediction with a sliding window. Experiments on both synthesized and natural speech have shown that the proposed robust algorithm gives estimates with smaller variance and bias, compared to the conventional non-robust algorithm. The choice a = 3 leads to the most efficient estimation.

A1100.pdf

A SIMPLE PHONEME ENERGY MODEL FOR THE GREEK LANGUAGE AND ITS APPLICATION TO SPEECH RECOGNITION

Authors: Dimitris Tambakas, Iliana Tzima, Nikos Fakotakis, George Kokkinakis

Wire Communications Laboratory, University of Patras, 261 10 Patras, Greece Tel:+30 61 991722, Fax:+30 61 991855, E-Mail:tambakas@wcl.ee.upatras.gr

Volume 1 pages 513 - 516

ABSTRACT

This paper deals with the improvement of autosegmentation algorithms by establishing and implementing a simple energy model. This model consists of rules which describe the variation of the phoneme energy at the phoneme boundaries due to the phoneme context. The efficient estimation of phoneme boundaries leads to improved accuracy of phoneme-based, large vocabulary speech recognition systems, as shown by experiments in the Greek language.

A1182.pdf

A MACROSCOPIC ANALYSIS OF AN EMOTIONAL SPEECH CORPUS

Authors: J.E.H. Noad (1), S.P. Whiteside (1) and P.D. Green (2)

(1) Department of Human Communication Sciences and (2) Department of Computer Science University of Sheffield Sheffield, S10 2TN, England. J.E.Noad@shef.ac.uk, S.Whiteside@shef.ac.uk, P.Green@dcs.shef.ac.uk

Volume 1 pages 517 - 520

ABSTRACT

Macroscopic analysis of a corpus of emotional Standard Southern British speech signals has been performed to measure any changes in average fundamental frequency, speech rate, energy and first formant frequency. Seven acted emotional states were recorded and analysed for one male and one female speaker. Differences between neutral and emotional speech were found which agree with changes others have mentioned in the literature. Only the emotion sadness was found to be consistently and obviously different from neutral, while high activity emotions (such as elation and hot anger) could be distinguished from sadness. Additional measures are being developed which will further discriminate the emotions from one another. Results obtained to date are being evaluated for use in a speech synthesiser system.

A1239.pdf

RESTORATION OF PITCH PATTERN OF SPEECH BASED ON A PITCH GENERATION MODEL

Authors: Hiroshi Shimodaira, Mitsuru Nakai and Akihiro Kumata

School of Information Science, Japan Advanced Institute of Science and Technology, Tatsunokuchi, Ishikawa, 923-12 Japan E-mail: sim@jaist.ac.jp

Volume 1 pages 521 - 524

ABSTRACT

In this paper a model-based approach for restoring a continuous fundamental frequency (F0) contour from the noisy output of an F0 extractor is investigated. In contrast to conventional pitch trackers based on numerical curve-fitting, the proposed method estimates the continuous F0 pattern with a quantitative pitch generation model of the kind often used for synthesizing F0 contours from prosodic event commands. An inverse filtering technique is introduced for obtaining the initial candidates of the prosodic commands. In order to find the optimal command sequence from the candidates efficiently, a beam-search algorithm and an N-best technique are employed. Preliminary experiments for a male speaker of the ATR B-set database showed promising results both in the quality of the restored pattern and in the estimation of the prosodic events.

A1241.pdf

The Research of Correlation Between Pitch and Skin Galvanic Reaction at Change of Human Emotional State

Authors: A.V.Agranovski, O.Y.Berg, D.A.Lednov

SPETSVUZAVTOMATIKA DESIGN BUREAU, 51 Gazetny St., Rostov-on-Don, Russia, e-mail: asni@ns.rnd.runnet.ru

Volume 1 pages 525 - 528

ABSTRACT

A1245.pdf

K-NN VERSUS GAUSSIAN IN HMM-BASED RECOGNITION SYSTEM

Authors: Claude Montacié, Marie-José Caraty and Fabrice Lefèvre

LIP6 - Université Pierre et Marie Curie - CNRS 4, place Jussieu - 75252 Paris Cedex 5 - France Tel. (33/0) 1 44 27 62 81, FAX (33/0) 1 44 27 70 00, e-mail: montacie@laforia.ibp.fr

Volume 1 pages 529 - 532

ABSTRACT

For many years, the K-Nearest Neighbours method (K-NN) has been known as one of the best probability density function (pdf) estimators. A fast K-NN algorithm has been developed and tested on the TIMIT database with a gain in computational time of 99.8%. The K-NN decision principle has been assessed on frame-by-frame phonetic identification. A method to integrate the K-NN pdf estimator into an HMM-based system is proposed and tested on an acoustic-phonetic decoding task. Finally, preliminary experiments are performed on HMM topology inference.
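To make the notion of K-NN as a pdf estimator concrete, the textbook estimate is p(x) = k / (N * V), where N is the number of training samples and V is the volume of the smallest ball around x that contains its k nearest neighbours. The Python sketch below is a brute-force illustration of that formula only; it is not the fast algorithm of the paper, and the function name and the Euclidean distance are assumptions.

```python
import math

def knn_density(x, samples, k):
    """Brute-force k-nearest-neighbour density estimate at point x.

    x: sequence of d floats; samples: list of sequences of d floats.
    Returns k / (N * V), with V the volume of the d-dimensional Euclidean
    ball whose radius is the distance from x to its k-th nearest sample.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    dim = len(x)
    distances = sorted(math.dist(x, s) for s in samples)
    radius = distances[k - 1]
    if radius == 0.0:
        raise ValueError("k-th neighbour coincides with x; the estimate is unbounded")
    unit_ball_volume = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    return k / (len(samples) * unit_ball_volume * radius ** dim)

# Hypothetical usage: samples spread evenly over [0, 1], so the true density is 1.
grid = [(i / 100,) for i in range(101)]
density_at_centre = knn_density((0.5,), grid, k=10)
```

The sorted list of all distances makes each query cost O(N log N), which hints at why a fast search is needed before such an estimator can replace Gaussian densities frame by frame in an HMM-based system.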

A1257.pdf

SPECTRAL METHODS FOR VOICE SOURCE PARAMETERS ESTIMATION

Authors: Boris Doval, Christophe d'Alessandro, Benoit Diard

LIMSI-CNRS, BP 133, F91403 Orsay, France. E-mail: doval@limsi.fr, cda@limsi.fr, diard@limsi.fr

Volume 1 pages 533 - 536

ABSTRACT

A spectral approach is proposed for voice source parameter representation and estimation. Parameter estimation is based on decomposition of the periodic and the aperiodic components of the speech signal, and on spectral modelling of the periodic component. The paper focusses on parameter estimation for the periodic component of the glottal flow. A new anticausal all-pole model of the glottal flow is derived. Glottal flow is seen as an anticausal 2-pole filter followed by a spectral tilt filter. The anticausal filter has complex poles, instead of the real poles that are usually assumed. Time-domain and frequency-domain parameters are linked by analytic formulas. Two spectral-domain algorithms are proposed for estimation of the open quotient. The first one is based on measurement of the first harmonics, and the second one is based on spectral modelling. Experimental results demonstrate the accuracy of the estimation procedures.

A1353.pdf