Session TAA: Speech Analysis & Modelling

Chairperson: Pierre Badin (ICP, INPG, France)


ACOUSTIC AND PERCEPTUAL PROPERTIES OF PHONEMES IN CONTINUOUS SPEECH AS A FUNCTION OF SPEAKING RATE

Authors: Hisao Kuwabara

Department of Electronics and Information Science, Teikyo University of Science & Technology, Uenohara, Kitatsuru-gun, Yamanashi 409-01, Japan. Tel. +81.554.63.4411, Fax +81.554.63.4431, E-mail: kuwabara@ntu.ac.jp

Volume 2 pages 1003 - 1006

ABSTRACT

The durations of individual phonemes have been investigated in continuous speech spoken at three rates: fast, normal, and slow. Fifteen short sentences uttered by four male speakers served as the speech material, comprising a total of 291 morae. The normal speaking rate (n-speech) averages 150 ms/mora (400 morae/minute), and the four speakers were asked to read the sentences twice as fast (f-speech) and at half the normal speed (s-speech), taking the n-speech as reference. Among consonants, the greatest rate effect in f-speech falls on the syllabic nasal /N/ and the smallest on the voiceless stop /t/. In s-speech, /N/ is again the most affected, while the voiced stop /d/ is the least affected. The duration ratio between consonant and vowel within a CV syllable remains nearly the same in f-speech as in n-speech, whereas vowel lengthening becomes pronounced in s-speech. As expected, vowel formant frequencies differ significantly across the three rates: the five vowels move closer together in the F1-F2 plane as the speaking rate increases, reflecting vowel neutralization. The average difference in the third formant, however, is very small.
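
The rate arithmetic above is easy to verify; a minimal Python sketch (the 150 ms/mora figure is the paper's reported average, the conversion to morae/minute is ours):

    # Mora-rate arithmetic for the three speaking conditions.
    MS_PER_MORA_NORMAL = 150.0                       # reported n-speech average

    def morae_per_minute(ms_per_mora):
        return 60_000.0 / ms_per_mora

    print(morae_per_minute(MS_PER_MORA_NORMAL))      # 400.0 morae/min (n-speech)
    print(morae_per_minute(MS_PER_MORA_NORMAL / 2))  # 800.0 morae/min (f-speech)
    print(morae_per_minute(MS_PER_MORA_NORMAL * 2))  # 200.0 morae/min (s-speech)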

A0005.pdf



NEW RESULTS IN VOWEL PRODUCTION: MRI, EPG, AND ACOUSTIC DATA

Authors: Shrikanth Narayanan (1), Abeer Alwan (2), Yong Song (2)

(1) AT&T Labs-Research, 180 Park Avenue, Florham Park, NJ 07932; (2) Department of Electrical Engineering, UCLA, 405 Hilgard Avenue, Los Angeles, CA 90095

Volume 2 pages 1007 - 1010

ABSTRACT

MRI, EPG, and acoustic data for the vowels /a, i, u/ are analyzed. The vocal tract geometry, tongue shapes, and inter-subject variability are studied, and the data are used to study the articulatory-acoustic relations of these sounds.

A0160.pdf



THE TEMPORAL PROPERTIES OF SPOKEN JAPANESE ARE SIMILAR TO THOSE OF ENGLISH

Authors: Takayuki Arai and Steven Greenberg

International Computer Science Institute, 1947 Center Street, Berkeley, CA 94704, USA, and University of California, Berkeley. {arai,steveng}@icsi.berkeley.edu

Volume 2 pages 1011 - 1014

ABSTRACT

The languages of the world are generally classified into two types on the basis of their segmental timing. "Syllable-timed" languages, such as Japanese, are considered isochronous, exhibiting a highly regular pattern of syllabic duration. In contrast stand the "stress-timed" languages, such as English, whose syllable timing varies greatly, both within and across sentential domains. The present study demonstrates that, even in a language as theoretically isochronous as Japanese, the duration of syllabic segments is as variable as that of their English counterparts. Moreover, the variability of moraic duration is as high as that observed for syllabic units. Two measures of segmental timing, syllable duration and the low-frequency modulation spectrum, indicate that the coarse temporal characteristics of English and Japanese are remarkably similar. Such common properties may reflect inherent temporal characteristics of the physiological mechanisms underlying the production and perception of speech that are shared by all languages of the world.
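
The low-frequency modulation spectrum used as the second timing measure can be sketched as follows: extract the amplitude envelope, downsample it, and take its magnitude spectrum. The cutoff and rates below are our assumptions, not the authors' exact recipe.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def modulation_spectrum(x, fs, env_cutoff_hz=28.0, env_fs=100.0):
        """Magnitude spectrum of the slow amplitude fluctuations of x."""
        # 1. Amplitude envelope: full-wave rectify, then low-pass filter.
        sos = butter(4, env_cutoff_hz / (fs / 2), btype="low", output="sos")
        env = sosfiltfilt(sos, np.abs(x))
        # 2. Downsample; the envelope now has no content above env_cutoff_hz.
        env = env[::int(fs / env_fs)]
        env = env - np.mean(env)
        # 3. Windowed FFT of the envelope: the modulation spectrum.
        spec = np.abs(np.fft.rfft(env * np.hanning(len(env))))
        freqs = np.fft.rfftfreq(len(env), d=1.0 / env_fs)
        return freqs, spec

For read speech, the region around the syllable rate (a few Hz) is where the two languages can be compared.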

A0173.pdf



THE AMPLITUDES OF THE PEAKS IN THE SPECTRUM: DATA FROM [a] CONTEXT

Authors: Anna Esposito

International Institute for Advanced Scientific Studies (IIASS), Via G. Pellegrino 19, I-84019 Vietri sul Mare (SA), Italy. E-mail: annesp@vaxsa.csied.unisa.it

Volume 2 pages 1015 - 1018

ABSTRACT

This work studies the properties of the sound spectrum at the release of Italian stop consonants in vocalic contexts. The aim is to check whether the amplitudes of the peaks in the spectrum can be used as acoustic attributes of the consonants' place of articulation. This information is useful for defining an automatic algorithm that discriminates among different places of articulation using simple data, such as the values, in dB, of the maximum peaks in different frequency ranges. Moreover, different measurements have been performed (spectra computed at the release, averaged over 10 ms after the release, and from a smoothed spectrum) in order to determine which measure retains more information about peak amplitudes.
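
The "maximum peaks in different frequency ranges" measurement can be sketched directly; a short analysis frame at the burst release is assumed, and the band edges below are placeholders, not the paper's:

    import numpy as np

    def band_peaks_db(frame, fs, bands=((0, 1000), (1000, 2500), (2500, 4000))):
        """Maximum spectral amplitude, in dB, within each frequency range."""
        spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        spec_db = 20 * np.log10(spec + 1e-12)            # avoid log(0)
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        return [spec_db[(freqs >= lo) & (freqs < hi)].max() for lo, hi in bands]

The resulting per-band maxima are exactly the kind of simple dB values the proposed discrimination algorithm would take as input.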

A0200.pdf



ACOUSTICAL CHARACTERISTICS OF SPEECH AND VOICE IN SPEECH PATHOLOGY

Authors: Natalija Bolfan-Stosic and Mladen Hedjever

Acoustic Laboratory for Speech and Hearing, Department of Logopedics, Faculty of Defectology, University of Zagreb, Kuslanova 59-a, 10000 Zagreb, Croatia. Tel. +385 1 2338 022, Fax: +385 1 229 950, E-mail: mladen@antun.defekt.hr

Volume 2 pages 1019 - 1022

ABSTRACT

Thirty-six hoarse voices of preschool children and fifty speech productions of school children were analyzed acoustically with a Bruel & Kjaer Type 2123 real-time frequency analyzer. The thirty-six oscillograms of sustained vowel productions were divided by shape into three subgroups with characteristic shimmer values within each group. The purpose of this part of our research is to aid the recognition and use of acoustic terms in the diagnosis of disordered voices. We thus distinguish "staccato" shimmer, "narrow" total-intensity shimmer, and "wide" shimmer with accompanying jitter oscillations. The differences in fundamental frequency and intensity between the three shimmer subgroups were analysed using one-way analysis of variance. The purpose of the second part of this research was to examine and analyze temporal segments in normal and disordered speech. Temporal segments of school children's speech were measured starting from the subsound level (VOT, voice onset time, and SGD, stop gap duration) and also at the levels of sound, syllable, and word. Principal axis analysis showed specific differences between normal and pathological speech in all types of variables.
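
Shimmer and jitter have standard cycle-to-cycle definitions; a minimal sketch, assuming the per-cycle periods and peak amplitudes have already been extracted from the oscillogram (these are the usual local measures, not necessarily the analyzer's exact ones):

    import numpy as np

    def jitter_percent(periods_ms):
        """Mean absolute difference of adjacent periods, as % of the mean period."""
        p = np.asarray(periods_ms, dtype=float)
        return 100.0 * np.mean(np.abs(np.diff(p))) / np.mean(p)

    def shimmer_db(peak_amplitudes):
        """Mean absolute dB difference between adjacent cycle peak amplitudes."""
        a = np.asarray(peak_amplitudes, dtype=float)
        return np.mean(np.abs(20 * np.log10(a[1:] / a[:-1])))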

A0359.pdf




PRONUNCIATION MODELING APPLIED TO AUTOMATIC SEGMENTATION OF SPONTANEOUS SPEECH

Authors: Andreas Kipp, Maria-Barbara Wesenick, Florian Schiel

IPSK, University of Munich. kip|schiel|wesenick@phonetik.uni-muenchen.de

Volume 2 pages 1023 - 1026

ABSTRACT

In this paper, two different models of pronunciation are presented: the first is based on a rule set compiled by an expert, while the second is statistical, exploiting a survey of pronunciation variants occurring in training data. Both models generate pronunciation variants from the canonical forms of words. The two models are evaluated by applying them to the task of automatic segmentation of speech and comparing the results to manual segmentations of the same speech data. Results show that the correspondence between manual and automatic segmentations can be significantly improved if pronunciation variants are taken into account. The statistical model outperforms the rule-based model.
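
A toy sketch of the rule-driven variant generation (the rules and the word are invented for illustration; the paper's rule set is expert-compiled and far larger): each optional rule rewrites a phone substring of the canonical form, and applying every subset of the applicable rules yields the variant set.

    from itertools import combinations

    RULES = [("@ n", "n"), ("t s", "ts")]        # hypothetical reduction rules

    def variants(canonical, rules=RULES):
        applicable = [r for r in rules if r[0] in canonical]
        forms = set()
        for k in range(len(applicable) + 1):
            for subset in combinations(applicable, k):
                form = canonical
                for old, new in subset:
                    form = form.replace(old, new)
                forms.add(form)
        return sorted(forms)

    print(variants("h a: b @ n"))    # ['h a: b @ n', 'h a: b n']

The statistical model would instead attach probabilities to such rewrites, estimated from the variants observed in the training data.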

A0380.pdf



DYNAMIC AND STATIC IMPROVEMENTS TO LEXICAL BASEFORMS

Authors: Simon Downey and Richard Wiseman

Speech Technology Unit, BT Laboratories, Martlesham Heath, Suffolk, UK. downey@saltfarm.bt.co.uk, richard@saltfarm.bt.co.uk

Volume 2 pages 1027 - 1030

ABSTRACT

One limitation of many speaker-independent recognition systems is their dependence on a single baseform dictionary to model word pronunciations. These dictionaries typically contain only a single ('ideal') pronunciation for each word. Previous work on improving dictionary models to include multiple pronunciations has met with mixed success: the alternatives may increase ambiguity in some cases. This paper investigates two approaches to improving lexical baseforms. The first is a 'bottom-up' approach in which 'ideal' transcriptions of utterances, looked up in a pronunciation dictionary, are compared to phoneme-level hand-annotated transcriptions. Analysing the differences between the two transcriptions reveals many common mispronunciations, accent-based alternatives, false starts, and incorrect word substitutions. Each of these problems is illustrated in the paper, where it is also shown that unfamiliar words are prone to large numbers of alternative pronunciations. The second approach is more 'top-down': phonologically motivated rules and transforms are described which modify the lexical representation of the utterance, from which a pronunciation network is derived. This approach has the advantage of being able to model cross-word coarticulation effects explicitly, whereas the former approach models them implicitly to a certain extent. The relative merits of each technique are investigated in a set of experiments performed on a phonetically rich database.
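
The bottom-up comparison step amounts to aligning two phone sequences and inspecting the mismatches; a sketch using a standard edit-distance alignment (Python's difflib; the phone symbols are invented for illustration):

    import difflib

    def transcription_diffs(ideal, observed):
        """Yield (op, ideal_phones, observed_phones) for each mismatch."""
        sm = difflib.SequenceMatcher(a=ideal, b=observed, autojunk=False)
        for op, i1, i2, j1, j2 in sm.get_opcodes():
            if op != "equal":
                yield op, ideal[i1:i2], observed[j1:j2]

    ideal    = ["dh", "ax", "k", "ae", "t"]      # dictionary baseform
    observed = ["dh", "ah", "k", "ae"]           # hand annotation
    for diff in transcription_diffs(ideal, observed):
        print(diff)    # ('replace', ['ax'], ['ah']), then ('delete', ['t'], [])

Counting such mismatches over a corpus is what surfaces the common mispronunciations and accent-based alternatives mentioned above.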

A0419.pdf



SIGNAL DRIVEN GENERATION OF WORD BASEFORMS FROM FEW EXAMPLES

Authors: Andreas Hauenstein

pc-plus GmbH, Munich, Germany. E-mail: hauensteina@acm.org

Volume 2 pages 1031 - 1034

ABSTRACT

The work described in this paper attempts to automatically generate word baseforms as used in the pronunciation dictionaries of large-vocabulary speech recognition systems. The input to the algorithm consists of several sample utterances per word; no additional information, such as word spelling, is used. The task involves determining a suitable inventory of subword units (SWUs) as well as determining the baseforms themselves. Experiments show that improvements over a triphone-based dictionary are possible with fewer than ten sample utterances per word if test and training vocabularies are different. A possible application would be a system based on a fixed inventory of HMM models that needs to be adapted to different vocabularies.
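
One simple way to turn several decoded sample utterances into a single baseform is to keep the medoid, i.e. the unit string closest on average to the others; a sketch under that assumption (the paper's actual SWU inventory design is more involved):

    def edit_distance(a, b):
        """Levenshtein distance between two subword-unit sequences."""
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
            prev = cur
        return prev[-1]

    def medoid_baseform(decodings):
        """Pick the decoding with minimum total distance to all the others."""
        return min(decodings,
                   key=lambda d: sum(edit_distance(d, e) for e in decodings))

    samples = [["s", "eh", "m"], ["s", "eh", "v", "n"], ["s", "eh", "v", "n"]]
    print(medoid_baseform(samples))    # ['s', 'eh', 'v', 'n']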

A0459.pdf



MODELLING THE ACOUSTIC DIFFERENCES BETWEEN L1 AND L2 SPEECH: THE SHORT VOWELS OF AFRIKAANS AND SOUTH AFRICAN ENGLISH

Authors: Elizabeth C. Botha* and Louis C.W. Pols**

*Department of Electrical & Electronic Engineering, University of Pretoria, Pretoria 0002, South Africa. liesbeth.botha@ee.up.ac.za. **Institute of Phonetic Sciences, University of Amsterdam, Herengracht 338, 1016 CG Amsterdam, The Netherlands. pols@fon.let.uva.nl

Volume 2 pages 1035 - 1038

ABSTRACT

The acoustic differences between Afrikaans and South African English, spoken as first (L1) and second (L2) languages, are measured for nine short vowels. The spoken-language database of 22 male speakers, collected for comparative studies, is described. The features used in an initial comparison of the isolated vowels and vowels in CVC words are the first three formant values and their ratios. Significant differences are found between the production of /e/ and /y/ by Afrikaans and English mother-tongue speakers, and to a lesser extent for /i/, /c/, and /u/. Several interesting trends that seem to contradict popular beliefs concerning South African accents are observed. Directions for future research and the application of the envisioned L1-L2 model in speech technology are given.

A0632.pdf



LARYNGEAL MOVEMENTS AND SPEECH RATE: An X-ray Investigation

Authors: Beatrice Vaxelaire & Rudolph Sock

Institut de Phonétique, Université de Strasbourg II, 22 rue Descartes, 67084 Strasbourg Cedex, France. E-mail: vaxelair@ushs.u-strasbg.fr

Volume 2 pages 1039 - 1042

ABSTRACT

This is an investigation of the production of VCV sequences, with special emphasis on the displacement of the larynx-hyoid bone unit. X-ray data obtained from two subjects at two speaking rates show that there is a positive correlation between the displacement of the larynx and that of the hyoid bone; that larynx position is lower for high vowels than for their lower counterparts; and that the larynx adopts a higher initial position in fast speech. Varying speech rate makes it possible to uncover robust laryngeal trajectories underlying the production of these VCV sequences.

A0754.pdf



HOW FLEXIBLE IS THE HUMAN VOICE? A CASE STUDY OF MIMICRY

Authors: Anders Eriksson and Pär Wretling

Department of Phonetics Umeå University, S-901 87 Umeå, Sweden E-mail: anderse@ling.umu.se and wretling@ling.umu.se

Volume 2 pages 1043 - 1046

ABSTRACT

The investigation presented here is a case study of mimicry in which a professional impersonation artist imitated three well-known Swedish public figures. The speech material consisted of recordings taped from radio/TV shows, imitations in which the artist tried to mimic these speeches as closely as possible, and the same material recorded with the artist using his own natural voice. The aim of the study was to investigate how closely the imitations matched selected acoustic parameters of the original recordings. The artist was able to mimic global speech rate very closely, but timing at the segmental level showed little or no change in the direction of the targets. Mean fundamental frequency and its variation matched the targets very closely. Target formant frequencies were attained with varying success: for two of the three target voices, the vowel space of the imitation was intermediate between that of the artist's own voice and the target; in the third case there was no apparent reduction in distance. For individual vowels it was generally, but not always, the case that the formant frequencies of the mimicked vowels were closer to the original than those of the artist's own voice.
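
The vowel-space comparison can be quantified as mean distance in the F1-F2 plane; a minimal sketch with invented formant values (not the study's data):

    import numpy as np

    def vowel_space_distance(space_a, space_b):
        """Mean Euclidean F1-F2 distance over the vowels shared by two voices."""
        shared = space_a.keys() & space_b.keys()
        return np.mean([np.hypot(space_a[v][0] - space_b[v][0],
                                 space_a[v][1] - space_b[v][1]) for v in shared])

    target    = {"i": (280, 2250), "a": (700, 1150), "u": (320, 700)}  # invented Hz
    imitation = {"i": (300, 2180), "a": (680, 1200), "u": (330, 750)}
    own       = {"i": (350, 2050), "a": (640, 1300), "u": (360, 850)}
    print(vowel_space_distance(imitation, target))   # smaller than ...
    print(vowel_space_distance(own, target))         # ... this, when mimicry helps

An "intermediate" vowel space, as found for two of the three targets, corresponds to the imitation-to-target distance falling below the own-voice-to-target distance.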

A1008.pdf



THE EFFECT OF LOW-PASS FILTERING ON ESTIMATED VOICE SOURCE PARAMETERS

Authors: Helmer Strik

University of Nijmegen, Dept. of Language and Speech P.O. Box 9103, 6500 HD Nijmegen, The Netherlands strik@let.kun.nl, http://lands.let.kun.nl/TSpublic/strik

Volume 2 pages 1047 - 1050

ABSTRACT

Voice source parameters are often obtained by parametrizing glottal flow signals. However, before parametrization these glottal flow signals are usually low-pass filtered. As low-pass filtering changes the shape of the glottal pulses, it also causes an error in the estimated voice source parameters. This article presents the results of our research on that effect. We first present an evaluation method which makes it possible to study the effect of low-pass filtering in detail. The evaluation results show that low-pass filtering leads to an error in all estimated voice source parameters. However, the magnitude of the error differs across voice source parameters and also depends on the estimation method used. We show that the errors can be reduced substantially by choosing the appropriate estimation method.
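
The evaluation logic can be reproduced in outline: synthesize a pulse whose parameters are known, low-pass it, re-measure, and compare. A sketch with a simple triangular pulse train and one parameter, the peak amplitude (not the authors' glottal model or parameter set):

    import numpy as np
    from scipy.signal import butter, filtfilt

    fs = 16000
    period = fs // 100                               # 100 Hz voice
    rise, fall = period // 2, period - period // 2
    pulse = np.r_[np.linspace(0, 1, rise), np.linspace(1, 0, fall)]
    flow = np.tile(pulse, 20)                        # synthetic "glottal flow"

    true_peak = flow.max()
    b, a = butter(4, 900 / (fs / 2), btype="low")    # example 900 Hz low-pass
    filtered_peak = filtfilt(b, a, flow).max()
    print(f"peak error due to filtering: "
          f"{100 * (filtered_peak - true_peak) / true_peak:+.2f}%")

Repeating this for each voice source parameter and each estimation method gives the kind of error comparison reported above.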

A1023.pdf



VOWEL DEVELOPMENT OF /i/ AND /u/ IN 15-36 MONTH OLD CHILDREN AT RISK AND NOT AT RISK TO STUTTER

Authors: Susan M. Fosnot

University of California at Los Angeles, Department of Linguistics, Phonetics Lab, 4404 San Blas Avenue, Woodland Hills, CA 91364, USA. Tel. & Fax (818) 884-9110, E-mail: fosnot@HUMnet.ucla.edu

Volume 2 pages 1051 - 1054

ABSTRACT

A study was designed to compare the high front vowel /i/ and the high back vowel /u/ in children at risk and not at risk of stuttering. Recordings were made of children playing with their parents for 10 minutes, between 15 and 36 months of age. Anatomical and linguistic influences did not differ across subjects, with the exception of the 24-month period: at-risk children were slightly taller at 24 months. Spontaneous utterances from each child were digitized with a CSL Model 4300. The F1 and F2 of the steady-state portion of each /i/ and /u/ vowel were measured. Not-at-risk children showed values typical of normally developing children. Repeated-measures ANOVAs showed that children at risk of stuttering had significantly higher F1 values for both /i/ and /u/, suggesting that tongue height is lower than it should be for these high vowels. F2 values for both /i/ and /u/ were also significantly higher, reflecting a more forward tongue position for both the front and the back vowels in at-risk children.
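
F1/F2 of a steady-state vowel portion are typically estimated by LPC root-solving; a generic textbook sketch (not necessarily the CSL's internal method):

    import numpy as np

    def lpc_formants(frame, fs, order=12):
        """Estimate formant frequencies (Hz) from one vowel frame."""
        x = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])   # pre-emphasis
        x = x * np.hamming(len(x))
        # Autocorrelation method: solve the normal equations for LPC coefficients.
        r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R, r[1:order + 1])
        roots = np.roots(np.r_[1, -a])
        roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
        freqs = np.angle(roots) * fs / (2 * np.pi)
        return sorted(f for f in freqs if 90 < f < fs / 2 - 90)

The first two returned frequencies approximate F1 and F2; a higher F1 maps to a lower tongue position and a higher F2 to a more forward one, which is how the formant results above translate into articulatory terms.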

A1043.pdf



Optopalatograph: Development of a device for measuring tongue movement in 3D

Authors: A. Wrench, A. McIntosh and W. Hardcastle

Dept. of Speech and Language Sciences, Queen Margaret College, Clerwood Terrace, Edinburgh EH12 8TS, UK. Tel. +44 131 317 3692, Fax: +44 131 317 3689, E-mail: a.wrench@sls.qmced.ac.uk, WWW: http://sls.qmced.ac.uk/

Volume 2 pages 1055 - 1058

ABSTRACT

This paper identifies and investigates potential sources of measurement error in a prototype device for measuring tongue-palate distance, contact, and pressure across the whole of the hard palate. The Optopalatograph (OPG) is similar in principle to the Glossometer and similar in configuration to the Electropalatograph. It uses optical fibres to relay light to and from the palate; distance sensing is achieved by measuring the amount of light reflected from the surface of the tongue. A high-power halogen light source is currently used to compensate for light attenuation and losses in the system. This source is not readily switchable, so we evaluate the error in the measured light intensity when all the sources are on simultaneously. We conclude that the halogen-based OPG is a practical device, with a worst-case error of 10% in estimated distances below 5 mm.
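
The distance-sensing principle implies a per-fibre calibration curve relating reflected intensity to distance, and an intensity error can be propagated through such a curve. A hedged sketch with an invented calibration law (the real OPG calibration is empirical):

    import numpy as np

    I0, D0 = 1.0, 3.0      # invented calibration constants (distance in mm)

    def intensity(d):
        """Assumed calibration: reflected intensity falls off with distance."""
        return I0 / (1.0 + d / D0) ** 2

    def distance(i):
        """Invert the assumed calibration to recover distance in mm."""
        return D0 * (np.sqrt(I0 / i) - 1.0)

    d_true = 4.0
    print(distance(intensity(d_true)))          # 4.0 mm, exact round trip
    print(distance(intensity(d_true) * 1.10))   # ~3.67 mm under a 10% intensity error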

A1061.pdf



SPEECH SYNTHESIS AND PROSODY MODIFICATION USING SEGMENTATION AND MODELING OF THE EXCITATION SIGNAL

Authors: J.M. Gutiérrez Arriola, F.M. Giménez de los Galanes, M.H. Savoji, J.M. Pardo

Grupo de Tecnología del Habla, Departamento de Ingeniería Electrónica, E.T.S.I. Telecomunicación, Universidad Politécnica de Madrid Ciudad Universitaria. 28040, Madrid. Spain

Volume 2 pages 1059 - 1062

ABSTRACT

In previous work we presented a new method for improving the quality of LPC synthetic speech, in which the excitation signal was modelled by a polynomial function followed by an adaptive filter. This scheme provides the properties of mathematical models, making it possible to avoid the problems related to prosody control [1], [2]. In order to reduce storage needs, a segmentation technique was developed which groups several pitch periods together on the basis of spectral similarity; for every segment the same coefficient set (both the polynomial function and the post-processing filter) is used. These techniques were applied to a coding/decoding task, where the resulting speech quality was promising [1], [2]. In this paper we present results on prosodic modification, i.e. arbitrary changes of duration and fundamental frequency, which show the suitability of these methods for text-to-speech applications. We also present results of extending the model to unvoiced segments of speech.
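
A sketch of the per-segment polynomial fit: assume the LPC residual has been cut into pitch periods and spectrally similar periods grouped, so that each group is represented by one coefficient set (the degree and grouping below are placeholders; the adaptive post-filter is omitted):

    import numpy as np

    def fit_excitation_segment(periods, degree=8):
        """Fit one polynomial to a group of pitch periods of the LPC residual."""
        # Time-normalize every period to [0, 1] so one polynomial serves them all.
        t = np.concatenate([np.linspace(0.0, 1.0, len(p)) for p in periods])
        y = np.concatenate([np.asarray(p, dtype=float) for p in periods])
        return np.polyfit(t, y, degree)

    def synthesize_period(coeffs, n_samples):
        """Regenerate one excitation period at an arbitrary length."""
        return np.polyval(coeffs, np.linspace(0.0, 1.0, n_samples))

Because the model is a function of normalized time, resynthesizing with a different n_samples changes the fundamental frequency directly, which is what makes such a scheme attractive for prosodic modification.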

A1093.pdf



How can the control of the vocal tract limit the speaker's capability to produce the ultimate perceptive objectives of speech?

Authors: Christophe Savariaux, Louis-Jean Boë & Pascal Perrier

Institut de la Communication Parlée, UPRESA CNRS 5009, INPG & Université Stendhal, 46 Avenue Félix Viallet, F-38031 Grenoble Cédex 01, France. savario, boe, perrier@cristal.icp.grenet.fr

Volume 2 pages 1063 - 1066

ABSTRACT

In this paper an extension of the lip-tube experiment proposed by Savariaux et al. (1990) is presented and analyzed. The question underlying the design of this experiment is whether speakers are able to produce an [u] with a large lip opening. Nine native speakers of French repeated the original experiment and were then asked to produce the vowel [u] starting from an [o] vocal tract configuration. More subjects achieved the compensation when they shifted their articulation from [o] to [u]. The issue of a possible constraint imposed by a learned standard articulatory pattern is discussed in relation to the notion of an internal representation of the articulatory-to-acoustic relations. Proposals in favor of a standard pattern for [u] that would be velopalatal rather than velopharyngeal are discussed.

A1107.pdf



A STEP TOWARD A GENERAL MODEL FOR SYMBOLIC DESCRIPTION OF THE SPEECH SIGNAL

Authors: Dr Goran S. Jovanovic

Institute for Applied Mathematics and Electronics, Mathematical Institute of the Serbian Academy of Arts and Sciences, Kneza Miloša 37, 11000 Beograd, SR Yugoslavia. Fax: +381 11 186105, E-mail: jovanovicg@buef31.etf.bg.ac.yu

Volume 2 pages 1067 - 1070

ABSTRACT

The paper presents an improved and extended version of a previously defined general model for symbolic description of the speech signal. In the first part of the paper we formally define symbolic description segments that correspond to the lower speech coding levels (word and subword segments of the speech signal). In the second part we analyse the practical applicability of the proposed model. Experimental evidence confirmed that one way to develop an automatic procedure for symbolic description of the speech signal is through IFC-guided speech signal processing, which provides a specific, focusing structural analysis. We believe the presented experimental results are encouraging for new research, especially in the fields of automatic speech recognition and efficient speech coding.

A1177.pdf



Referring in Long-Term Speech by Using Orientation Patterns Obtained from the Vector Field of the Spectrum Pattern

Authors: Kiyoshi Furukawa, Masayuki Nakazawa, Takashi Endo and Ryuichi Oka

Tsukuba Research Center, Real World Computing Partnership, Tsukuba Mitsui Building 13F, 1-6-1 Takezono, Tsukuba-shi, 305 Ibaraki, Japan. Tel: +81-298-53-1660, Fax: +81-298-53-1740, E-mail: furu@rwcp.or.jp

Volume 2 pages 1071 - 1074

ABSTRACT

We propose a new representation of speech features, called orientation patterns, which retains high detection ability under averaging in the time domain. This allows the number of frames in the reference and input patterns of the DP matching algorithm to be reduced, and with it the computational load. Using this representation we constructed a long-term speech retrieval system. Its base matching algorithm is RIFCDP, proposed previously, which spots similar intervals between an arbitrary reference pattern and an arbitrary input pattern sequence, synchronously with the input frames.
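
RIFCDP itself spots similar intervals frame-synchronously; as a simpler stand-in, plain DP matching (DTW) between two feature sequences shows where the frame reduction pays off (generic DTW, not RIFCDP):

    import numpy as np

    def dtw_distance(ref, inp):
        """Length-normalized DP alignment cost between two feature sequences."""
        n, m = len(ref), len(inp)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(np.asarray(ref[i - 1]) - np.asarray(inp[j - 1]))
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m] / (n + m)

Since the trellis has len(ref) * len(inp) cells, halving the frame count of both patterns cuts the matching work roughly fourfold, which is the computational benefit claimed for the orientation-pattern averaging.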

A1252.pdf
