Session ThMB: Speech Perception

Chairperson: Paul Taylor, Univ. of Edinburgh, UK



PRELIMINARY EXPERIMENTS ON THE PERCEPTION OF DOUBLE SEMIVOWELS

Authors: W.A.Ainsworth and G.F.Meyer

Centre for Human and Machine Perception Research, Department of Communication and Neuroscience, Keele University, Keele, Staffordshire ST5 5BG, England. E-mail: w.a.ainsworth@keele.ac.uk

Volume 4 pages 2115 - 2118

ABSTRACT

A number of previous studies have shown that it is possible to recognise two vowel sounds spoken simultaneously if their pitches, onsets or the spatial locations of their sources differ sufficiently. Experiments have been carried out to explore the perception of isolated syllables containing the glides /w/ and /j/ spoken at the same time. It was found that consonants in concurrent syllables which contained the same vowel but were spoken with different pitches could not be reliably identified. However, if the vowels and their pitches differed, the glides could be recognised about 70% of the time. This suggests that the neuronal mechanisms underlying the separation of simultaneous consonants employ other features as well as pitch differences.

A0021.pdf



DOES SYLLABLE FREQUENCY AFFECT PRODUCTION TIME IN A DELAYED NAMING TASK?

Authors: Niels O. Schiller

Max Planck Institute for Psycholinguistics P. O. Box 310, 6500 AH Nijmegen, The Netherlands Tel. +31-24-3521911, FAX: +31-24-3521213, e-mail: schiller@mpi.nl

Volume 4 pages 2119 - 2122

ABSTRACT

In a delayed naming task the effect of syllable frequency on the production time of syllables was investigated. Participants first heard either a low- or a high-frequency syllable and were then asked to repeat this syllable as often as they could for a time span of eight seconds. Mean production times per syllable were determined. When the segmental make-up of high- and low-frequency syllables was completely matched, there was no frequency effect on production time. It is concluded that syllable frequency does not play a role at the articulatory-motor level in speech production.

A0048.pdf



HUMAN AND MACHINE IDENTIFICATION OF CONSONANTAL PLACE OF ARTICULATION FROM VOCALIC TRANSITION SEGMENTS

Authors: Andrew C Morris*, Gerrit Bloothooft**, William J Barry***, Bistra Andreeva***, Jacques Koreman***

* Speech and Hearing Research Group, Sheffield University, UK, Tel. +44 (0)114 222 1907, E-mail: a.morris@dcs.shef.ac.uk ** Utrecht Institute of Linguistics OTS, Holland, Tel. +31.30.2536042, E-mail: gerrit.bloothooft@let.ruu.nl *** Institute of Phonetics, Universität des Saarlandes, Saarbrücken, Germany, Tel. +49(0)681.3024500, E-mail: wbarry@coli.uni-sb.de

Volume 4 pages 2123 - 2126

ABSTRACT

One of the most difficult problems in the first stages of automatic speech recognition (ASR) is the identification of consonantal place of articulation (CPA). It is known that the acoustic correlates for CPA reside largely in the pattern of formant transitions preceding vocal tract closure and following release, but common speech preprocessing techniques make only a limited attempt to capture these spectral dynamics in the representation which they pass on for recognition. In order to test alternative preprocessing strategies, we have prepared a multilingual set of VC and CV vocalic transition segments and then compared the baseline performance of human perception of CPA in this dataset with the performance of two common ASR techniques. Representations initially tested were concatenated mel cepstra and mel cepstra plus cepstral differences.
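The second representation tested (static mel cepstra concatenated with cepstral differences) can be sketched as follows. This is a minimal illustration of the standard regression-based delta computation; the window width and cepstral order are assumptions for the example, not values taken from the paper.

```python
import numpy as np

def delta(cepstra, N=2):
    """Regression-based delta coefficients over a window of +/-N frames.

    cepstra: (num_frames, num_ceps) array of mel-cepstral vectors.
    Returns an array of the same shape holding frame-to-frame differences.
    """
    num_frames = len(cepstra)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Pad the edges so every frame has N neighbours on each side.
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    deltas = np.empty_like(cepstra, dtype=float)
    for t in range(num_frames):
        deltas[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)
        ) / denom
    return deltas

# Concatenate static cepstra with their deltas, as in the second
# representation described above.
mel_ceps = np.random.randn(100, 13)          # stand-in for real mel cepstra
features = np.hstack([mel_ceps, delta(mel_ceps)])
print(features.shape)  # (100, 26)
```

A constant cepstral trajectory yields all-zero deltas, which is a quick sanity check on the regression formula.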

A0236.pdf



MODELLING THE RECOGNITION OF SPECTRALLY REDUCED SPEECH

Authors: Jon Barker and Martin Cooke

{j.barker,m.cooke}@dcs.shef.ac.uk Department of Computer Science, University of Sheffield, Sheffield, UK

Volume 4 pages 2127 - 2130

ABSTRACT

Progress in robust automatic speech recognition may benefit from a fuller account of the mechanisms and representations used by listeners in processing distorted speech. This paper reports on a number of studies which consider how recognisers trained on clean speech can be adapted to cope with a particular form of spectral distortion, namely reduction of clean speech to sine-wave replicas. Using the Resource Management corpus, the first set of recognition experiments confirms the high information content of sine-wave replicas by demonstrating that such tokens can be recognised at levels approaching those for natural speech if matched conditions apply during training. Further recognition tests show that sine-wave speech can be recognised using natural speech models if a spectral peak representation is employed in concert with occluded speech recognition techniques.
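Sine-wave replicas of the kind used here are conventionally built by replacing the speech signal with one sinusoid per formant track. A minimal sketch, assuming linearly interpolated frame-rate tracks; the formant values, frame rate, and amplitudes below are invented for illustration:

```python
import numpy as np

def sine_wave_speech(formant_tracks, amps, frame_rate=100, sr=16000):
    """Synthesize one sinusoid per formant track.

    formant_tracks: (num_frames, K) formant frequencies in Hz.
    amps: (num_frames, K) linear amplitudes for each sinusoid.
    """
    num_frames, num_tracks = formant_tracks.shape
    hop = sr // frame_rate
    n = num_frames * hop
    t_frames = np.arange(num_frames) * hop
    t = np.arange(n)
    out = np.zeros(n)
    for k in range(num_tracks):
        # Upsample frame-rate tracks to the sample rate.
        freq = np.interp(t, t_frames, formant_tracks[:, k])
        amp = np.interp(t, t_frames, amps[:, k])
        # Integrate instantaneous frequency to get a continuous phase.
        phase = 2 * np.pi * np.cumsum(freq) / sr
        out += amp * np.sin(phase)
    return out

# Toy example: a steady /i/-like vowel (F1=300, F2=2200, F3=3000 Hz).
tracks = np.tile([300.0, 2200.0, 3000.0], (50, 1))
amps = np.tile([1.0, 0.5, 0.25], (50, 1))
audio = sine_wave_speech(tracks, amps)
print(audio.shape)  # (8000,)
```

With time-varying tracks extracted from real speech, the same routine produces the familiar whistling sine-wave-speech percept.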

A0238.pdf



PROSODIC STRUCTURE AND PHONETIC PROCESSING: A CROSS-LINGUISTIC STUDY

Authors: Christophe Pallier (1),(2), Anne Cutler (1) and Núria Sebastián-Gallés (2)

(1) Max-Planck-Institute for Psycholinguistics, Nijmegen, The Netherlands. (2) Departament de Psicologia Bàsica, Universitat de Barcelona, Spain. E-mail : pallier@lscp.ehess.fr

Volume 4 pages 2131 - 2134

ABSTRACT

Dutch and Spanish differ in how predictable the stress pattern is as a function of the segmental content: it is correlated with syllable weight in Dutch but not in Spanish. In the present study, two experiments were run to compare the abilities of Dutch and Spanish speakers to separately process segmental and stress information. It was predicted that the Spanish speakers would have more difficulty focusing on the segments and ignoring the stress pattern than the Dutch speakers. The task was a speeded classification task on CVCV syllables, with blocks of trials in which the stress pattern could vary versus blocks in which it was fixed. First, we found interference due to stress variability in both languages, suggesting that the processing of segmental information cannot be performed independently of stress. Second, the effect was larger for Spanish than for Dutch, suggesting that the degree of interference from stress variation may be partially mitigated by the predictability of stress placement in the language.

A0288.pdf



THE CORRELATION BETWEEN CONSONANT IDENTIFICATION AND THE AMOUNT OF ACOUSTIC CONSONANT REDUCTION

Authors: R.J.J.H. van Son & Louis C. W. Pols

Institute for Phonetic Sciences / IFOTT, University of Amsterdam, Herengracht 338, NL-1016CG Amsterdam, The Netherlands, E-mail: {rob, pols}@fon.let.uva.nl

Volume 4 pages 2135 - 2138

ABSTRACT

Reduction causes changes in the acoustics of consonant realizations that affect their identification. In this study we try to identify some of the acoustic parameters that are correlated with this change in identification. Speaking style is used to manipulate the degree of reduction. Pairs of otherwise identical intervocalic consonants from read and spontaneous utterances are presented to subjects in an identification experiment. The resulting identification scores are correlated with five different acoustical measures that are affected by the amount of consonant reduction: segmental duration, spectral Center of Gravity, intervocalic sound energy difference, intervocalic F2 slope difference, and the amount of vowel reduction in the syllable kernel. The identification differences between the read and spontaneous realizations are compared with the differences in each of the acoustic measures. Results showed that only segmental duration and the spectral Center of Gravity were significantly correlated with identification scores.
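Of the five measures, the spectral Center of Gravity is the amplitude-weighted mean frequency of the magnitude spectrum. A minimal sketch of the standard computation; the FFT framing and linear-amplitude weighting are assumptions for illustration, not the authors' exact procedure:

```python
import numpy as np

def spectral_center_of_gravity(signal, sr):
    """Amplitude-weighted mean frequency of the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

# A pure 1 kHz tone should have its centre of gravity at ~1000 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)
print(round(spectral_center_of_gravity(tone, sr)))  # 1000
```

For consonant segments, a lower Center of Gravity typically signals a weaker, more reduced frication or burst spectrum.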

A0301.pdf



RELEVANT SPECTRAL INFORMATION FOR THE IDENTIFICATION OF VOWEL FEATURES FROM BURSTS

Authors: A. Bonneau

CRIN–CNRS & INRIA Lorraine, Bâtiment LORIA, BP 239, 54506 Vandœuvre-lès-Nancy, FRANCE. Tel. (33) 3 83 59 20 80, FAX: (33) 3 83 41 30 79, E-mail: bonneau@loria.fr

Volume 4 pages 2139 - 2142

ABSTRACT

This paper presents a first attempt to extract relevant spectral information for the identification of the following vocalic context from French stop bursts. For this purpose, we studied the acoustic spectra of bursts used in a perceptual experiment which showed that listeners were able to identify vocalic features from bursts [1]. The corpus was made up of stimuli of 20-25 ms duration extracted from natural monosyllabic words which combined the initial stops /p,t,k/ with the vowels /i,a,u/. The low-frequency limit of the frication noise as well as the frequency of the most prominent peak of the burst appeared to be very useful cues for the identification of the vocalic context. Using these cues, most contexts (/i/ from /t,k/, /u/ from /p,k/ and /a/ from /p,t/) were classified very well without specification of the consonant.

A0468.pdf



PERCEPTUAL STUDY OF INTERSYLLABIC FORMANT TRANSITIONS IN SYNTHESIZED V1-V2 IN STANDARD CHINESE

Authors: Li, Aijun

Institute of Linguistics, Chinese Academy of Social Sciences, 5 JianNeiDaJie, 100732, Beijing, P.R.China. Email: linmc@sun.ihep.ac.cn Tel: 086-010-65237408

Volume 4 pages 2143 - 2146

ABSTRACT

Transition between vowels is related to speech continuity[8]. Research shows that the formant intensity between syllables varies in Standard Chinese (SC)[5]. We can classify the intensity of intersyllabic formant juncture/transition into three categories from strong to weak by using the consonant of the second syllable. Different categories play different roles in synthesizing speech: the more intense the formant transition, the more important its role in the synthesis. This paper reports the results of perceptual experiments on the intersyllabic formant transitions of one of these categories, in which the second syllable is a zero-initial syllable (i.e. begins with a vowel).

A0475.pdf



Role of perception of rhythmically organized speech in consolidation process of long-term memory traces (LTM-traces) and in speech production controlling

Authors: Skljarov O.P.

Research Institute of Otolaryngology and Speech Pathology 198013, Bronnitskaja, 9, St.-Petersburg, RUSSIA e-mail: vigarb@thewall.ioffe.rssi.ru telephone: +110 1010

Volume 4 pages 2147 - 2150

ABSTRACT

In this paper, proceeding from the logistic-type equations established in comparative research on the speech of normal speakers and stutterers [1,2,3], hypotheses are offered about the discrete character of the perception of a speech signal, the occurrence of a homeostatic state of the speech reproduction system (speech memory), and ways of revealing such memory. These hypotheses are supported by the available experimental data.

A0482.pdf



SEQUENTIAL PROBABILITIES AS A CUE FOR SEGMENTATION

Authors: Arie H. van der Lugt

Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525 XD Nijmegen, The Netherlands Tel. +31243521911, FAX +31243521213, E-mail: vdlugt@mpi.nl

Volume 4 pages 2151 - 2154

ABSTRACT

A large amount of psycholinguistic research, phonetic research and research in speech technology has been dedicated to the problem of segmentation: how is speech segmented into words? The work reported here extends earlier findings by McQueen & Cox ([1]), who found that phonotactics are used by listeners as a cue to the location of word boundaries. The present investigation addresses the question of whether people can also use less extreme sequential probabilities as a segmentation cue. Hearing a combination of sounds that often occurs at the end of a word or syllable may facilitate recognition of a following word; hearing a combination of sounds that occurs often at the beginning of a word or syllable may facilitate recognition of a preceding word. In a word-spotting task some indications were found that people are sensitive to sequential probabilities. However, no effects were found that strongly support the hypothesis that people do indeed use these distributional properties of the lexicon in the segmentation of spoken language.
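Sequential probabilities of the kind at issue can be estimated from a lexicon by counting how often each phone bigram occurs within words versus across word boundaries. A toy sketch under that assumption; the three-word corpus is invented and is not the study's materials:

```python
from collections import Counter

def boundary_bigram_probs(corpus):
    """Estimate P(word boundary | phone bigram) from running speech.

    corpus: list of words in running order, each a list of phone symbols.
    """
    within = Counter()
    across = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            within[(a, b)] += 1
    # Bigrams straddling every adjacent word pair in running order.
    for w1, w2 in zip(corpus, corpus[1:]):
        across[(w1[-1], w2[0])] += 1
    probs = {}
    for bigram in set(within) | set(across):
        a, w = across[bigram], within[bigram]
        probs[bigram] = a / (a + w)
    return probs

corpus = [list("kat"), list("tak"), list("kat")]
probs = boundary_bigram_probs(corpus)
# ('t','t') only ever occurs across a word boundary in this toy corpus:
print(probs[('t', 't')])  # 1.0
```

A bigram with a boundary probability near 1 is a strong cue that a word edge falls between its two phones; one near 0 argues against a boundary there.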

A0495.pdf



Perception and acoustics of emotions in singing

Authors: Susan Jansens, Gerrit Bloothooft, and Guus de Krom

Computer and Humanities Department / Utrecht Institute of Linguistics-OTS University of Utrecht, Trans 10, 3512 JK Utrecht, the Netherlands Tel: + 31 30 2536059, Fax: + 31 30 2536000, E-mail: Guus.deKrom@let.ruu.nl

Volume 4 pages 2155 - 2158

ABSTRACT

In this experiment, the acoustic correlates of perceived emotions in singing were investigated. Singers were instructed to sing one phrase in a neutral way and in the emotions anger, joy, fear, and sadness. Listeners rated the strength of the perceived emotions for each fragment. Principal component analyses were performed on the listeners' ratings. The derived factors were interpreted as listening strategies; and a listener's factor loading as an indicator of the extent to which that listener used that strategy. Using the original ratings and the factor loadings, the phrases were assigned composite ratings for each emotion. Acoustic measures of spectral balance, vibrato, duration and intensity were related to the composite ratings using multiple regression analyses. It was found that anger was associated with the presence of vibrato; joyous phrases had vibrato, a short final duration, and a shallow spectral slope; sadness was associated with absence of vibrato, long duration, and a low intensity, whereas fear was related to a steep spectral slope.
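The factor-analytic step described above (principal components of a listeners-by-stimuli rating matrix, with listener loadings read off the components) can be sketched via an SVD. The matrix dimensions, random ratings, and number of factors are illustrative assumptions:

```python
import numpy as np

def listener_pca(ratings, num_factors=2):
    """PCA of a listeners-by-stimuli rating matrix via SVD.

    ratings: (num_listeners, num_stimuli) array.
    Returns (loadings, scores): each listener's loading on the derived
    factors, and the factor scores per stimulus.
    """
    centered = ratings - ratings.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = u[:, :num_factors] * s[:num_factors]   # listener loadings
    scores = vt[:num_factors]                         # per-stimulus factors
    return loadings, scores

rng = np.random.default_rng(0)
ratings = rng.random((20, 15))     # 20 listeners rating 15 sung phrases
loadings, scores = listener_pca(ratings)
print(loadings.shape, scores.shape)  # (20, 2) (2, 15)
```

Interpreting each retained component as a listening strategy, a listener's loading then indexes how much that listener relied on the strategy, as in the abstract.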

A0496.pdf



PHONEMES AND SYLLABLES IN SPEECH PERCEPTION: SIZE OF ATTENTIONAL FOCUS IN FRENCH

Authors: Christophe Pallier

Max-Planck Institute for Psycholinguistics, Nijmegen, The Netherlands & Laboratoire de Sciences Cognitives et Psycholinguistique, EHESS-CNRS, Paris, France E-mail: pallier@lscp.ehess.fr

Volume 4 pages 2159 - 2162

ABSTRACT

A study by Pitt and Samuel (1990) found that English speakers could narrowly focus attention onto a precise phonemic position inside spoken words [1]. This led the authors to argue that the phoneme, rather than the syllable, is the primary unit of speech perception. Other evidence, obtained with a syllable detection paradigm, has been put forward to propose that the syllable is the unit of perception; yet, these experiments were run with French speakers [2]. In the present study, we adapted Pitt & Samuel's phoneme detection experiment to French and found that French subjects behave exactly like English subjects: they too can focus attention on a precise phoneme. To explain both this result and the established sensitivity to syllabic structure, we propose that the perceptual system automatically parses the speech signal into a syllabically-structured phonological representation.

A0602.pdf



Quality of a vowel with formant undershoot: a preliminary perceptual study

Authors: Shinichi TOKUMA

Phonetics Laboratory, Sophia University 7-1, Kioi-cho, Chiyoda Tokyo 102 Japan e-mail: s-tokuma@hoffman.cc.sophia.ac.jp

Volume 4 pages 2163 - 2166

ABSTRACT

In this study vowels in /CVC/ environments are compared with steady state vowels to investigate the perceived vowel quality change caused by undershoot. This study uses a perceptual task, whereby listeners match constant /CVC/ stimuli of /bVb/ or /dVd/ to variable /#V#/ stimuli, using a schematic grid on a PC screen. The grid represents an acoustic vowel diagram, and the subjects change the F1/F2 frequencies of /#V#/ by moving a mouse. The main results of the study show that while subjects referred to the trajectory peak of the /CVC/ stimuli in vowel quality perception, their performance was also affected by the formant trajectory range of the stimuli. When the formant trajectory range was small, they selected a value between the edge and peak frequencies, while they selected a value outside the trajectory range when it was large.

A0621.pdf



SEGMENTAL AND SUPRASEGMENTAL CONTRIBUTIONS TO SPOKEN-WORD RECOGNITION IN DUTCH

Authors: Mariëtte Koster and Anne Cutler

Max Planck Institute for Psycholinguistics Postbus 310, 6500 AH Nijmegen, The Netherlands Tel +31 24 352 1911; Email anne.cutler@mpi.nl

Volume 4 pages 2167 - 2170

ABSTRACT

Words can be distinguished by segmental differences or by suprasegmental differences or both. Studies from English suggest that suprasegmentals play little role in human spoken-word recognition; English stress, however, is nearly always unambiguously coded in segmental structure (vowel quality); this relationship is less close in Dutch. The present study directly compared the effects of segmental and suprasegmental mispronunciation on word recognition in Dutch. There was a strong effect of suprasegmental mispronunciation, suggesting that Dutch listeners do exploit suprasegmental information in word recognition. Previous findings indicating that the effects of mis-stressing in Dutch differ with stress position were replicated only when segmental change was involved, suggesting that this is an effect of segmental rather than suprasegmental processing.

A0722.pdf



PERCEPTION OF VOWEL DURATION AND SPECTRAL CHARACTERISTICS IN SWEDISH

Authors: Dawn M. Behne (1) Peter E. Czigler (2) and Kirk P. H. Sullivan (2)

(1) dawn.behne@hf.ntnu.no, Norwegian University of Science and Technology, 7055 Dragvoll, Norway, Tel: +47 73 59 83 09, Fax: +47 73 67 70 (2) czigler@ling.umu.se, kirk@ling.umu.se, Umeå University, S-901 87 Umeå, Sweden, Tel: +46 90 16 63 67, Fax: +46 90 16 63 77

Volume 4 pages 2171 - 2174

ABSTRACT

This project re-examines the perceptual weight of vowel duration and the first two vowel formant frequencies as determinants of phonologically short and long vowels in Swedish. Based on listeners' responses to synthesized sets of materials for [I]-[i:], []-[o:] and [a]-[a:], results indicate that vowel duration is of primary importance for distinguishing [I] from [i:] and [] from [o:], whereas both formant frequencies and vowel duration were found to influence the perception of [a] versus [a:].

A0731.pdf



RELATIVE CONTRIBUTIONS OF NOISE BURST AND VOCALIC TRANSITIONS TO THE PERCEPTUAL IDENTIFICATION OF STOP CONSONANTS

Authors: Adrien Neagu and Gérard Bailly

Institut de la Communication Parlée 46, av. Félix Viallet 38031 Grenoble CEDEX FRANCE e-mail: {neagu,bailly}@icp.grenet.fr

Volume 4 pages 2175 - 2178

ABSTRACT

A set of three perceptual experiments is described. These experiments were designed to provide identification scores on CV sequences for French. Original stimuli were augmented with acoustic "monsters" in which bursts were excised or replaced. The first identification task shows that information carried by vocalic transitions can be overwritten by burst information. The importance of this phenomenon is inversely proportional to vowel aperture. The second experiment shows that these results are almost insensitive to the relative amplitudes of the burst and the vowel. In the third experiment we manipulated the voice onset time (VOT) of the monsters using high quality analysis-resynthesis. Stimuli with a very short VOT were perceived as bilabials, but VOT manipulation did not affect the /t/-/k/ confusions. These experiments argue for a dynamic model of stop identification in which burst and vocalic transitions both contribute and compete in the phonetic decision.

A0787.pdf

Recordings



EFFECT OF SPEAKER FAMILIARITY AND BACKGROUND NOISE ON ACOUSTIC FEATURES USED IN SPEAKER IDENTIFICATION

Authors: Satoshi Kitagawa, Makoto Hashimoto and Norio Higuchi

e-mail: satoshi@itl.atr.co.jp ATR Interpreting Telecommunications Research Labs. 2-2 Hikaridai, Seika-cho, Soraku-gun, 619-02 Kyoto, Japan

Volume 4 pages 2179 - 2182

ABSTRACT

In order to investigate the relationship between human perception in speaker identification and acoustic features (fundamental frequency (f0), spectrum, and duration) under various communication conditions, this paper describes several perception experiments and an approach to predicting the perceptual contribution rate of each feature. Factors taken into account in this paper are: (1) speaker familiarity and (2) background noise. As a result, it is shown that: (1) the perceptual contribution rate increases as the distance of an acoustic feature increases, (2) the spectral contribution rates for familiar speakers are larger than those for unfamiliar speakers, (3) the contribution of f0 tends to increase as the noise increases, and (4) in the case of the same S/N ratio, the contribution of f0 in the computer-room noise environment is larger than in the car noise environment.

A0795.pdf



DYNAMIC VERSUS STATIC SPECIFICATION FOR THE PERCEPTUAL IDENTITY OF A COARTICULATED VOWEL

Authors: Michel Pitermann

Laboratoire Parole et Langage, ESA 6057 CNRS Université de Provence, 29 Ave. Robert Schuman 13621 Aix-en-Provence Cedex, France mpiter@lpl.univ-aix.fr

Volume 4 pages 2183 - 2186

ABSTRACT

This paper presents a perceptual experiment on stimuli synthesized by means of a vocal tract area function model. The purpose was to compare the contribution of dynamic against static information to the identity of a coarticulated vowel. Three sources of information were perceptually analyzed: (i) the vowel nucleus; (ii) the acoustical contrast between the vowel nucleus and the stationary parts of its immediate context; and (iii) the transitions linking the stable parts of the speech signal. The results show that the vocoids were better identified by dynamic information. This backs up the perceptual overshoot model proposed by Lindblom and Studdert-Kennedy (1967). However, this conclusion must be confirmed by further experiments.

A0821.pdf



Asymmetries in Consonant Confusion

Authors: Madelaine Plauché (1), Cristina Delogu (2), and John J. Ohala (1)

(1) University of California at Berkeley, U.S.A, mcp@socrates.berkeley.edu; ohala@cogsci.berkeley.edu (2) Fondazione Ugo Bordoni, cristina@fub.it

Volume 4 pages 2187 - 2190

ABSTRACT

Both historical sound change and laboratory confusion studies show strong asymmetries of consonant confusions. In particular, /ki/ commonly changes to /ti/, and /pi/ to /ti/, but not the reverse. It is hypothesized that such asymmetries arise when two sounds are acoustically similar except for one or more differentiating cues, which are subject to a highly directional perceptual error. This perceptual entropy can be explained as follows: if sound x possesses a cue that y lacks, listeners are more likely to miss this "all-or-none" cue than to introduce it spuriously. /k/ and /t/ before /i/ have similar formant transitions but differ in their burst spectra. /p/ and /t/ before /i/ also have similar formant transitions but differ in the intensity of their bursts. The importance of these differentiating features for listeners' perception was verified in a confusion study. The implications of the inversely related effects of perceptual and physical entropy for phonetic theory and speech technology are discussed.
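Asymmetries of the kind described can be read directly off a confusion matrix by comparing each cell with its transpose. A toy sketch; the counts are invented, merely patterned on the /ki/ to /ti/ direction reported above:

```python
def confusion_asymmetry(confusions, labels):
    """List pairs (x, y) where x was heard as y more often than the reverse.

    confusions[i][j]: count of stimulus labels[i] reported as labels[j].
    Returns (x, y, excess) triples sorted by decreasing excess.
    """
    asymmetries = []
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if i != j and confusions[i][j] > confusions[j][i]:
                asymmetries.append(
                    (labels[i], labels[j], confusions[i][j] - confusions[j][i])
                )
    return sorted(asymmetries, key=lambda row: -row[2])

# Invented counts: rows are the spoken syllable, columns what was reported.
labels = ["ki", "ti", "pi"]
counts = [[80, 15, 5],
          [2, 95, 3],
          [4, 12, 84]]
print(confusion_asymmetry(counts, labels))
# [('ki', 'ti', 13), ('pi', 'ti', 9), ('ki', 'pi', 1)]
```

The largest excesses identify the directional confusions that, on the hypothesis above, correspond to listeners missing an "all-or-none" cue.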

A0946.pdf



Rime and syllabic effects in phonological priming between French spoken words

Authors: Nicolas Dumay 1,2 & Monique Radeau 2,3

1 Laboratoire de Psycholinguistique Expérimentale, FAPSE, Université de Genève, 9 route de Drize, CH-1227 Carouge, Switzerland, Tel: (41) 22 705 97 22, e-mail: ndumay@ulb.ac.be 2 Laboratoire de Psychologie Expérimentale, Université Libre de Bruxelles, 50 avenue F.D. Roosevelt, CP-191, 1050 Brussels, Belgium, Tel: (32) 2 650 25 39, e-mail: moradeau@ulb.ac.be 3 Belgian National Fund for Scientific Research

Volume 4 pages 2191 - 2194

ABSTRACT

Phonological priming between spoken words was examined using CVCVC bisyllabic pseudoword primes and word or pseudoword targets. The influence of different types of overlap was compared, prime and target sharing the coda, the rime or the final syllable. The task was target shadowing. Two priming conditions were used, the auditory targets being preceded by auditory primes in the unimodal situation and by visual primes in the crossmodal situation. Priming effects were obtained under unimodal stimulation only. A strong facilitation occurred with syllable overlap, while a smaller facilitation was found with rime overlap. Coda overlap produced no effect. The absence of effect under crossmodal stimulation argues that the final overlap effects occur before the semantic system. Concerning the underlying units, a comparison of our results with those obtained from CCVC monosyllables with overlaps of similar phonemic length to ours suggests that both rime and syllabic units per se are involved in the effects of final similarity between spoken words.

A1019.pdf

Recordings



ROLES OF STATIC AND DYNAMIC FEATURES OF FORMANT TRAJECTORIES IN THE PERCEPTION OF TALKER INDIVIDUALITY

Authors: Weizhong Zhu and Hideki Kasuya

Kasuya Lab, Faculty of Engineering, Utsunomiya University, 2753 Ishii-machi, Utsunomiya, 3221 Japan. TEL & FAX: +81 28 689 6122, E-mail: zhu@klab.ishii.utsunomiya-u.ac.jp

Volume 4 pages 2195 - 2198

ABSTRACT

Experiments were performed to investigate perceptual contributions of static and dynamic features of vocal tract characteristics to talker individuality. An ARX (Auto-regressive with exogenous input) speech production model was used to extract separately voice source and vocal tract parameters from a Japanese sentence, /aoiueoie/ ("Say blue top" in English). The Discrete Cosine Transform (DCT) was applied to resolve formant trajectories of the speech signal into static and dynamic components. The perceptual contributions were quantitatively studied by systematically replacing the corresponding formant components extracted from Japanese sentences uttered by three males. Results of the experiments show that the static (average) characteristic of the vocal tract is a primary cue to talker individuality.
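The DCT decomposition described above separates a formant trajectory's average from its time course: the 0th coefficient carries the static (average) component, and the higher coefficients carry the dynamic component. A minimal numpy sketch; the trajectory values are invented for illustration:

```python
import numpy as np

def dct_basis(n):
    """Orthonormal DCT-II basis matrix (rows are basis vectors)."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * t + 1) / (2 * n))
    basis[0] *= 1.0 / np.sqrt(n)
    basis[1:] *= np.sqrt(2.0 / n)
    return basis

def split_static_dynamic(trajectory):
    """Split a formant trajectory into its DCT static and dynamic parts.

    The 0th DCT coefficient reconstructs a constant at the trajectory's
    mean (static); the remaining coefficients hold all the time variation
    (dynamic).
    """
    basis = dct_basis(len(trajectory))
    coeffs = basis @ trajectory
    static = coeffs[0] * basis[0]      # constant at the mean value
    dynamic = trajectory - static      # everything that varies in time
    return static, dynamic

# Toy F2 trajectory: a 1500 Hz mean plus a symmetric rising transition.
f2 = 1500.0 + np.linspace(-200.0, 200.0, 40)
static, dynamic = split_static_dynamic(f2)
print(round(static.mean()))  # 1500
```

Swapping the static parts between two talkers' trajectories while keeping the dynamic parts (or vice versa) is the kind of systematic replacement the experiments rely on.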

A1216.pdf
