ABSTRACT
A number of previous studies have shown that it is possible to recognise two vowel sounds spoken simultaneously if their pitches, onsets or the spatial locations of their sources differ sufficiently. Experiments have been carried out to explore the perception of isolated syllables containing the glides /w/ and /j/ spoken at the same time. It was found that consonants in concurrent syllables which contained the same vowel but were spoken with different pitches could not be reliably identified. However, if the vowels and their pitches differed, the glides could be recognised about 70% of the time. This suggests that the neuronal mechanisms underlying the separation of simultaneous consonants employ other features in addition to pitch differences.
ABSTRACT
In a delayed naming task, the effect of syllable frequency on the production time of syllables was investigated. Participants first heard either a low- or a high-frequency syllable and were then asked to repeat this syllable as often as they could for a span of eight seconds. Mean production times per syllable were determined. When the segmental make-up of high- and low-frequency syllables was completely matched, there was no frequency effect on production time. It is concluded that syllable frequency does not play a role at the articulatory-motor level in speech production.
ABSTRACT
One of the most difficult problems in the first stages of automatic speech recognition (ASR) is the identification of consonantal place of articulation (CPA). It is known that the acoustic correlates for CPA reside largely in the pattern of formant transitions preceding vocal tract closure and following release, but common speech preprocessing techniques make only a limited attempt to capture these spectral dynamics in the representation which they pass on for recognition. In order to test alternative preprocessing strategies, we have prepared a multilingual set of VC and CV vocalic transition segments and then compared the baseline performance of human perception of CPA in this dataset with the performance of two common ASR techniques. The representations initially tested were concatenated mel cepstra and mel cepstra plus cepstral differences.
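To make the two baseline representations concrete, the sketch below computes mel cepstra for a short transition segment, once as concatenated frames and once augmented with first-order cepstral differences (deltas). It is a minimal illustration assuming the librosa library; the file name, frame settings and coefficient counts are illustrative choices, not the parameters used in the study.

```python
import numpy as np
import librosa

# Hypothetical VC/CV transition segment; file name and sample rate are assumptions.
y, sr = librosa.load("vc_segment.wav", sr=16000)

# 13 mel-cepstral coefficients per frame (25 ms window, 10 ms hop, assumed).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

# Representation 1: mel cepstra of successive frames concatenated into one vector.
rep_concat = mfcc.T.flatten()

# Representation 2: mel cepstra plus cepstral differences (first-order deltas).
delta = librosa.feature.delta(mfcc, width=3)  # narrow window suits short segments
rep_deltas = np.vstack([mfcc, delta]).T.flatten()
```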
ABSTRACT
Progress in robust automatic speech recognition may benefit from a fuller account of the mechanisms and representations used by listeners in processing distorted speech. This paper reports on a number of studies which consider how recognisers trained on clean speech can be adapted to cope with a particular form of spectral distortion, namely the reduction of clean speech to sine-wave replicas. Using the Resource Management corpus, the first set of recognition experiments confirms the high information content of sine-wave replicas by demonstrating that such tokens can be recognised at levels approaching those for natural speech if matched conditions apply during training. Further recognition tests show that sine-wave speech can be recognised using natural speech models if a spectral peak representation is employed in concert with occluded speech recognition techniques.
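As background, a sine-wave replica replaces each formant track with a single time-varying sinusoid and sums the results, discarding all other spectral detail. The numpy sketch below illustrates that construction under assumed placeholder formant tracks; real replicas are driven by tracks estimated from natural utterances.

```python
import numpy as np

def sine_wave_replica(freq_tracks, amp_tracks, sr, n_samples):
    """Sum one time-varying sinusoid per formant track."""
    out = np.zeros(n_samples)
    x = np.linspace(0.0, 1.0, n_samples)
    for f_track, a_track in zip(freq_tracks, amp_tracks):
        xp = np.linspace(0.0, 1.0, len(f_track))
        f = np.interp(x, xp, f_track)          # frame-rate track -> sample rate
        a = np.interp(x, xp, a_track)
        phase = 2 * np.pi * np.cumsum(f) / sr  # integrate frequency into phase
        out += a * np.sin(phase)
    return out / np.max(np.abs(out))

# Placeholder F1-F3 tracks (Hz) over 100 analysis frames; purely illustrative.
sr = 16000
freqs = [np.linspace(500, 700, 100),
         np.linspace(1500, 1200, 100),
         np.linspace(2500, 2400, 100)]
amps = [np.ones(100)] * 3
replica = sine_wave_replica(freqs, amps, sr, sr)  # one second of signal
```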
ABSTRACT
Dutch and Spanish differ in how predictable the stress pattern is as a function of the segmental content: it is correlated with syllable weight in Dutch but not in Spanish. In the present study, two experiments were run to compare the abilities of Dutch and Spanish speakers to process segmental and stress information separately. It was predicted that the Spanish speakers would have more difficulty focusing on the segments and ignoring the stress pattern than the Dutch speakers. The task was a speeded classification task on CVCV syllables, with blocks of trials in which the stress pattern could vary versus blocks in which it was fixed. First, we found interference due to stress variability in both languages, suggesting that the processing of segmental information cannot be performed independently of stress. Second, the effect was larger for Spanish than for Dutch, suggesting that the degree of interference from stress variation may be partially mitigated by the predictability of stress placement in the language.
ABSTRACT
Reduction causes changes in the acoustics of consonant realizations that affect their identification. In this study we try to identify some of the acoustic parameters that are correlated with this change in identification. Speaking style is used to manipulate the degree of reduction. Pairs of otherwise identical intervocalic consonants from read and spontaneous utterances are presented to subjects in an identification experiment. The resulting identification scores are correlated with five different acoustic measures that are affected by the amount of consonant reduction: segmental duration, spectral Center of Gravity, intervocalic sound energy difference, intervocalic F2 slope difference, and the amount of vowel reduction in the syllable kernel. The identification differences between the read and spontaneous realizations are compared with the differences in each of the acoustic measures. The results show that only segmental duration and the spectral Center of Gravity are significantly correlated with identification scores.
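Of the five measures, the spectral Center of Gravity has a simple closed form: the magnitude-weighted mean frequency of the segment's spectrum. A minimal numpy sketch is given below; the segment array and sample rate are assumed inputs, and magnitude weighting is one common choice (power weighting is another).

```python
import numpy as np

def spectral_center_of_gravity(segment, sr):
    """Magnitude-weighted mean frequency (Hz) of a speech segment."""
    spectrum = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    return np.sum(freqs * spectrum) / np.sum(spectrum)
```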
ABSTRACT
This paper presents a first attempt to extract, from French stop bursts, the spectral information relevant to identifying the following vocalic context. For this purpose, we studied the acoustic spectra of the bursts used in a perceptual experiment which showed that listeners were able to identify vocalic features from bursts [1]. The corpus was made up of stimuli of 20-25 ms duration extracted from natural monosyllabic words which combined the initial stops /p,t,k/ with the vowels /i,a,u/. The low-frequency limit of the frication noise as well as the frequency of the most prominent peak of the burst proved to be effective cues for the identification of the vocalic context. Using these cues, most contexts (/i/ from /t,k/, /u/ from /p,k/ and /a/ from /p,t/) were classified very well without specification of the consonant.
ABSTRACT
Transition between vowels is related to speech continuity [8]. Research shows that formant intensity between syllables varies in Standard Chinese (SC) [5]. The intensity of the intersyllabic formant juncture/transition can be classified into three categories, from strong to weak, according to the consonant of the second syllable. The categories play different roles in speech synthesis: the more intense the formant transition, the more important its role in the synthesis. This paper reports the results of perceptual experiments on intersyllabic formant transitions of one of the categories, in which the second syllable is a zero-initial syllable (i.e., begins with a vowel).
ABSTRACT
In this paper, proceeding from the logistic-type equations established during comparative research on the speech of normal speakers and stutterers [1,2,3], we offer hypotheses about the discrete character of the perception of a speech signal, about the occurrence of a homeostatic state in the speech reproduction system (speech memory), and about ways of revealing such memory. The hypotheses are supported by the available experimental data.
ABSTRACT
A large amount of psycholinguistic research, phonetic research and research in speech technology has been dedicated to the problem of segmentation: how is speech segmented into words? The work reported here extends earlier findings by McQueen & Cox [1], who found that phonotactics are used by listeners as a cue to the location of word boundaries. The present investigation addresses the question of whether people can also use less extreme sequential probabilities as a segmentation cue. Hearing a combination of sounds that often occurs at the end of a word or syllable may facilitate recognition of a following word; hearing a combination of sounds that often occurs at the beginning of a word or syllable may facilitate recognition of a preceding word. In a word-spotting task some indications were found that people are sensitive to sequential probabilities. However, no effects were found that strongly support the hypothesis that people do indeed use these distributional properties of the lexicon in the segmentation of spoken language.
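To illustrate the kind of distributional statistic at issue, the sketch below estimates, from a toy phonemically transcribed word list, how strongly a two-sound sequence is biased toward word-final versus word-initial position; a high value would mark the sequence as a cue that a word boundary follows. The lexicon and the exact form of the statistic are hypothetical illustrations, not the materials of the study.

```python
from collections import Counter

# Toy phonemic transcriptions; a real analysis would use a full lexicon.
lexicon = ["kamp", "lamp", "stomp", "plan", "plat", "klok"]

final = Counter(w[-2:] for w in lexicon)
initial = Counter(w[:2] for w in lexicon)

def boundary_bias(pair):
    """Share of the pair's occurrences that are word-final rather than word-initial."""
    f, i = final[pair], initial[pair]
    return f / (f + i) if (f + i) else None

print(boundary_bias("mp"))  # 1.0 here: "mp" occurs only word-finally
print(boundary_bias("pl"))  # 0.0 here: "pl" occurs only word-initially
```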
ABSTRACT
In this experiment, the acoustic correlates of perceived emotions in singing were investigated. Singers were instructed to sing one phrase in a neutral way and with each of the emotions anger, joy, fear, and sadness. Listeners rated the strength of the perceived emotions for each fragment. Principal component analyses were performed on the listeners' ratings. The derived factors were interpreted as listening strategies, and a listener's factor loading as an indicator of the extent to which that listener used that strategy. Using the original ratings and the factor loadings, the phrases were assigned composite ratings for each emotion. Acoustic measures of spectral balance, vibrato, duration and intensity were related to the composite ratings using multiple regression analyses. It was found that anger was associated with the presence of vibrato; joyous phrases had vibrato, a short final duration, and a shallow spectral slope; sadness was associated with the absence of vibrato, long duration, and low intensity; and fear was related to a steep spectral slope.
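The analysis pipeline can be sketched briefly: a principal component analysis over the listener-by-phrase rating matrix yields factors (the listening strategies) and per-listener loadings, and loading-weighted composite ratings are then regressed on the acoustic measures. The scikit-learn sketch below uses random placeholder arrays in place of the study's data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ratings = rng.random((20, 40))   # 20 listeners x 40 sung fragments (placeholder)
acoustics = rng.random((40, 4))  # spectral balance, vibrato, duration, intensity

# PCA with listeners as variables: component weights act as factor loadings.
pca = PCA(n_components=2).fit(ratings.T)
loadings = pca.components_[0]    # each listener's weight on the first factor

# Loading-weighted composite rating per fragment, then multiple regression.
composite = ratings.T @ loadings
model = LinearRegression().fit(acoustics, composite)
print(model.coef_)               # relation of each acoustic measure to the ratings
```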
ABSTRACT
A study by Pitt and Samuel (1990) found that English speakers could narrowly focus attention onto a precise phonemic position inside spoken words [1]. This led the authors to argue that the phoneme, rather than the syllable, is the primary unit of speech perception. Other evidence, obtained with a syllable detection paradigm, has been put forward to propose that the syllable is the unit of perception; yet, these experiments were run with French speakers [2]. In the present study, we adapted Pitt & Samuel's phoneme detection experiment to French and found that French subjects behave exactly like English subjects: they too can focus attention on a precise phoneme. To explain both this result and the established sensitivity to syllabic structure, we propose that the perceptual system automatically parses the speech signal into a syllabically-structured phonological representation.
ABSTRACT
In this study, vowels in /CVC/ environments are compared with steady-state vowels to investigate the perceived vowel quality change caused by undershoot. The study uses a perceptual matching task in which listeners match constant /CVC/ stimuli of /bVb/ or /dVd/ to variable /#V#/ stimuli, using a schematic grid on a PC screen. The grid represents an acoustic vowel diagram, and the subjects change the F1/F2 frequencies of /#V#/ by moving a mouse. The main results of the study show that while subjects referred to the trajectory peak of the /CVC/ stimuli in vowel quality perception, their performance was also affected by the formant trajectory range of the stimuli. When the formant trajectory range was small, they selected a value between the edge and peak frequencies, whereas they selected a value outside the trajectory range when it was large.
ABSTRACT
Words can be distinguished by segmental differences, by suprasegmental differences, or both. Studies of English suggest that suprasegmentals play little role in human spoken-word recognition; English stress, however, is nearly always unambiguously coded in segmental structure (vowel quality); this relationship is less close in Dutch. The present study directly compared the effects of segmental and suprasegmental mispronunciation on word recognition in Dutch. There was a strong effect of suprasegmental mispronunciation, suggesting that Dutch listeners do exploit suprasegmental information in word recognition. Previous findings indicating that the effects of mis-stressing in Dutch differ with stress position were replicated only when segmental change was involved, suggesting that this is an effect of segmental rather than suprasegmental processing.
ABSTRACT
This project re-examines the perceptual weight of vowel duration and the first two vowel formant frequencies as determinants of phonologically short and long vowels in Swedish. Based on listeners' responses to synthesized sets of materials for [I]-[i:], []-[o:] and [a]-[a:], the results indicate that vowel duration is of primary importance for distinguishing [I] from [i:] and [] from [o:], whereas both formant frequencies and vowel duration were found to influence the distinction of [a] from [a:].
ABSTRACT
A set of three perceptual experiments is described. These experiments were designed to provide identification scores on CV sequences for French. Original stimuli were augmented with acoustic "monsters" in which bursts were excised or replaced. The first identification task shows that information carried by vocalic transitions can be overwritten by burst information. The importance of this phenomenon is inversely proportional to vowel aperture. The second experiment shows that these results are almost insensitive to the relative amplitudes of the burst and the vowel. In the third experiment we manipulated the voice onset time (VOT) of the monsters using high-quality analysis-resynthesis. Stimuli with a very short VOT were perceived as bilabials, but VOT manipulation did not affect the /t/-/k/ confusions. These experiments argue for a dynamic model of stop identification in which burst and vocalic transitions both contribute and compete in the phonetic decision.
ABSTRACT
In order to investigate the relationship between human perception in speaker identification and acoustic features (fundamental frequency (f0), spectrum, and duration) under various communication conditions, this paper describes several perception experiments and an approach to predicting the perceptual contribution rate of each feature. Factors taken into account in this paper are: (1) speaker familiarity and (2) background noise. The results show that: (1) the perceptual contribution rate increases as the distance of an acoustic feature increases, (2) the spectral contribution rates for familiar speakers are larger than those for unfamiliar speakers, (3) the contribution of f0 tends to increase as the noise increases, and (4) at the same S/N ratio, the contribution of f0 is larger in computer-room noise than in car noise.
ABSTRACT
This paper presents a perceptual experiment on stimuli synthesized by means of a vocal tract area function model. The purpose was to compare the contribution of dynamic versus static information to the identity of a coarticulated vowel. Three sources of information were perceptually analyzed: (i) the vowel nucleus; (ii) the acoustical contrast between the vowel nucleus and the stationary parts of its immediate context; and (iii) the transitions linking the stable parts of the speech signal. The results show that the vocoids were better identified from dynamic information. This backs up the perceptual overshoot model proposed by Lindblom and Studdert-Kennedy (1967). However, this conclusion must be confirmed by further experiments.
ABSTRACT
Both historical sound change and laboratory confusion studies show strong asymmetries in consonant confusions. In particular, /ki/ commonly changes to /ti/, and /pi/ to /ti/, but not the reverse. It is hypothesized that such asymmetries arise when two sounds are acoustically similar except for one or more differentiating cues which are subject to a highly directional perceptual error. This perceptual entropy can be explained as follows: if sound x possesses a cue that y lacks, listeners are more likely to miss this "all-or-none" cue than to introduce it spuriously. /k/ and /t/ before /i/ have similar formant transitions but differ in their burst spectra. /p/ and /t/ before /i/ also have similar formant transitions but differ in the intensity of their bursts. The importance of these differentiating features for listeners' perception was verified in a confusion study. The implications of the inversely related effects of perceptual and physical entropy for phonetic theory and speech technology are discussed.
ABSTRACT
Phonological priming between spoken words was examined using CVCVC bisyllabic pseudoword primes and word or pseudoword targets. The influence of different types of overlap was compared, with prime and target sharing the coda, the rime or the final syllable. The task was target shadowing. Two priming conditions were used, the auditory targets being preceded by auditory primes in the unimodal situation and by visual primes in the crossmodal situation. Priming effects were obtained under unimodal stimulation only. A strong facilitation occurred with syllable overlap, while a smaller facilitation was found with rime overlap. Coda overlap produced no effect. The absence of an effect under crossmodal stimulation suggests that the final overlap effects occur before the semantic system. Concerning the underlying units, a comparison of our results with those obtained from CCVC monosyllables, with overlaps similar in phonemic length to those we used, suggests that both rime and syllabic units per se are involved in the effects of final similarity between spoken words.
ABSTRACT
Experiments were performed to investigate perceptual contributions of static and dynamic features of vocal tract characteristics to talker individuality. An ARX (Auto-regressive with exogenous input) speech production model was used to extract separately voice source and vocal tract parameters from a Japanese sentence, /aoiueoie/ ("Say blue top" in English). The Discrete Cosine Transform (DCT) was applied to resolve formant trajectories of the speech signal into static and dynamic components. The perceptual contributions were quantitatively studied by systematically replacing the corresponding formant components extracted from Japanese sentences uttered by three males. Results of the experiments show that the static (average) characteristic of the vocal tract is a primary cue to talker individuality.
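The DCT split referred to above can be illustrated compactly: the 0th DCT coefficient of a formant trajectory carries its static (average) component, and the remaining coefficients carry the dynamic component, so zeroing one set and inverting isolates the other. The scipy sketch below uses a toy F2 trajectory in place of the ARX-derived tracks; swapping the static components between talkers while keeping the dynamics is the kind of replacement the experiments performed.

```python
import numpy as np
from scipy.fft import dct, idct

# Toy F2 trajectory in Hz; real trajectories come from the ARX-based analysis.
f2 = 1500 + 300 * np.sin(np.linspace(0, np.pi, 50))

coeffs = dct(f2, norm="ortho")
static = idct(np.r_[coeffs[0], np.zeros(len(coeffs) - 1)], norm="ortho")
dynamic = idct(np.r_[0.0, coeffs[1:]], norm="ortho")

# The split is exact: static + dynamic reconstructs the original trajectory.
assert np.allclose(static + dynamic, f2)
```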