Authors:
Masato Akagi, Japan Advanced Institute of Science and Technology (Japan)
Mamoru Iwaki, Japan Advanced Institute of Science and Technology (Japan)
Tomoya Minakawa, Japan Advanced Institute of Science and Technology (Japan)
Page (NA) Paper number 27
Abstract:
This paper reports how rapid fluctuations of fundamental frequencies
in continuously uttered vowels influence vowel quality and shows that
vowel qualities with various fundamental frequency fluctuations can
be discriminated perceptually. For this purpose, electroglottographs
(EGGs) of vowels uttered by nine males were obtained using a Laryngograph,
and fundamental frequencies with rapid fluctuations were estimated
from them. Analysis of the forty-five estimated fundamental frequencies
showed that they can be classified into four groups. Moreover, psychoacoustic experiments,
with five subjects, evaluating voice quality by multidimensional scaling
(MDS) showed that voice quality of the synthesized speech using the
fundamental frequencies of the groups was completely discriminable
and there was a distinctive frequency band of fundamental frequency
fluctuation for specifying each group perceptually.
Authors:
Shigeaki Amano, NTT Basic Research Labs. (Japan)
Tadahisa Kondo, NTT Basic Research Labs. (Japan)
Page (NA) Paper number 15
Abstract:
A familiarity database was developed for about 80,000 Japanese words
whose familiarity scores were rated by 32 Japanese adults using
a 7-point scale in auditory, visual, and audio-visual modalities. Auditory,
visual, and audio-visual stimulus words were selected from the database
according to their word familiarity for size estimation of the mental
lexicon. Sixty Japanese adults participated in a two-alternative forced-choice
task (Know-Don't know) for the stimulus words. The size of the mental
lexicon was estimated as the number of words whose familiarity is
above that of the word corresponding to the 50% point on a logistic
curve fitted to the "know"-response probability of the stimulus words. The estimated
size was about 68,000 for auditory words, and about 66,000 for both
visual and audio-visual words when homophones and homographs were included.
The results suggest that the mental lexicon size differs very little
among modalities.
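The estimation procedure described in this abstract can be illustrated with a minimal sketch. It assumes a one-predictor logistic curve fitted by Newton-Raphson; the response counts, the familiarity values in `database`, and all function and variable names are hypothetical and are not taken from the paper.

```python
import math

def fit_logistic(levels, n_know, n_total, iterations=25):
    """Fit P(know) = 1 / (1 + exp(-(a + b * x))) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, k, n in zip(levels, n_know, n_total):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = k - n * p              # residual: observed minus expected "know" count
            w = n * p * (1.0 - p)      # binomial variance weight
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system for the parameter update.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: familiarity level of each stimulus group and the
# number of "know" responses out of 60 participants.
levels = [1, 2, 3, 4, 5, 6, 7]
n_know = [3, 9, 18, 30, 42, 51, 57]
n_total = [60] * 7

a, b = fit_logistic(levels, n_know, n_total)
threshold = -a / b   # familiarity at which P(know) = 0.5

# Hypothetical familiarity scores standing in for the word database.
database = [1.2, 2.5, 3.9, 4.1, 5.0, 6.3, 6.8]
lexicon_size = sum(1 for f in database if f > threshold)
```

The lexicon size estimate is simply the count of database entries whose familiarity lies above the 50% point of the fitted curve.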
Authors:
Matthew Aylett, Human Communication Research Centre, University of Edinburgh (U.K.)
Alice Turk, Department of Linguistics, University of Edinburgh (U.K.)
Page (NA) Paper number 824
Abstract:
Clear speech is characterised by longer segmental durations and less
target undershoot which results in more extreme spectral features.
This paper deals with the clarity of vowels produced in spontaneous
speech in a large corpus of task-oriented dialogues. We present an
automatic technique for measuring vowel clarity on the basis of a vowel's
spectral characteristics. This technique was evaluated using a perceptual
test. Subjects rated the 'goodness' of vowels with different spectral
characteristics with controlled duration and amplitude and these results
were compared with an automatic rating. Results indicated that although
agreement between subjects and the automatic measurement was poor, it
was as poor as the agreement among subjects. On the basis of these
results we address the following questions: 1. Can subjects reliably
judge the clarity of vowels excerpted from spontaneous speech without
duration cues? 2. Can a statistical model reliably predict the subjects'
response to such vowels?
Authors:
Adrian Neagu, Institut de la Communication Parlee, Grenoble (France)
Gérard Bailly, Institut de la Communication Parlee, Grenoble (France)
Page (NA) Paper number 1009
Abstract:
In this paper, we study the influence of the vocalic context on the
perception and automatic recognition of stops. In a previous perception
experiment [1] using conflicting-cue stimuli, we showed that the place
of articulation cued by formant transitions may be overridden by the
place cued by the burst. This effect is inversely proportional to
the vowel aperture. Here we give special attention to the /i/ context, where
neither the burst nor the formant transitions seem to carry rich information on
place of articulation. We present automatic recognition experiments
that confirm the perception results: taking both segments into account
increases identification rates, early fusion of segmental cues performs
best, and most errors come from the front unrounded vocalic context.
We introduce the "burst characteristic frequency" (BF), which compensates
for the poor discriminative power of the traditional cues in the front
context. Moreover, we present perception results showing the perceptual
relevance of BF.
Authors:
Anne Bonneau, LORIA-CNRS (France)
Yves Laprie, LORIA-CNRS (France)
Page (NA) Paper number 260
Abstract:
With synthetic stimuli copied from natural vowels and including up
to five formants, we investigated some transformations of the perceived
identity of vowels by means of modifications of formant amplitude levels.
Synthetic stimuli were generated by means of a new method of copy synthesis.
With stimuli copied from /u/ we found that, despite the presence of
F2, it is possible to transform the timbre of /u/ into that of a front
vowel only by raising the amplitude level of F3 and higher formants.
We also analysed how the timbre of the vowels /i/ and /y/ changed
as a function of the level of F2, F3 and F4, and showed how sensitive
the timbre of the vowel /i/ was to a decrease in the level of F3
and F4. Such transformations, realized with stimuli very close to
natural vowels, reinforce the importance of formant amplitude levels
in some vocalic distinctions.
Authors:
Hsuan-Chih Chen, Department of Psychology, The Chinese University of Hong Kong (China)
Michael C.W. Yip, Department of Psychology, The Chinese University of Hong Kong (China)
Sum-Yin Wong, Department of Psychology, The Chinese University of Hong Kong (China)
Page (NA) Paper number 660
Abstract:
In a tone language, such as Cantonese, both segmental and tonal distinctions
between words are pervasive. However, previous work in Cantonese has
demonstrated that in speeded-response tasks, tone is more likely to
be misprocessed than is segmental structure. The present study examined
whether this tone disadvantage would also hold after the initial auditory
processing of a syllable had been done. Cantonese listeners were asked
to perform same-different judgments on two sequentially presented open
syllables along a specific dimension (i.e., onset, rime, tone, or the
whole syllable) according to an instruction which was visually presented
at the acoustic offset of the second syllable. Manipulating whether
the difference between two syllables was in onset, rime, or tone resulted
in equally robust effects across the various decision tasks on performance,
indicating that tone functions as effectively as segmental structure
in spoken-word processing once the related information of a syllable
is encoded.
Authors:
Michael C.W. Yip, Department of Psychology, The Chinese University of Hong Kong (China)
Po-Yee Leung, Department of Psychology, The Chinese University of Hong Kong (China)
Hsuan-Chih Chen, Department of Psychology, The Chinese University of Hong Kong (China)
Page (NA) Paper number 661
Abstract:
A Cantonese experiment is described in which the shadowing of spoken
targets as a function of phonological similarity to either a succeeding
prime (backward priming) or a preceding prime (forward priming) is
investigated. In the backward priming conditions, alternations of onset,
rime, or tone between prime and target produced inhibition, whereas
in the forward priming conditions, alternations of tone led to facilitation.
The results are discussed in terms of the processing and memory of
Cantonese syllables.
Authors:
Bob I. Damper, University of Southampton (U.K.)
Steve R. Gunn, University of Southampton (U.K.)
Page (NA) Paper number 843
Abstract:
The categorical perception (CP) of syllable-initial stop consonants
has been intensively studied using psychophysical procedures over many
decades. However, computational models consisting of an auditory `front
end' and a learning system as a `back end' convincingly mimic the essentials
of CP. Unlike real listeners, such models can be systematically manipulated
to uncover the basis of their categorisations. In this paper, we explore
the use of modern inductive learning techniques in simulating CP.
Authors:
Loredana Cerrato, Fondazione Ugo Bordoni (Italy)
Mauro Falcone, Fondazione Ugo Bordoni (Italy)
Page (NA) Paper number 463
Abstract:
We report the results of a study carried out to analyse the acoustic
and perceptual characteristics of Italian stop consonants. The aim
of this study is twofold: to give an acoustical description of Italian
stops and to investigate which perceptual cues signal their
place of articulation. From the acoustic point of view we report
measurements of the duration of the whole consonant and
of its release burst, and of the F1 and F2 of the following vowel measured
at its onset. Moreover, we counted how often the release
burst was present and tried to describe its acoustical characteristics in terms
of spectral structure. From the perceptual point of view we report
the results of three perceptual tests that we ran with the aim of evaluating
whether the release burst or the formant transitions are more relevant
for the perception of Italian stop consonants' place of articulation.
Authors:
Santiago Fernández, Departamento de Física Aplicada. Universidad de Santiago de Compostela (Spain)
Sergio Feijóo, Departamento de Física Aplicada. Universidad de Santiago de Compostela (Spain)
Ramon Balsa, Departamento de Física Aplicada. Universidad de Santiago de Compostela (Spain)
Nieves Barros, Departamento de Física Aplicada. Universidad de Santiago de Compostela (Spain)
Page (NA) Paper number 451
Abstract:
This study deals with the distinction of the fricative noises of the
Spanish fricatives /th/ and /f/. Previous studies revealed that fricative
noises of both phonemes are perceptually similar, auditory identification
being significantly dependent on contextual effects: /f/ in the /u/
context is well identified (about 85% correct identification rate),
while in the /e/ context identification is much lower (about 60%).
Identification of /th/ is low for every vocalic context (about 60%).
These effects were identical for both Hypo and Hyper forms of speech.
The objective of this paper is to determine which acoustic properties
of /f/ in the /u/ context make it a well defined phoneme for the two
different forms of speech. We conclude that the cues for the identification
of the isolated fricative noise of /f/ seem to be in the low frequency
region of the spectrum.
Authors:
Santiago Fernández, Departamento de Física Aplicada. Universidad de Santiago de Compostela. (Spain)
Sergio Feijóo, Departamento de Física Aplicada. Universidad de Santiago de Compostela. (Spain)
Ramon Balsa, Departamento de Física Aplicada. Universidad de Santiago de Compostela. (Spain)
Nieves Barros, Departamento de Física Aplicada. Universidad de Santiago de Compostela. (Spain)
Page (NA) Paper number 452
Abstract:
The role of fricative context on vowel recognition in a series of FV
syllables being part of natural Spanish words is investigated. Perceptual
tests were carried out to assess the recognition of vowels in fricative
context, in two conditions: 1) Isolated vowel; 2) Fricative noise +
vowel. Analysis of the results shows that adding the fricative noise improves
the recognition of the vowel, while the acoustic analysis reveals that
the distribution of the vowels is affected by fricative context. A
possible explanation for this improvement, i.e. the coarticulatory
influence of the vowel on the fricative, was investigated. The results
indicate that coarticulation cannot explain that improvement, since
only 7.7% of the cases that improve when the fricative is added show
a clear influence of the vowel on the fricative.
Authors:
Santiago Fernández, Departamento de Física Aplicada. Universidad de Santiago de Compostela. (Spain)
Sergio Feijóo, Departamento de Física Aplicada. Universidad de Santiago de Compostela. (Spain)
Plinio Almeida, Instituto de Estudos Linguisticos. Universidade de Campinas. (Brazil)
Page (NA) Paper number 453
Abstract:
The perception of voiced fricatives by native speakers of a language
which lacks those phonemes is studied in this paper. Brazilian Portuguese
and Galician were chosen because they are historically related.
A forced choice test reveals that listeners correctly perceive the
place of articulation of the voiced fricatives. In order to examine
whether the perception of fricative manner can be overridden by the
voicing characteristics, an open test was carried out. Listeners perceive
voiced fricatives as a voiced phoneme with different manner of articulation
and similar place of articulation or as its voiceless counterpart,
depending on whether vocal-fold vibration extends over the whole obstruent
interval or not. Results are discussed in terms of both historical
phonetic changes and second language acquisition.
Authors:
Valerie Hazan, University College London (U.K.)
Andrew Simpson, University College London (U.K.)
Mark Huckvale, University College London (U.K.)
Page (NA) Paper number 487
Abstract:
The aim of our work is to increase the intelligibility of speech in
noise by modifying regions of the signal that contain acoustic cues
to consonant identity in order to make it more resistant to subsequent
degradation. 36 vowel-consonant-vowel stimuli were recorded by four
untrained speakers. The vowel onset/offset and consonant constriction/occlusion
regions were selectively amplified and stimuli were presented to listeners
in a background of noise (0 dB SNR). Enhanced tokens from all speakers
were significantly more intelligible than natural tokens and the improvement
was greater for the initially least intelligible speakers. Speech material
for two speakers was then presented to Japanese and Spanish learners
of English and controls. For all groups, the enhanced consonants were
more intelligible. Error patterns were related to the 'distance' between
the consonantal systems of the listeners' L1 and L2. These results
demonstrate the robustness of our enhancement techniques across speaker
and listener types.
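The two signal manipulations named in this abstract, selective amplification of cue regions and presentation in noise at 0 dB SNR, can be sketched as follows. This is an illustration only: the 6 dB gain, the region boundaries, the test signal, and the function names are assumptions, since the abstract does not give those details.

```python
import math
import random

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def amplify_regions(signal, regions, gain_db):
    """Boost the samples inside each (start, end) region by gain_db."""
    gain = 10.0 ** (gain_db / 20.0)
    out = list(signal)
    for start, end in regions:
        for i in range(start, end):
            out[i] *= gain
    return out

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise RMS ratio equals snr_db."""
    scale = rms(speech) / (rms(noise) * 10.0 ** (snr_db / 20.0))
    scaled = [scale * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled

random.seed(0)
# Hypothetical stand-ins for a recorded VCV token and a noise recording.
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [random.gauss(0.0, 1.0) for _ in range(200)]

# Hypothetical cue regions (sample indices) and gain.
enhanced = amplify_regions(speech, [(40, 60), (120, 140)], gain_db=6.0)
mixed, scaled_noise = add_noise_at_snr(enhanced, noise, snr_db=0.0)
```

At 0 dB SNR the scaled noise has the same RMS level as the speech it is mixed with.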
Authors:
Fran H.L. Jian, Dept. Linguistics, University of Reading (U.K.)
Page (NA) Paper number 145
Abstract:
In this work we set out to investigate the fundamental frequency boundaries
of perception of the Taiwanese long tones. We are interested in how
the variations in fundamental frequency affect the perception of linguistic
tones in Taiwanese speech. Our investigation is adapted from similar
studies of tones in Mandarin speech. As opposed to Mandarin tones,
which can be perceived with little difficulty, the seven Taiwanese tones
have a more subtle structure and are consequently harder to perceive
successfully. The experimental results in this paper allow us to quantify
these perceptual boundaries. The experiments consisted of a perception
test involving over 150 Taiwanese subjects where the task involved
identifying the tone of the words played back in a random sequence.
The stimuli consisted of a set of tone pairs and a selection of intermediate
tone words obtained by linearly interpolating between the words of
the tone pairs.
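The construction of intermediate stimuli by linear interpolation can be sketched as below. The sketch assumes that interpolation is applied to fundamental frequency contours sampled at the same time points; the contour values and the function name are hypothetical, and the abstract does not specify how the interpolation was actually carried out.

```python
def interpolate_contours(f0_a, f0_b, n_intermediate):
    """Return the two end-point contours plus n_intermediate evenly
    spaced contours between them (values in Hz)."""
    if len(f0_a) != len(f0_b):
        raise ValueError("contours must have the same number of samples")
    contours = []
    for step in range(n_intermediate + 2):
        w = step / (n_intermediate + 1)   # 0.0 at tone A, 1.0 at tone B
        contours.append([a + w * (b - a) for a, b in zip(f0_a, f0_b)])
    return contours

# Hypothetical F0 contours for a level tone and a rising tone.
level_tone = [100.0, 100.0, 100.0]
rising_tone = [100.0, 130.0, 160.0]
continuum = interpolate_contours(level_tone, rising_tone, n_intermediate=2)
```

Each contour in `continuum` would drive the resynthesis of one stimulus on the continuum between the two tones.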
Authors:
Hiroaki Kato, ATR HIP (Japan)
Minoru Tsuzaki, ATR HIP (Japan)
Yoshinori Sagisaka, ATR ITL (Japan)
Page (NA) Paper number 411
Abstract:
To establish a perceptually valid rule for the durational control of
synthetic speech, it is necessary to know the degree to which a given
temporal error or distortion is acceptable to human listeners. Two
perceptual experiments were conducted to estimate the acceptability
of modifications in either vocalic or consonantal durations as a function
of two attributes of the modified portions, i.e., the phonetic quality
and the original (unmodified) duration. The results showed that the
listeners' acceptable modification ranges were narrowest for vowels,
and widest for voiceless fricatives and silent closures, with nasals
in between. They were also narrower for those portions with shorter
base durations. The effect of the original duration was larger for
the vowel stimuli than for the voiceless fricative stimuli. The perceptual
mechanism mediating these results is discussed with regard to the dependency
of the listeners' temporal sensitivity on the stimulus loudness and
base duration. [Re: http://www.hip.atr.co.jp/~kato/single_duration/]
Authors:
Michael Kiefte, University of Alberta (Canada)
Terrance M. Nearey, University of Alberta (Canada)
Page (NA) Paper number 898
Abstract:
In order to assess the importance of dynamic spectral information within
the first few milliseconds following oral release for the identification
of prevocalic stop consonants, 23.75 ms gated CV syllables were presented
to listeners for identification. In addition to these, subjects were
presented with the same tokens reconstructed from their minimum phase
decomposition such that they have the same long-term power spectrum
as their original counterparts, but with differing internal dynamic
spectral detail. Subjects' results from this experiment were then
modelled with logistic regression analysis using mel cepstral coefficients
with and without dynamic spectral information encoded in order to demonstrate
the effect that reduced temporal information has in the context of
automatic classification. Preliminary results from this experiment
show that some dynamic spectral detail is used by listeners even for
very short stimuli. We conclude that models of speech perception must
take spectral variation over very short time frames into account.
Sound files:
0898_01.WAV: the original token.
0898_02.WAV: the minimum phase reconstruction.
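A minimum phase reconstruction keeps the magnitude (and hence power) spectrum of a signal while changing its temporal detail. The sketch below uses the standard homomorphic (cepstral folding) method with a naive DFT; it is an assumption that this resembles the authors' procedure, which the abstract does not describe, and the short test signal is purely illustrative.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def minimum_phase(x):
    """Minimum phase signal with the same magnitude spectrum as x.
    Assumes len(x) is even and the spectrum has no exact zeros."""
    n = len(x)
    log_mag = [math.log(abs(v)) for v in dft(x)]
    cepstrum = [v.real for v in dft(log_mag, inverse=True)]
    # Fold the real cepstrum onto non-negative quefrencies.
    folded = [0.0] * n
    folded[0] = cepstrum[0]
    for t in range(1, n // 2):
        folded[t] = 2.0 * cepstrum[t]
    folded[n // 2] = cepstrum[n // 2]
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]

# Illustrative maximum phase signal: its energy arrives late.
original = [0.5, 1.0] + [0.0] * 62
reconstructed = minimum_phase(original)

mag_original = [abs(v) for v in dft(original)]
mag_reconstructed = [abs(v) for v in dft(reconstructed)]
```

For this test signal the reconstruction moves the energy to the front of the signal while the two magnitude spectra stay equal, which is the property the stimuli in the abstract rely on.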
Authors:
Takashi Otake, Dokkyo University (Japan)
Kiyoko Yoneyama, Ohio State University (USA)
Page (NA) Paper number 35
Abstract:
This paper explores the relationship between phonological units in
speech segmentation and phonological awareness by investigating Japanese
Brazilians living in Japan. The first experiment investigated the
size of the phonological unit in speech segmentation using the Japanese
materials and methodology in Otake et al. (1993). As with the French subjects
in the earlier study, the miss rates showed an effect of syllabic segmentation,
suggesting that the Japanese Brazilians segmented Japanese into syllables.
The second experiment investigated phonological units in phonological
awareness, using a mid-chunk-unit search task in which subjects were
asked to identify the middle unit within a word. 96% of the mid-chunk
unit choices were syllable-based. The results of the two experiments
suggest that Japanese Brazilians exploit syllables both as a speech
segmentation unit and as a unit to represent within-word structure.
Authors:
Elizabeth Shriberg, SRI International (USA)
Andreas Stolcke, SRI International (USA)
Page (NA) Paper number 58
Abstract:
Speakers frequently retrace one or more words when continuing after
a break in fluency. Syntactic principles constrain the points from
which speakers retrace; however, syntactic principles do not provide
predictions about the relative usage of different allowable retrace
points. Such predictions are useful for automatic processing of repairs
in speech technology, particularly if they use information readily
available to a speech recognizer. We propose a quantitative model
that predicts the overall distribution of retrace lengths in a large
corpus of spontaneous speech, based only on word position. The model
has two components: (1) a constant, position-independent probability
for extending a retrace by one more word; and (2) a position-dependent
probability to "skip" to the beginning of the sentence. Results have
implications for modeling repairs in speech applications and constrain
explanatory models in psycholinguistics.
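One way to read the two-component model is sketched below. The exact parameterisation is an assumption made for illustration: a retrace starts at one word, is extended by one more word with constant probability `p_extend`, and with probability `p_skip` (which would vary with word position) jumps straight to the beginning of the sentence. The function name and the parameter values are hypothetical.

```python
def retrace_length_distribution(position, p_extend, p_skip):
    """Probability of each retrace length (in words) for a break after
    `position` words of the sentence, so the longest retrace is `position`."""
    probs = {}
    for length in range(1, position):
        # No skip, (length - 1) extensions, then stop.
        probs[length] = (1.0 - p_skip) * (1.0 - p_extend) * p_extend ** (length - 1)
    # Reaching the sentence start: either by skipping or by extending all the way.
    probs[position] = p_skip + (1.0 - p_skip) * p_extend ** (position - 1)
    return probs

# Hypothetical parameter values for a break after the third word.
distribution = retrace_length_distribution(position=3, p_extend=0.5, p_skip=0.2)
```

Because both components depend only on word position, a distribution of this kind could be computed from information a speech recognizer already has.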
Authors:
Karsten Steinhauer, Max-Planck-Institute of Cognitive Neuroscience, Inselstr. 22, D-04103, Leipzig (Germany)
Kai Alter, Max-Planck-Institute of Cognitive Neuroscience, Inselstr. 22, D-04103, Leipzig (Germany)
Angela D. Friederici, Max-Planck-Institute of Cognitive Neuroscience, Inselstr. 22, D-04103, Leipzig (Germany)
Page (NA) Paper number 147
Abstract:
This paper investigates the prosodic relevance of a pause which, along
with other prosodic parameters, served to indicate an Intonational
Phrase (IPh) boundary. Event-related brain potentials (ERPs) were recorded
while subjects listened to both intact and altered German Early and
Late Closure (EC/LC) sentences. The EC sentences were prosodically
highly accepted and well comprehended even when the original pause
at the boundary position was removed. Furthermore, a reversed garden-path
(initial EC preference in LC sentences) was successfully induced by
a false IPh boundary irrespective of whether the pause was present
or not. The ERP patterns disclosed the on-line processing of simple
and garden-path sentences in more detail. The data clearly demonstrate
that in the presence of other prosodic parameters pause insertion is
a completely dispensable cue for boundary marking. The ERP technique
proved to be superior to behavioral on-line measures as data collection
does not interrupt speech presentation.
Authors:
Jean Vroomen, University of Tilburg (The Netherlands)
Beatrice de Gelder, University of Tilburg (The Netherlands)
Page (NA) Paper number 348
Abstract:
In the present study, we examined whether stress constrains the number
of activated lexical candidates. In a phoneme monitoring task, we used
Dutch carrier words that start in their citation form with a reduced
vowel (denoted as @), but which can also be produced with an unreduced
vowel. For example, a word such as frequent (meaning frequent) can
be pronounced as fr@QUENT or freQUENT. We examined whether mis-stressing
these words had an effect on the activation of their lexical representation.
Twenty subjects detected a target phoneme (e.g., the 't') in fr@QUENT,
freQUENT, FR@quent, or FREquent; stress denoted in capitals. Results
showed that target phonemes were detected faster in words than in pseudowords,
but neither stress nor the nature of the vowel had an effect on the
size of the lexical effect. This confirms that stress is not part of the
lexical input representation.
Authors:
Jyrki Tuomainen, Tilburg University (The Netherlands)
Jean Vroomen, Tilburg University (The Netherlands)
Beatrice de Gelder, Tilburg University (The Netherlands)
Page (NA) Paper number 760
Abstract:
The effect of word level prominence on detection speed of word boundaries
in Finnish was investigated in two word spotting experiments. The results
showed that the perceived stress was not a function of the fundamental
frequency (F0) difference between the preceding syllable and the first
syllable of the target word. Given the fast response times, the results
suggest that in both experiments subjects perceived the first syllable
of the target as stressed. This seems to indicate that when words are
recognized in continuous speech the acoustic cues in the F0 contour
signaling prominence may not be computed relative to the prominence
of neighboring syllables. Instead, we hypothesize that subjects may
be sensitive to a local pitch movement indicating change in the F0
slope.
Authors:
Kimiko Yamakawa, Prefectural University of Kumamoto (Japan)
Ryoji Baba, Prefectural University of Kumamoto (Japan)
Page (NA) Paper number 312
Abstract:
Usually the first vowel of the Japanese word 'susugi' (rinse) disappears,
and so the pronunciation of 'susugi' is not [susugi] but [ssugi]. We
conducted two psychophysical experiments. In the first, we shortened
the [ss] portion of [ssugides] and [korewassugides] (This is rinse.)
in six steps. [korewassugides] with shortened [ss] is easily perceived
as [korewasugides] (This is cedar), but [ssugides] with shortened
[ss] is not easily perceived as [sugides] (Japanese cedar). In the
second experiment, we changed the pitch of [sugides] and examined
how Japanese listeners perceive these sounds.