Authors:
Peter Blamey, The University of Melbourne (Australia)
Julia Sarant, Bionic Ear Institute (Australia)
Tanya Serry, Bionic Ear Institute (Australia)
Roger Wales, The University of Melbourne (Australia)
Christopher James, The University of Melbourne (Australia)
Johanna Barry, The University of Melbourne (Australia)
Graeme M. Clark, The University of Melbourne (Australia)
M. Wright, Children's Cochlear Implant Centre (Australia)
R. Tooher, Children's Cochlear Implant Centre (Australia)
C. Psarros, Children's Cochlear Implant Centre (Australia)
G. Godwin, Children's Cochlear Implant Centre (Australia)
M. Rennie, Children's Cochlear Implant Centre (Australia)
T. Meskin, Children's Cochlear Implant Centre (Australia)
Page (NA) Paper number 248
Abstract:
Fifty-seven children with impaired hearing aged 4-12 years were evaluated
with speech perception and language measures as the first stage of
a longitudinal study. The Clinical Evaluation of Language Fundamentals
(CELF) and Peabody Picture Vocabulary Test (PPVT) were used to evaluate
the children's spoken language. Regression analyses indicated that
scores on both tests were significantly correlated with chronological
age, but delayed relative to children with normal hearing. Performance
increased at 45% of the rate expected for children with normal hearing
for the CELF, and 62% for the PPVT. Perception scores were not significantly
correlated with chronological age, but were highly correlated with
results on the PPVT and CELF. The data suggest a complex relationship
whereby hearing impairment reduces speech perception, which slows language
development, which has a further adverse effect on speech perception.
Authors:
Catia Cucchiarini, A2RT, University of Nijmegen (The Netherlands)
Helmer Strik, A2RT, University of Nijmegen (The Netherlands)
Louis Boves, A2RT, University of Nijmegen (The Netherlands)
Page (NA) Paper number 752
Abstract:
This paper describes an experiment aimed at determining whether native
and non-native speakers of Dutch significantly differ on a number of
quantitative measures related to fluency and whether these measures
can be successfully employed to predict fluency scores. Read speech
of 20 native and 60 non-native speakers of Dutch was scored for fluency
by nine experts and was then analyzed by means of an automatic speech
recognizer in order to calculate nine quantitative measures of speech
quality that are known to be related to perceived fluency. The results
show that the natives' scores on the fluency ratings and on the quantitative
measures significantly differ from those of the non-natives, with the
native speakers being considered more fluent. Furthermore, it appears
that quantitative variables such as rate of speech, phonation-time
ratio, number of pauses, and mean length of runs are able to predict
fluency scores with a high degree of accuracy.
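The four predictor variables named above can be computed directly from a pause-annotated segmentation of the kind an automatic speech recognizer produces. The sketch below shows one plausible formulation; the function name, units, and exact definitions are illustrative assumptions, not the authors' specifications.

```python
# One plausible formulation of the quantitative fluency measures named
# above, computed from a pause-annotated segmentation of an utterance.
# Definitions and units are illustrative assumptions, not the authors'.

def fluency_measures(segments, n_syllables):
    """segments: list of (kind, duration_in_seconds), kind in {'speech', 'pause'}."""
    speech_time = sum(d for k, d in segments if k == "speech")
    pause_time = sum(d for k, d in segments if k == "pause")
    total_time = speech_time + pause_time
    runs = [d for k, d in segments if k == "speech"]  # uninterrupted speech stretches
    return {
        "rate_of_speech": n_syllables / total_time,        # syllables per second
        "phonation_time_ratio": speech_time / total_time,  # share of time spent speaking
        "number_of_pauses": sum(1 for k, _ in segments if k == "pause"),
        "mean_length_of_runs": sum(runs) / len(runs),      # seconds per run
    }
```

Under this formulation, a highly fluent reading yields a high phonation-time ratio and long mean runs, while frequent pausing lowers both.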
Authors:
Paul Dalsgaard, Center for PersonKommunikation (Denmark)
Ove Andersen, Center for PersonKommunikation (Denmark)
William J. Barry, Universität des Saarlandes (Germany)
Page (NA) Paper number 482
Abstract:
The focus of this paper is to formulate an approach to merging phonemes
across languages and to evaluate the resulting cross-language merged
speech units on the basis of the traditional acoustic-phonetic descriptions
of the phonemes. The methodology is based on the belief that some phonemes
across a set of languages may be similar enough to be equated, in contrast
to traditional phonology, which treats the phonemes of one language independently
of the phonemes of another. The identification of cross-language
speech units is performed by an iterative data-driven procedure, which
merges acoustically similar phonemes from within one language as well
as across languages. The paper interprets a number of merged speech
units on the basis of articulatory descriptions.
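An iterative, bottom-up merging of the kind described can be sketched as agglomerative clustering of phone models. In the toy version below, each model is reduced to a mean feature vector and the distance is Euclidean; both are simplifying assumptions standing in for the authors' acoustic models and merging criterion.

```python
# A stand-in for the iterative data-driven merging procedure: repeatedly
# merge the closest pair of phone models (represented here by mean feature
# vectors) until no pair is closer than a distance threshold. Model
# representation and distance measure are simplifying assumptions.

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def merge_phonemes(models, threshold):
    """models: {'language:phoneme': mean feature vector}."""
    merged = {(name,): vec for name, vec in models.items()}
    while len(merged) > 1:
        p, q = min(((p, q) for p in merged for q in merged if p < q),
                   key=lambda pq: dist(merged[pq[0]], merged[pq[1]]))
        if dist(merged[p], merged[q]) > threshold:
            break
        # Average the two models, weighted by how many units each has absorbed.
        vec = [(a * len(p) + b * len(q)) / (len(p) + len(q))
               for a, b in zip(merged[p], merged[q])]
        merged[p + q] = vec
        del merged[p], merged[q]
    return merged
```

Each surviving key is a cross-language speech unit: a tuple of the original phonemes, possibly from different languages, that were judged acoustically similar enough to be equated.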
Authors:
Robert Eklund, Telia Research AB (Sweden)
Elizabeth Shriberg, SRI International (USA)
Page (NA) Paper number 805
Abstract:
We report results from a cross-language study of disfluencies (DFs)
in Swedish and American English human-machine and human-human dialogs.
The focus is on comparisons not directly affected by differences in
overall rates since these could be associated with task details. Rather,
we focus on differences suggestive of how speakers utilize DFs in the
different languages, including: relative rates of the use of hesitation
forms, the location of hesitations, and surface characteristics of
DFs. Results suggest that although the languages differ in some respects
(such as the ability to insert filled pauses within `words'), in many
analyses the languages show similar behavior. Such results provide
suggestions for cross-linguistic DF modeling in both theoretical and
applied fields.
Authors:
Horacio Franco, SRI International (USA)
Leonardo Neumeyer, SRI International (USA)
Page (NA) Paper number 764
Abstract:
Our proposed paradigm for automatic assessment of pronunciation quality
uses hidden Markov models (HMMs) to generate phonetic segmentations
of the student's speech. From these segmentations, we use the HMMs
to obtain spectral match and duration scores. In this work we focus
on the problem of mapping different machine scores to obtain an accurate
prediction of the grades that a human expert would assign to the pronunciation.
We discuss the application of different approaches based on minimum
mean square error (MMSE) estimation and Bayesian classification. We
investigate the characteristics of the different mappings as well as
the effects of the prior distribution of grades in the calibration
database. We finally suggest a simple method to extrapolate mappings
from one language to another.
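The linear case of such a score-to-grade mapping can be written in a few lines. The sketch below fits the straight line that minimizes mean squared error between one machine score and the human grades on calibration data; the paper compares several richer mappings, including Bayesian classification, so this illustrates only the simplest variant.

```python
# A minimal sketch of the linear case: the mapping g_hat = a*x + b from a
# single machine score x to a human grade, chosen to minimize the mean
# squared error over the calibration data. (Only linear MMSE is shown;
# the nonlinear and Bayesian mappings discussed in the paper differ.)

def linear_mmse(scores, grades):
    n = len(scores)
    mx = sum(scores) / n
    mg = sum(grades) / n
    cov = sum((x - mx) * (g - mg) for x, g in zip(scores, grades)) / n
    var = sum((x - mx) ** 2 for x in scores) / n
    a = cov / var          # slope
    b = mg - a * mx        # intercept
    return a, b
```

Note that the intercept depends directly on the mean grade of the calibration set, which is one way the prior distribution of grades in the calibration database affects the resulting mapping.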
Authors:
Petra Geutner, Universitaet Karlsruhe (Germany)
Michael Finke, Carnegie Mellon University (USA)
Alex Waibel, Carnegie Mellon University (USA)
Page (NA) Paper number 771
Abstract:
High OOV-rates are one of the most prevalent problems for languages
with a rapid vocabulary growth, e.g. when transcribing Serbo-Croatian
and German broadcast news. Hypothesis-Driven-Lexical-Adaptation (HDLA)
has been shown to decrease high OOV-rates significantly by using morphology-based
linguistic knowledge. This paper introduces another approach to dynamically
adapt a recognition lexicon to the utterance to be recognized. Instead
of morphological knowledge about word stems and inflection endings,
distance measures based on Levenshtein distance are used. Results based
on phoneme and grapheme distances will be presented. Our distance-based
approach requires no expert knowledge about a specific language and
no definition of complex grammar rules. Instead, grapheme sequences
or the phoneme representation of words are sufficient to apply our
HDLA-algorithm easily to any new language. With our proposed technique
OOV-rates were decreased by more than half from 8.7% to 4%, thereby
also improving recognition performance by an absolute 4.1% from 29.5%
to 25.4% word error rate.
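The distance measure on which this adaptation relies is standard edit distance over symbol sequences. The sketch below is a generic implementation, not the authors' code; it applies equally to grapheme strings and phoneme lists.

```python
# Levenshtein (edit) distance between two symbol sequences: the minimum
# number of insertions, deletions, and substitutions needed to turn one
# into the other, computed with the usual dynamic-programming recurrence.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, sa in enumerate(a, 1):
        cur = [i]
        for j, sb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (sa != sb)))    # substitution
        prev = cur
    return prev[-1]
```

For example, two inflected German forms of the same stem, "spielen" and "spielte", are at grapheme distance 2, so either can be pulled into an adapted lexicon as a close neighbor of the other without any morphological analysis.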
Authors:
Chul-Ho Jo, Kyoto University (Japan)
Tatsuya Kawahara, Kyoto University (Japan)
Shuji Doshita, Kyoto University (Japan)
Masatake Dantsuji, Kyoto University (Japan)
Page (NA) Paper number 741
Abstract:
We propose an effective application of speech recognition to foreign
language pronunciation learning. The objective of our system is to
detect pronunciation errors and provide diagnostic feedback through
speech processing and recognition methods. Automatic pronunciation
error detection is applied to two kinds of mispronunciation: mistakes
and linguistic inheritance from the learner's first language. The correlation
between automatic detection and human judgement demonstrates its reliability.
For feedback guidance on an erroneous phone, we set up classifiers for
the well-recognized articulatory features, place of articulation and
manner of articulation, in order to identify the cause of incorrect
articulation and provide guidance on how to correct the mispronunciation.
Authors:
Roger Ho-Yin Leung, Chinese University of Hong Kong (Hong Kong)
Hong C. Leung, Chinese University of Hong Kong (Hong Kong)
Page (NA) Paper number 229
Abstract:
In this paper, the lexical characteristics of two Chinese dialects
and American English are explored. Different lexical representations
are investigated, including the tonal syllables, base syllables, phonemes,
and the broad phonetic classes. Multiple measurements are made, such
as coverage, uniqueness, and cohort sizes. Our results are based on
lexicons of 44K and 52K words in Chinese and English obtained from
the CallHome Corpus and the COMLEX Corpus, respectively. We have found
that the set of the most frequent 4,000 words has coverage of 92% and
77% for Chinese and English, respectively. The phonetic representation
uniquely specifies 85%, 87% and 93% of the lexicon for Mandarin, Cantonese,
and English, respectively. While the three languages appear quite different
when they are described by their full phoneme sets, their characteristics
are more similar when they are represented in terms of broad phonetic
classes.
Authors:
Sharlene Liu, Nuance (USA)
Sean Doyle, General Magic (USA)
Allen Morris, Soft Gam (USA)
Farzad Ehsani, Sehda (USA)
Page (NA) Paper number 847
Abstract:
We study the effects of modeling tone in Mandarin speech recognition.
Including the neutral tone, there are 5 tones in Mandarin and these
tones are syllable-level phenomena. A direct acoustic manifestation
of tone is the fundamental frequency (f0). We will report on the effect
of f0 on the acoustic recognition accuracy of a Mandarin recognizer.
In particular, we put f0, its first derivative (f0'), and its second
derivative (f0'') in separate streams of the feature vector. Stream
weights are adjusted to investigate the individual effects of f0, f0',
and f0'' on recognition accuracy. Our results show that incorporating
the f0 feature negatively impacted accuracy, whereas f0' increased
accuracy and f0'' seemed to have no effect.
Authors:
Duncan Markham, Deakin University (Australia)
Page (NA) Paper number 424
Abstract:
Tests of foreign accent usually treat native listeners as reliable
providers of accentedness ratings, and pay too little heed to task-specific
effects on non-native speakers' performance. This paper details a number
of factors which in fact influence native listeners' perceptions, and
the native-like behaviour of non-native speakers' productions, based
on the results of a large study of phonetic performance in second language
learners. Listeners were observed to vary, at times considerably, in
their perception of accent depending on context and type of stimulus,
and at times showed distinctly idiosyncratic scoring patterns. Listeners'
reactions to speaker voice pathology, mixed dialect pronunciation,
and artefacts of read speech are discussed, and the effects of using
different types of scoring system are examined.
Authors:
Michael F. McTear, University of Ulster (Ireland)
Eamonn A. O'Hare, St Mark's High School (Ireland)
Page (NA) Paper number 546
Abstract:
This paper reports on an exploratory study in which a group of second
year secondary school pupils with reading ages ranging from 8.3 to
12.9 performed a set of tasks using the IBM VoiceType dictation package
in order to determine the benefits of voice dictation for classroom
use. The results showed that pupils with varying reading ages could
dictate at comparable speeds and often with similar degrees of accuracy.
Homophones were almost never a source of error in the texts produced
with voice dictation, as compared with the children's handwritten texts.
The implications of these findings for the use of dictation software
in the classroom and for further studies of the potential of voice
dictation for improving children's spelling and composition skills
are discussed.
Authors:
Kazuo Nakayama, Yamagata University (Japan)
Kaoru Tomita-Nakayama, Yamagata University (Japan)
Page (NA) Paper number 446
Abstract:
We investigated English spoken word recognition by adult Japanese speakers.
We found that accurate recognition of the first syllable played an important
role in recognizing a word correctly, which implied that recognition performance
could be enhanced by time-scale expansion and/or dynamic range compression,
since the beginning of a word is often too short for the listener to recognize
correctly. In the first experiment, we found that listeners had difficulty
recognizing both isolated words and extracted words, especially when the
word did not begin with a strong syllable. In the second experiment, the
extracted words and the corresponding time-scale expanded words were presented.
The results indicated that the expanded words were better recognized, and
that time-scale modification of the extracted words did not reduce intelligibility
even at expansion ratios around 2.00, as was clear from the improvement
in recognition.
Authors:
Anne-Marie Öster, Department of Speech, Music and Hearing, KTH (Sweden)
Page (NA) Paper number 256
Abstract:
Teaching strategies and positive results from training of both perception
and production of spoken Swedish with 13 immigrants are reported. The
learners participated in six training sessions lasting thirty minutes,
twice a week. The training had a positive effect on the L2-speakers'
perception and production of individual Swedish sounds, stress, intonation
and rhythm. The positive results were obtained through auditorily and
visually contrastive feedback provided by a PC running the IBM SpeechViewer
software. Skill-building modules were used together with the Speech
Patterning Module "Pitch and Loudness", which displays the speech signal
as graphical curves and diagrams. A split screen offered a comparison
of the student's production with a correct model by the teacher. Pitch
and loudness were displayed either separately or combined.
Authors:
Dominiek Sandra, University of Antwerp (Belgium)
Steven Gillis, University of Antwerp (Belgium)
Page (NA) Paper number 1106
Abstract:
Children of three different ages (five, eight, and ten years old) were
asked to syllabify a list of auditorily presented words. The list
composition was such that the effect of different knowledge sources
on the children's intuitive syllabification could be assessed: the
relative importance of language-universal versus language-specific
phonological constraints, the effect of morphological complexity, and
the effect of orthographic knowledge. The results indicate that five-year-old
children are already aware of language-specific constraints and
are sensitive to the phonological distinction between continuant and
non-continuant consonants. Literate children (eight and ten years old)
are influenced in their syllabification behavior by their orthographic
knowledge, i.e. once children have reached the literate stage it is
difficult for them to separate phonological and orthographic knowledge
in this phonological task. Finally, children in all three age groups
did not syllabify singulars differently than phonologically closely
matched plurals.
Authors:
Ayako Shirose, Department of Cognitive Sciences, Graduate school of Medicine, University of Tokyo (Japan)
Haruo Kubozono, Kobe University (Japan)
Shigeru Kiritani, University of Tokyo (Japan)
Page (NA) Paper number 1107
Abstract:
This paper reports the results of research on the process of acquisition
of Japanese compound accent rules by children aged 4-5. The results
reveal: 1) Children acquire general rules before they acquire lexically
idiosyncratic rules. 2) Children failed to retain the accent of the
second element, and instead placed an incorrect accent on the penultimate
foot. This result suggests that children acquire accent placement on
the penultimate foot prior to retaining the lexical accent of the second
element. We discuss a similarity between this result and a constraint-reranking
phenomenon in adult phonology. 3) The syllable, which plays an important
role in adults' compound accent (CA) rules, does not contribute to the
CA rules in children's phonology. We assume that children have not yet
acquired sufficient understanding of the syllable for it to contribute
to the CA rules.
Authors:
Lydia K.H. So, The University of Hong Kong (Hong Kong)
Zhou Jing, The University of Hong Kong (Hong Kong)
Page (NA) Paper number 956
Abstract:
This paper reports the phoneme repertoires and phonological error patterns
of 600 Chinese-speaking children aged 2;0 to 7;0. The findings support
the hypotheses that phonological acquisition is influenced by the ambient
language and the mother tongue.
Authors:
Kaoru Tomita-Nakayama, Yamagata University (Japan)
Kazuo Nakayama, Yamagata University (Japan)
Masayuki Misaki, Matsushita Electrical Industries Co. Ltd. (Japan)
Page (NA) Paper number 180
Abstract:
This study demonstrated that time-scale expansion of speech with constant
pitch (henceforth, expanded speech) enhanced speech recognition by Japanese
learners of English, in contrast to previous studies in which time-scale
expanded speech did not contribute to speech recognition, chiefly because
of severe distortion of the original speech and pitch change. Experiments
were administered with stimuli of original normal speech and the corresponding
expanded speech. The results showed that the expanded speech stimuli were
intelligible to many of the subjects. Our hypotheses are that expanded
speech enhances listeners' speech processing and also enables listeners
to call virtual memory capacity into play for on-line speech processing,
effects that are more apparent in a longer stimulus. Expanded speech worked
well for most subjects; other prescriptions should be prepared for the
remaining subjects, for whom expanded speech alone was not very effective.
Authors:
Volker Warnke, University of Erlangen (Germany)
Elmar Nöth, University of Erlangen (Germany)
Jan Buckow, University of Erlangen (Germany)
Stefan Harbeck, University of Erlangen (Germany)
Heinrich Niemann, University of Erlangen (Germany)
Page (NA) Paper number 316
Abstract:
In this paper, we present a bootstrap training approach for language
model (LM) classifiers. By training class-dependent LMs and running them
in parallel, LMs can serve as classifiers for any kind of symbol sequence,
e.g., word or phoneme sequences, for tasks like topic spotting or language
identification (LID). Irrespective of the particular symbol sequence used
for an LM classifier, the LM is trained on a manually labeled training
set for each class, obtained from not necessarily cooperative speakers.
Therefore, we have to face some erroneous labels and deviations from
the originally intended class specification. Both facts can worsen
classification. It might therefore be better not to use all utterances
for training but to automatically select those utterances that improve
recognition accuracy; this can be done by a bootstrap procedure. We
present the results achieved with our best approach on the VERBMOBIL
corpus for the tasks of dialog act classification and LID.
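The classification scheme itself, setting aside the bootstrap selection, can be sketched compactly: one class-dependent LM per class, run in parallel, with the sequence assigned to the class whose LM scores it highest. The bigram models and add-one smoothing below are simplifications standing in for the authors' models.

```python
# A toy version of parallel LM classification: train one class-dependent
# bigram LM per class; a symbol sequence is assigned to the class whose
# LM gives it the highest log-probability. Add-one smoothing is a
# simplification; the authors' models and training set-up differ.
import math
from collections import defaultdict

def train_bigram(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(["<s>"] + seq, seq):  # bigrams incl. sentence start
            counts[a][b] += 1
    return counts

def logprob(counts, seq, vocab_size):
    lp = 0.0
    for a, b in zip(["<s>"] + seq, seq):
        c = counts.get(a, {})
        lp += math.log((c.get(b, 0) + 1) / (sum(c.values()) + vocab_size))
    return lp

def classify(lms, seq, vocab_size):
    return max(lms, key=lambda cls: logprob(lms[cls], seq, vocab_size))
```

The same machinery applies unchanged whether the symbols are words, phonemes, or dialog-act-labeled tokens, which is why one framework covers both topic spotting and LID.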
Authors:
Sandra P. Whiteside, University of Sheffield (U.K.)
Jeni Marshall, University of Sheffield (U.K.)
Page (NA) Paper number 154
Abstract:
Voice onset time (VOT) is a key temporal feature in spoken language.
There is some evidence to suggest that there are sex differences in
VOT patterns. The cause of these sex differences could be attributed
to sexual dimorphism of the vocal apparatus. There is also some evidence
to suggest that phonetic sex differences could also be attributed to
learned stylistic and linguistic factors. This study reports on an
investigation into the VOT patterns for /p b t d/ in a group of thirty
children aged 7 (n=10), 9 (n=10) and 11 (n=10) years, with equal numbers
of girls (n=5) and boys (n=5) in each age group. Age and sex differences
were examined in the VOT data. Age, sex and age-by-sex interactions
were found. The results are presented and discussed.
Authors:
Sandra P. Whiteside, University of Sheffield (U.K.)
Carolyn Hodgson, University of Sheffield (U.K.)
Page (NA) Paper number 155
Abstract:
The process of the development of fine motor speech skills co-occurs
with the maturation of the vocal apparatus. This brief study presents
some acoustic phonetic characteristics of the speech of twenty pre-adolescent
(6-, 8- and 10-year-olds) boys and girls. The speech data were elicited
via a picture-naming task. Both age and sex differences in the acoustic
phonetic characteristics of selected vowels and consonants are examined.
The acoustic phonetic characteristics that were investigated included
formant frequency values and coarticulation (or gestural overlap) patterns.
Age, sex and age-by-sex differences for the acoustic phonetic characteristics
are presented and discussed for the data with reference to speech development
and the sexual dimorphism of the vocal apparatus.
Authors:
Lisa-Jane Brown, Department of Human Communication Science, Sheffield University, Claremont Crescent, Sheffield (U.K.)
John Locke, Department of Human Communication Science, Sheffield University, Claremont Crescent, Sheffield (U.K.)
Peter Jones, Sheffield (Hallam) University (U.K.)
Sandra P. Whiteside, Department of Human Communication Science, Claremont Crescent, Sheffield (U.K.)
Page (NA) Paper number 791
Abstract:
The atypical linguistic processing and cognitive development of previously
institutionalised, adopted Romanian children are being researched using
a neurolinguistic theory of development. Of particular concern is
the Critical Period Hypothesis, which holds that language capacity can
only develop in response to relevant stimulation during a pre-determined
period in childhood. The research impetus derives from the need to
understand the course of first language acquisition in children who
have suffered extreme deprivation at an early age. The purpose of this
paper is to attempt to analyse what these children can tell us about
the potential for language development in the face of such deprived
circumstances. In order to examine this, a theory of neurolinguistic
development will be applied to the case study of a formerly institutionalised
Romanian child, Maria. A key question will be addressed: Has Maria's
early deprivation set for her an irreversible path in terms of attaining
normal language development?
Authors:
Geoff Williams, SOAS, University of London/RMS Inc (USA)
Mark Terry, RMS Inc (USA)
Jonathan Kaye, SOAS, University of London (U.K.)
Page (NA) Paper number 622
Abstract:
This paper proposes a novel architecture for language-independent ASR
based on government phonology (GP). We use experimental data to show
that phoneme-based recognisers perform poorly on languages other than
the original target, rendering such systems inadequate for multi-lingual
speech recognition, a result we attribute to the inadequacy of the
phoneme as a linguistic unit. In the proposed GP model, recognition
targets are a small set of sub-segmental primes, or "elements", found
in all languages, which have been previously shown to be robustly detected
in a language-independent manner. Well-formedness constraints are captured
by simple parameter settings which can be easily encoded as rules and
applied as top-down constraints in a speech recogniser. Hence, given
a set of trained element detectors, a recogniser for any given language
can in principle be rapidly built by selection of the appropriate lexicon
and constraints. We describe the design of experimental architectures
for our GP-based system.
Authors:
Claudio Zmarich, CNR-Istituto di Fonetica e Dialettologia, Padova (Italy)
Roberta Lanni, CNR-Istituto di Fonetica e Dialettologia, Padova (Italy)
Page (NA) Paper number 1004
Abstract:
This single case study aims to combine the auditory assessment method
with the precision offered by the instrumental measurement of acoustic
characteristics, in order to investigate the phonetic aspect of early
speech development, namely babbling and early words. While general
progress may be gauged by the increasing prevalence of CV syllables
within the global repertory of utterances, the aspects that best reveal
the influence of a target language include the frequency of occurrence
of vowel types, especially when classified along the front-back dimension,
in combination with an expansion and refinement of phonotactic possibilities.
Further, acoustic and articulatory evidence reveals an initial tendency
for more control of the height dimension than of the front/back dimension.
The patterns of C-V associations suggest that the child
develops from a babbling phase characterized by the overwhelming prevalence
of front articulations, to a phase characterized by the presence of
the first words, where the patterns predicted by the MacNeilage and
Davis theory occur, perhaps owing to the presence of the same patterns
in the target lexicon.
Authors:
Roland Kuhn, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Jean-Claude Junqua, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Philip D. Martzen, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Page (NA) Paper number 304
Abstract:
Building on earlier work, we show how a set of binary decision trees
can be trained to generate an ordered list of possible pronunciations
from a spelled word. Training is carried out on a database consisting
of spelled words paired with their pronunciations (in a particular
language). We show how phonotactic information can be learned by a
second set of decision trees, which reorder the multiple pronunciations
generated by the first set. The paper defines the ``inclusion'' metric
for scoring phoneticizers that generate multiple pronunciations. Experimental
results employing this metric indicate that phonotactic reordering
yields a slight improvement when only the top pronunciation is retained,
and a large improvement when more than one hypothesis is retained.
Isolated-word recognition results which show good performance for automatically-generated
pronunciations are given.
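The abstract does not give the exact definition of the ``inclusion'' metric, but one plausible reading for a phoneticizer that outputs a ranked pronunciation list is the fraction of test words whose reference pronunciation appears among the top n hypotheses. The sketch below implements that reading; the function name and data layout are illustrative assumptions.

```python
# One plausible reading of an inclusion-style score: the fraction of test
# words whose reference pronunciation appears among the top n generated
# hypotheses. (The paper's exact definition may differ; this is an
# illustration only.)

def inclusion(hypotheses, references, n):
    """hypotheses: {word: ranked list of pronunciations};
       references: {word: correct pronunciation}."""
    hits = sum(1 for word, ref in references.items()
               if ref in hypotheses.get(word, [])[:n])
    return hits / len(references)
```

Under this reading, retaining more hypotheses can only raise the score, which is consistent with the abstract's observation that keeping more than one pronunciation yields a large improvement.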