Authors:
Ingrid Ahmer, University of South Australia (Australia)
Robin W. King, University of South Australia (Australia)
Paper number 419
Abstract:
The purpose of this research is to investigate methods for applying
speech recognition techniques to improve the productivity of off-line
captioning for television. We posit that existing corpora for training
continuous speech recognisers are unrepresentative of the acoustic
conditions of television soundtracks. To evaluate the use of application-specific
models for this task, we have developed a soundtrack corpus
(representing a single genre of television programming) for acoustic
analysis and a text corpus (from the same genre) for language modelling.
These corpora are built from components of the manual captioning process.
Captions were used to automatically segment and label the acoustic
soundtrack data at sentence level, with manual post-processing to classify
and verify the data. The text corpus was derived using automatic processing
from approximately 1 million words of caption text. The results confirm
the acoustic profile of the task to be characteristically different
to that of most other speech recognition tasks (with the soundtrack
corpus being almost devoid of clean speech). The text corpus indicates
that application-specific language modelling will be effective for
the chosen genre, although a lexicon providing complete lexical coverage
is unattainable. There is a high correspondence between captions and
soundtrack speech for the chosen genre, confirming that closed-captions
can be a useful data source for generating labelled acoustic data.
The corpora provide a high quality resource to support further research
into automated speech recognition.
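The caption-driven segmentation step described above can be outlined as follows. This is an illustrative sketch only, not the authors' code; the caption format, with start and end times in seconds, is an assumption.

```python
def segment_by_captions(samples, sample_rate, captions):
    """Cut a soundtrack into sentence-level, text-labelled segments.

    captions: list of (start_seconds, end_seconds, text) tuples.
    Returns a list of (text, samples) pairs, one per caption.
    """
    segments = []
    for start_s, end_s, text in captions:
        lo = int(start_s * sample_rate)
        hi = min(int(end_s * sample_rate), len(samples))
        segments.append((text, samples[lo:hi]))
    return segments

# Toy example: one second of "audio" at 10 Hz, two captions.
audio = list(range(10))
caps = [(0.0, 0.5, "first sentence"), (0.5, 1.0, "second sentence")]
labelled = segment_by_captions(audio, 10, caps)
```

In practice caption timings are only approximately aligned with the speech, which is why the manual post-processing and verification step described in the abstract remains necessary.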
Authors:
Fabrice Lefèvre, LIP6 (France)
Claude Montacié, LIP6 (France)
Marie-José Caraty, LIP6 (France)
Paper number 573
Abstract:
Delta coefficients are a conventional way to include temporal information
in speech recognition systems; in particular, they are widely used in
Gaussian HMM-based systems. Some attempts were made to introduce the
delta coefficients into the K-Nearest Neighbours (K-NN) HMM-based system
that we recently developed. Introducing the delta coefficients directly
into the representation space is shown to be unsuitable for the K-NN
probability density function (pdf) estimator. We therefore investigate
whether the delta coefficients could be used to improve the K-NN HMM-based
system in other ways. For this purpose, an analysis of the delta coefficients
in Gaussian HMM-based systems is proposed. It leads to the conclusion
that the delta coefficients also influence the recognition process.
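For reference, conventional delta coefficients are usually computed with a regression over neighbouring frames. The abstract does not give a formula; the sketch below uses the standard HTK-style definition, d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2 * sum n^2), shown for a single cepstral coefficient per frame:

```python
def delta(frames, N=2):
    """Delta coefficients via the standard regression formula,
    with edge frames replicated. `frames` is a list of scalars
    (one cepstral coefficient per frame)."""
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = []
    for t in range(T):
        num = sum(n * (frames[min(t + n, T - 1)] - frames[max(t - n, 0)])
                  for n in range(1, N + 1))
        deltas.append(num / denom)
    return deltas
```

On a linearly rising coefficient the interior deltas recover the slope, which is the sense in which the coefficients capture local temporal dynamics.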
Authors:
Raymond Low, The University of Western Australia (Australia)
Roberto Togneri, The University of Western Australia (Australia)
Paper number 645
Abstract:
A novel technique for speaker-independent automated speech recognition
is proposed. We take a segment-model approach to Automated Speech Recognition
(ASR), considering the trajectory of an utterance in vector space and
classifying it using a modified Probabilistic Neural Network (PNN) and
a maximum-likelihood rule. The system compares favourably with established
techniques, achieving in excess of 94% accuracy on isolated digit recognition,
88% on isolated alphabetic letters, and 83% on the confusable /e/ set.
A favourable compromise between recognition accuracy, memory usage and
speed can also be reached by performing clustering
on the training data for the PNN.
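The modified PNN and the segment-model details are specific to the paper, but the basic PNN idea, scoring each class with a Gaussian-kernel (Parzen) density estimate over its training samples and choosing the best-scoring class, can be sketched as follows. This is a minimal one-dimensional illustration, not the authors' system:

```python
import math

def pnn_classify(x, train, sigma=1.0):
    """train: dict mapping class label -> list of 1-D training samples.
    Score each class with a Parzen (Gaussian-kernel) density estimate
    at x and return the highest-scoring class."""
    best_cls, best_score = None, -1.0
    for cls, samples in train.items():
        score = sum(math.exp(-(x - s) ** 2 / (2 * sigma ** 2))
                    for s in samples) / len(samples)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls
```

Clustering the training data, as the abstract suggests, would replace each class's sample list with a smaller set of centroids, trading some accuracy for reduced memory and computation.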
Authors:
Imed Zitouni, LORIA / INRIA-Lorraine (France)
Paper number 727
Abstract:
In contrast to conventional n-gram approaches, which are the most widely
used language models in continuous speech recognition systems, the multigram
approach models a stream of variable-length sequences. To overcome
the independence assumption of the classical multigram model, we propose
in this paper a hierarchical model, called Mnv, which successively relaxes
this assumption. The estimation of the model parameters can be formulated
as a maximum likelihood estimation problem from incomplete data used
at different levels (j in 1...v). We show that estimates of the model
parameters can be computed through an iterative Expectation-Maximization
algorithm. A few experimental tests were carried out on a corpus extracted
from the French newspaper ``Le Monde''. Results show that Mnv outperforms
the baseline multigram and interpolated bigram models but is comparable
to the interpolated trigram model.
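As background, the classical multigram model the abstract builds on segments a token stream into independent variable-length sequences, and the most likely segmentation can be found with a Viterbi-style dynamic program. The sketch below shows only this baseline decoding, not the hierarchical Mnv; the sequence probabilities are toy values assumed for illustration:

```python
import math

def best_segmentation(tokens, probs, max_len=3):
    """Most likely split of `tokens` into variable-length sequences,
    assuming (as in the classical multigram model) that sequences are
    independent. probs: dict mapping token tuples to probabilities.
    Assumes the stream is coverable by sequences in `probs`."""
    n = len(tokens)
    best = [(-math.inf, None)] * (n + 1)   # (log-likelihood, backpointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for L in range(1, min(max_len, i) + 1):
            seq = tuple(tokens[i - L:i])
            if seq in probs and best[i - L][0] > -math.inf:
                score = best[i - L][0] + math.log(probs[seq])
                if score > best[i][0]:
                    best[i] = (score, i - L)
    # Backtrack from the end of the stream.
    segs, i = [], n
    while i > 0:
        j = best[i][1]
        segs.append(tuple(tokens[j:i]))
        i = j
    return segs[::-1]
```

In a full multigram system these probabilities are themselves re-estimated with EM over all possible segmentations, which is the incomplete-data formulation the abstract refers to.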
Authors:
Michiko Watanabe, University of Tokyo (Japan)
Paper number 812
Abstract:
In second language input studies, speaking speed is regarded as one
of the most influential factors in comprehension. However, research
in this area has mainly been conducted on written texts read aloud.
The present study investigated temporal variables, such as articulation
rate and ratio and frequency of fillers and silent pauses, in three
university lectures given in Japanese. It was found that the total
duration ratio of fillers was as great as that of silent pauses. It
also became clear that, for individual speakers, articulation rate
and frequency of fillers are relatively constant, while frequency of
silent pauses varies depending on discourse section. Of total pause
ratio, pause frequency and articulation rate, the last correlated
best with listener ratings of speech speed. The findings suggest that
spontaneous speech requires methods of speech speed measurement different
from those for read speech.
Authors:
Matthew Aylett, Human Communication Research Centre, University of Edinburgh (U.K.)
Paper number 823
Abstract:
Vowel space data (a two-dimensional F1/F2 plot) is of interest to phoneticians
for comparing different accents, languages, speaker styles and individual
speakers. Current automatic methods used by speech technologists do
not generally produce traditional vowel space models; instead they
tend to produce hyper-dimensional codebooks covering the speaker's
entire speech stream. This makes it difficult to relate results generated
by these methods to observations in laboratory phonetics.
To address these problems, a model was developed based on a Gaussian
mixture density function fitted using expectation maximisation
on F1/F2 data, producing a probability distribution in F1/F2 space.
Speech was pre-processed using voicing to automatically excerpt vowel
data without any need for segmentation and a parametric fit algorithm
was applied to calculate likely vowel targets. The result was a clear
visualisation of a speaker's vowel space requiring no segmented or
labelled speech.
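The core fitting step, a Gaussian mixture estimated on F1/F2 points by expectation maximisation, can be sketched as below. This is a generic two-component, diagonal-covariance EM on synthetic formant-like data, not the authors' parametric-fit algorithm; the initialisation scheme and the vowel formant values are assumptions for illustration.

```python
import math, random

def fit_gmm2(points, iters=30):
    """EM fit of a two-component, diagonal-covariance Gaussian mixture
    to 2-D (F1, F2) points. Returns (weights, means, variances)."""
    # Deterministic initialisation: the extreme points along F1.
    means = [list(min(points)), list(max(points))]
    varis = [[1e4, 1e4], [1e4, 1e4]]
    weights = [0.5, 0.5]

    def pdf(p, m, v):
        z = 1.0
        for d in range(2):
            z *= (math.exp(-(p[d] - m[d]) ** 2 / (2 * v[d]))
                  / math.sqrt(2 * math.pi * v[d]))
        return z

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for p in points:
            ws = [weights[j] * pdf(p, means[j], varis[j]) for j in range(2)]
            tot = sum(ws) or 1e-300
            resp.append([w / tot for w in ws])
        # M-step: re-estimate weights, means and variances.
        for j in range(2):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(points)
            for d in range(2):
                means[j][d] = sum(r[j] * p[d] for r, p in zip(resp, points)) / nj
                varis[j][d] = sum(r[j] * (p[d] - means[j][d]) ** 2
                                  for r, p in zip(resp, points)) / nj + 1e-6
    return weights, means, varis

# Synthetic two-vowel cloud: an /i/-like and an /a/-like cluster (values assumed).
rng = random.Random(1)
pts = ([(rng.gauss(300, 30), rng.gauss(2300, 100)) for _ in range(100)] +
       [(rng.gauss(700, 30), rng.gauss(1200, 100)) for _ in range(100)])
w, m, v = fit_gmm2(pts)
```

The fitted component means then serve as candidate vowel targets, which is the role the parametric fit plays in the abstract.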
Authors:
Michelle Minnick Fox, Department of Linguistics, University of Pennsylvania (USA)
Paper number 911
Abstract:
Programs for testing and training of difficult vowel distinctions in
American English were created for subjects to access via the Internet
using a web browser. The testing and training data include many likely
vowel confusions for speakers of different L1s. The training program
focuses on one distinction at a time, and adjusts to concentrate on
particular contexts or exemplars that are difficult for the individual
subject. In the current study, 52 subjects participated in testing
and 2 subjects participated in training. In the testing portion, results
indicate that the L1 and the fluency level in English, as well as individual
variability, have an effect on perceptual ability. In the training
portion, subjects showed improvement on the contrasts on which they
trained. Because these programs make extensive data collection over
large populations and large distances easy, this method of research
will facilitate further investigation of questions regarding second
language acquisition.
Authors:
Najam Malik, School of Electrical Engineering, The University of New South Wales, Sydney (Australia)
W. Harvey Holmes, School of Electrical Engineering, The University of New South Wales, Sydney (Australia)
Paper number 1026
Abstract:
Over frames of short time duration, filtered speech may be described
as a finite linear combination of sinusoidal components. In the case
of a frame of voiced speech the frequencies are considered to be harmonics
of a fundamental frequency. It can be assumed further that the speech
samples are observed in additive white noise of zero mean, resulting
in a standard signal-plus-noise model. This model has a nonlinear dependence
on the frequencies of the sinusoids but is linear in their coefficients.
We use subspace line spectral estimation methods of Pisarenko and Prony
type to estimate the frequencies and use the results in voiced-unvoiced
classification and pitch estimation, followed by analysis of the speech
waveform into its sinusoidal components.
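As an illustration of the Pisarenko-type estimation mentioned above, the single-sinusoid case has a closed form: with lag-1 and lag-2 sample autocorrelations r1 and r2 of a real sinusoid in white noise, cos(w) = (r2 + sqrt(r2^2 + 8*r1^2)) / (4*r1). The sketch below covers only this one-sinusoid case, not the authors' subspace implementation for multiple harmonics:

```python
import math

def pisarenko_freq(x):
    """Pisarenko frequency estimate (radians/sample) for a single real
    sinusoid in white noise, from lag-1 and lag-2 autocorrelations."""
    n = len(x)
    r1 = sum(x[i] * x[i + 1] for i in range(n - 1)) / (n - 1)
    r2 = sum(x[i] * x[i + 2] for i in range(n - 2)) / (n - 2)
    c = (r2 + math.sqrt(r2 ** 2 + 8 * r1 ** 2)) / (4 * r1)
    return math.acos(max(-1.0, min(1.0, c)))   # clamp against rounding
```

For voiced speech the estimated frequencies of several such components would be compared against a harmonic pattern, supporting the voiced/unvoiced decision and pitch estimate described in the abstract.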
Authors:
Petra Hansson, Lund University (Sweden)
Paper number 1042
Abstract:
Pauses in spontaneous speech have a less restricted distribution than
pauses in read discourse; however, they are not distributed in a haphazard
way. The majority of the perceived pauses in the examined Swedish spontaneous
speech material, 73%, occurred in one of the following positions: between
sentences, after discourse markers and conjunctions, and before accented
content words. There is a range of acoustic correlates of perceived
pauses in spontaneous speech, such as silent intervals, hesitation
sounds, prepausal lengthening, glottalization and specific F0 patterns.
The acoustic manifestation of a pause, e.g. the duration of the pause
and the F0 pattern associated with the pause, is to some extent dependent
on the pause's position and function.
Authors:
Elisabeth Zetterholm, Lund University, Dept. of Linguistics and Phonetics (Sweden)
Paper number 1043
Abstract:
Terms for voice quality or phonation types in normal speech often
come from studies of pathological speech (laryngeal settings), and
it is hard to describe voice quality, especially the variations of
a normal voice. In normal speech we use different voice qualities
for linguistic distinctions in some languages, prosodically as a boundary
signal, socially depending on social and regional variants, and
paralinguistically in attitudes and emotions. This paper shows
some reference types of voice qualities, recorded by a trained phonetician,
and their acoustic correlates. In a pilot study, a male actor recorded
four attitudinally neutral sentences using five different emotions,
which are compared to his neutral voice. It is evident that voice
quality, as well as rhythm and intonation, plays an important role
in giving the impression of different emotions.
Authors:
Julie Lunn, Queen Margaret College (U.K.)
Alan A. Wrench, Queen Margaret College (U.K.)
Janet Mackenzie Beck, Queen Margaret College (U.K.)
Paper number 1118
Abstract:
The production of /l/ is examined for pre- and post-operative patients
who have undergone surgery in three distinct areas (anterior, posterior
or lateral tongue) followed by radiotherapy and reconstruction. Results
show F1 and F2 to be raised after surgery in all cases. Normalised
measures of tongue height (F1-F0) and extension (F2-F1) revealed no
significant change after surgery to the side of the tongue but in the
other two categories, results indicated a change normally associated
with both raising and fronting of the tongue. The paper compares these
results with findings from other studies and considers possible mechanisms
for the observed changes.