Authors:
Levent M. Arslan, Entropic Inc. (USA)
David Talkin, Entropic Inc. (USA)
Page (NA) Paper number 110
Abstract:
A novel algorithm is proposed which generates three-dimensional face
point trajectories for a given speech file. The proposed algorithm
first employs an off-line training phase. In this phase, recorded
face point trajectories along with their speech data and phonetic labels
are used to generate phonetic codebooks. These codebooks consist of
both acoustic and visual features. During the synthesis stage, speech
input is rated in terms of its similarity to the codebook entries,
and a weight is assigned to each entry. If phonetic information about
the test speech is available, it is used to restrict the codebook search
to the few entries that are visually closest to the current phoneme.
These weights are then used to synthesize
the principal components of the face point trajectory. The performance
of the algorithm was tested on held-out data, and the synthesized face
point trajectories showed a correlation of 0.73 with true face point
trajectories.
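As a rough illustration of the synthesis stage described above, the following
Python sketch maps one acoustic frame to a face-point frame by weighting
codebook entries and combining their visual principal-component coefficients.
The softmax-style weighting rule, the temperature parameter, and all variable
names are assumptions for illustration; the abstract does not specify the
exact weighting or reconstruction details.

# Minimal sketch of the codebook-weighting synthesis stage (assumed
# softmax-style weighting of negative acoustic distances and a weighted
# sum of per-entry PCA coefficients).
import numpy as np

def synthesize_frame(acoustic_feat, codebook_acoustic, codebook_visual_pca,
                     pca_basis, pca_mean, temperature=1.0):
    """Map one acoustic frame to 3-D face points via codebook weights."""
    # Distance of the input frame to every acoustic codebook centroid.
    dists = np.linalg.norm(codebook_acoustic - acoustic_feat, axis=1)
    # Closer entries receive larger weights (hypothetical softmax rule).
    weights = np.exp(-dists / temperature)
    weights /= weights.sum()
    # Weighted combination of the visual principal-component coefficients.
    pca_coeffs = weights @ codebook_visual_pca
    # Reconstruct the face-point frame from the principal components.
    return pca_mean + pca_coeffs @ pca_basis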
Authors:
Eli Yamamoto, Nara Institute of Science and Technology (Japan)
Satoshi Nakamura, Nara Institute of Science and Technology (Japan)
Kiyohiro Shikano, Nara Institute of Science and Technology (Japan)
Page (NA) Paper number 756
Abstract:
This paper proposes a method to re-estimate output visual parameters
for speech-to-lip movement synthesis using audio-visual Hidden Markov
Models (HMMs) under the Expectation-Maximization (EM) algorithm. Among
conventional methods for speech-to-lip movement synthesis, one approach
estimates a visual parameter sequence through Viterbi alignment of the
input acoustic speech signal against audio HMMs. However, this HMM-Viterbi
method has a substantial problem: an incorrect HMM state alignment may
produce incorrect visual parameters. The problem arises from the deterministic
synthesis process, which assigns a single HMM state to each input audio
frame. The proposed method avoids this deterministic process by re-estimating
the visual parameters non-deterministically while maximizing the likelihood
of the audio-visual observation sequence under the EM algorithm.
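The contrast between hard Viterbi assignment and posterior-weighted estimation
can be sketched as follows in Python. The sketch assumes per-frame audio state
likelihoods, a transition matrix, initial state probabilities, and per-state
visual means are already available, and it shows only a single E-step-style
pass; the iterative re-estimation described in the paper is omitted.

# Posterior-weighted visual parameters instead of a single Viterbi state
# per frame. Assumed inputs: b[t, s] audio likelihoods, transition matrix A,
# initial probabilities pi, per-state visual means mu_v[s].
import numpy as np

def posterior_visual_params(b, A, pi, mu_v):
    T, S = b.shape
    # Scaled forward-backward over the audio HMM to avoid underflow.
    alpha = np.zeros((T, S)); beta = np.zeros((T, S)); scale = np.zeros(T)
    alpha[0] = pi * b[0]; scale[0] = alpha[0].sum(); alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]
        scale[t] = alpha[t].sum(); alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (b[t + 1] * beta[t + 1])) / scale[t + 1]
    # Per-frame state posteriors (gamma), normalized for safety.
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Soft, posterior-weighted visual parameters rather than the mean of a
    # single hard-assigned state per frame as in the HMM-Viterbi method.
    return gamma @ mu_v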
Authors:
Deb Roy, MIT Media Laboratory (USA)
Alex Pentland, MIT Media Laboratory (USA)
Page (NA) Paper number 551
Abstract:
We present a computational model of sensory-grounded language acquisition.
Words are learned from naturally spoken multiword utterances paired
with color images of objects. Speech recognition and computer vision
algorithms are used to build representations of the input speech and
images. Words are learned by first clustering images along shape and
color dimensions. A search algorithm then finds speech segments within
the continuous multiword input speech that co-occur with each visual
cluster. The learned words can be used in a speech understanding task
to request images based on spoken descriptions and in a speech generation
task to automatically generate spoken descriptions of images. Although
simple in its current form, this model is a first step towards a more
complete, fully-grounded model of language acquisition. Practical applications
include adaptive human-machine interfaces based on spoken language
for information browsing, assistive technologies, education, and entertainment.
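A toy version of the cross-modal association step might look like the
following Python sketch, which clusters image descriptors with k-means and
scores candidate speech segments against each visual cluster by a simple
co-occurrence measure. The features, the segment representation, and the
PMI-style score are all illustrative assumptions and are much simpler than
the model described above.

# Hypothetical sketch: cluster images, then pick the speech segment most
# associated with each visual cluster.
import numpy as np
from sklearn.cluster import KMeans

def associate_words(image_feats, segment_ids, n_clusters=10):
    """image_feats[i]: shape/color descriptor of image i.
       segment_ids[i]: set of candidate speech-segment labels heard with image i."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(image_feats)
    vocab = sorted({s for ids in segment_ids for s in ids})
    best = {}
    for c in range(n_clusters):
        in_c = clusters == c
        scores = {}
        for s in vocab:
            has_s = np.array([s in ids for ids in segment_ids])
            # Pointwise-mutual-information-style co-occurrence score.
            p_joint = np.mean(in_c & has_s) + 1e-9
            scores[s] = np.log(p_joint / ((in_c.mean() + 1e-9) * (has_s.mean() + 1e-9)))
        # Keep the segment most strongly associated with this visual cluster.
        best[c] = max(scores, key=scores.get)
    return best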
Authors:
Stéphane Dupont, Faculte Polytechnique de Mons (FPMs) (Belgium)
Juergen Luettin, Institut Dalle Molle d'Intelligence Artificielle Perceptive (IDIAP) (Switzerland)
Page (NA) Paper number 582
Abstract:
The Multi-Stream automatic speech recognition approach was investigated
in this work as a framework for Audio-Visual data fusion and speech
recognition. This method offers several potential advantages for such
a task. In particular, it allows synchronous decoding of continuous speech
while still permitting some asynchrony between the visual and acoustic
information streams. First, the Multi-Stream formalism is briefly recalled.
Then, building on these motivations, experiments on the M2VTS multimodal
database are presented and discussed. To our knowledge,
these are the first experiments addressing multi-speaker continuous
Audio-Visual Speech Recognition (AVSR). It is shown that the Multi-Stream
approach can yield improved Audio-Visual speech recognition performance
both for clean speech and when the acoustic signal is corrupted by noise.
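A minimal sketch of frame-synchronous multi-stream recombination in Python,
assuming per-stream, per-state log-likelihoods are already computed and that
the streams are recombined at every frame with fixed exponential weights;
the Multi-Stream formalism also permits some asynchrony up to recombination
anchor points, which this sketch does not model.

# Log-linear (exponentially weighted) combination of audio and visual
# stream scores, followed by a standard Viterbi decode over the combined
# scores. Weights and shapes are illustrative assumptions.
import numpy as np

def multistream_loglik(logp_audio, logp_video, w_audio=0.7, w_video=0.3):
    """logp_*: arrays of shape (T, S) with per-frame, per-state log-likelihoods."""
    # Stream weights can be tuned, e.g. lowered for the audio stream in noise.
    return w_audio * logp_audio + w_video * logp_video

def viterbi(loglik, log_A, log_pi):
    """Synchronous decoding over the combined stream scores."""
    T, S = loglik.shape
    delta = log_pi + loglik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]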