Authors:
Andrew P. Breen, BT Labs, Ecole Nationale Superieure des Telecommunications de Bretagne (France)
O. Gloaguen, BT Labs, Ecole Nationale Superieure des Telecommunications de Bretagne (France)
P. Stern, BT Labs, Ecole Nationale Superieure des Telecommunications de Bretagne (France)
Page (NA) Paper number 390
Abstract:
The subject of computer generated virtual characters is a diverse and
rapidly developing field, with a wide variety of applications in industries
as varied as entertainment, education and advertising. Many of these
applications require or would be greatly enhanced by having the virtual
characters speak with the recorded voice of a real person. Such an
ability is particularly useful in applications where users are interacting
via avatars in real time in a virtual world. There are three basic
problems which need to be addressed when developing an interface with
this functionality:
*) The process must be capable of animating mouth shapes in real time.
*) The process should not mouth extraneous sounds such as music or doors
slamming; to do so would diminish the effectiveness of the illusion.
*) The mouth shapes produced by the avatar should approximate those of
the speaker.
This paper describes a series of experiments which attempt to address
each of the points outlined above. The experimental procedures are based
around a real-time, low-computation approach which relies on a particular
variety of neural network known as the Single Layer Look Up Perceptron (SLLUP).
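The abstract does not spell out the SLLUP's formulation; purely as an illustrative stand-in, the sketch below uses a plain single-layer perceptron that maps a per-frame audio feature vector to a small set of mouth-shape parameters, keeping the per-frame cost to one matrix-vector product and thus compatible with the real-time requirement above.
```python
# Illustrative stand-in only: the SLLUP's exact look-up structure is not given
# in the abstract, so a plain single-layer perceptron is used here to map a
# per-frame audio feature vector to a small set of mouth-shape parameters.
import numpy as np

class SingleLayerMouthMapper:
    def __init__(self, n_features, n_mouth_params, learning_rate=0.01):
        self.W = np.zeros((n_mouth_params, n_features))
        self.b = np.zeros(n_mouth_params)
        self.lr = learning_rate

    def predict(self, features):
        # One matrix-vector product per audio frame: cheap enough for real time.
        return self.W @ features + self.b

    def train_step(self, features, target_mouth_params):
        # Delta-rule update towards hand-labelled mouth shapes.
        error = target_mouth_params - self.predict(features)
        self.W += self.lr * np.outer(error, features)
        self.b += self.lr * error
        return float(np.mean(error ** 2))
```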
Authors:
Phil R. Cohen, Oregon Graduate Institute (USA)
Michael Johnston, Oregon Graduate Institute (USA)
David McGee, Oregon Graduate Institute (USA)
Sharon L. Oviatt, Oregon Graduate Institute (USA)
Joshua Clow, Oregon Graduate Institute (USA)
Ira Smith, Oregon Graduate Institute (USA)
Page (NA) Paper number 571
Abstract:
This paper reports on a case study comparison of a direct-manipulation-based
graphical user interface (GUI) with the QuickSet pen/voice multimodal
interface for supporting the task of military force "laydown." In this
task, a user places military units and "control measures," such as
various types of lines, obstacles, objectives, etc., on a map. A military
expert designed his own scenario and entered it via both interfaces.
Use of QuickSet led to a 3.2- to 8.7-fold speed improvement, depending
on the kind of object being created. These results suggest that there
may be substantial efficiency advantages to using multimodal interaction
over GUIs for map-based tasks.
Authors:
László Czap, University of Miskolc, Department of Automation (Hungary)
Page (NA) Paper number 445
Abstract:
Some research questions regarding speech perception can only be
answered with natural speech stimuli, especially in noisy environments.
In this paper we address several questions about the visual support
of the audio signal in speech recognition. How much support can the
video signal give to the audio signal? What is the impact of the nature
of the noise? How does visual information help to identify the place
of articulation? Do voices with different classes of excitation receive
the same visual support? To answer these questions we performed an
intelligibility study on consonants placed between identical vowels,
presented with or without the speaker's image at different signal-to-noise
ratios. The noise was either white noise or a mixture of other speakers'
voices.
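The abstract does not detail how the noisy stimuli were prepared; as a general illustration of presenting speech at a chosen signal-to-noise ratio, the sketch below mixes a noise recording (white noise or competing speech) into a speech signal at a requested SNR. All names are illustrative.
```python
# Illustrative only: mixes a noise recording into a speech signal at a
# requested signal-to-noise ratio. The paper's actual stimulus preparation is
# not described in the abstract.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    n = min(len(speech), len(noise))
    speech, noise = np.asarray(speech[:n], float), np.asarray(noise[:n], float)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```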
Authors:
Simon Downey, BT Labs (U.K.)
Andrew P. Breen, BT Labs (U.K.)
Maria Fernandez, BT Labs (U.K.)
Edward Kaneen, BT Labs (U.K.)
Page (NA) Paper number 391
Abstract:
Recent developments in distributed system processing have opened the
door to running highly complex systems across a number of networked
computers. This enables the complexity of a system to be hidden behind
a small, lightweight user interface - for example a downloadable web
page. The Maya system makes use of such interfaces to combine the functionality
of speech recognition, synthesis, robust parsing, text generation and
dialogue management into a highly flexible multimodal architecture,
working in real time. This paper describes the development of the
architecture and interfaces to each system component. The configuration
of the system to particular tasks is discussed, making use of an email
secretary task as an example. Once configured, the system is able to
provide all the functionality of a conventional email system and extend
these capabilities by allowing complex queries to be made about mail
messages.
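The Maya message formats and endpoints are not given in the abstract; purely as a hypothetical illustration of the "lightweight client, heavy server" idea, the sketch below sends a user utterance from a small client to a server-side dialogue manager and reads back one reply. Host, port and field names are invented for the example.
```python
# Hypothetical illustration: none of these names (host, port, message fields)
# come from the paper; they only show a thin client handing a user turn to a
# server-side dialogue manager over a socket and reading back one JSON reply.
import json
import socket

def send_user_turn(text, host="localhost", port=9000):
    request = json.dumps({"type": "user_turn", "text": text}).encode("utf-8")
    with socket.create_connection((host, port)) as conn:
        conn.sendall(request + b"\n")
        reply = conn.makefile().readline()
    return json.loads(reply)

# Example of the kind of complex query an email secretary task might accept:
# send_user_turn("Do I have any messages from Maria about the project review?")
```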
Authors:
Mauro Cettolo, ITC-IRST (Italy)
Daniele Falavigna, ITC-IRST (Italy)
Page (NA) Paper number 332
Abstract:
The work reported in this paper concerns the assessment of a set
of modifications applied to the continuous speech recognizer developed
at IRST for dictation tasks. The objective of the proposed modifications
is to improve the recognizer's performance on a corpus of spontaneously
uttered human-human dialogues. Solutions are given to increase
Automatic Speech Recognition (ASR) robustness with respect to typical
spontaneous speech phenomena such as breaths, coughs, filled and silent
pauses, and speaking rate variations. Both gender-independent and
gender-dependent models are used. Specific models of extra-linguistic
phenomena are trained, and a method for coping with speaking rate
variations is proposed. Different recognizers, corresponding to male
and female speakers and to various speaking rate factors, are combined
so that a single search space is defined. The best performance, obtained
on a corpus of hundreds of spontaneously uttered person-to-person
dialogues, is a 26.1% Word Error Rate (WER).
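The paper combines the gender- and speaking-rate-specific recognizers into a single search space; as a simpler stand-in for the same idea of letting the system choose among specialized model sets, this hypothetical sketch decodes an utterance with each recognizer separately and keeps the best-scoring hypothesis. The recognizer objects and their decode() method are assumed, not part of the described system.
```python
# Simplified stand-in (not the paper's single-search-space construction):
# decode an utterance with each gender/speaking-rate-specific recognizer and
# keep the best-scoring hypothesis. Recognizer objects and their
# decode(utterance) -> (hypothesis, log_likelihood) method are assumed.
def decode_with_model_pool(utterance, recognizers):
    best_label, best_hyp, best_score = None, None, float("-inf")
    for label, recognizer in recognizers.items():
        hyp, score = recognizer.decode(utterance)
        if score > best_score:
            best_label, best_hyp, best_score = label, hyp, score
    return best_label, best_hyp

# recognizers might map labels such as "female_fast" or "male_slow" to models.
```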
Authors:
Georg Fries, Deutsche Telekom Berkom GmbH (Germany)
Stefan Feldes, Deutsche Telekom Berkom GmbH (Germany)
Alfred Corbet, Deutsche Telekom Berkom GmbH (Germany)
Page (NA) Paper number 1024
Abstract:
Taking into account that the user acceptance of an animated agent is
influenced by different criteria, including appropriateness of the
application domain, quality of application design, and quality of character
design, we have developed a demo application in a domain where we found
the agent substantially helpful - a web-based city guide. The animated
character interactively guides the user through some sights of the
city of Darmstadt. It can display and explain how to get to different
places of interest by moving around and pointing at locations on a
map. The system allows input via mouse clicks, speech and typed text.
Output modalities of the agent are speech, gesture, text (cartoon word
balloons) and some facial expressions. Furthermore, we have designed two
new 3D characters. We describe experiences gained during system development
and discuss design aspects concerning the application as well as character
animation.
Authors:
Rika Kanzaki, ATR Human Information Processing Research Laboratories (Japan)
Takashi Kato, Kansai University (Japan)
Page (NA) Paper number 243
Abstract:
To investigate the nature of facial information involved in the integration
of audiovisual speech perception, we examined the influence of facial
views on the McGurk effect under two auditory conditions. While the
speech perception of most of the audiovisual syllables used was little
affected by the facial views and the auditory noise, a stronger McGurk
effect was obtained for the 3/4-view image uttering labial sounds when
presented with auditory alveolar syllables under auditory noise. However,
the facial view did not affect the visual identification of labials
in the same way. These results suggest that information about whether a
sound is labial or nonlabial is not the only facial information involved
in the McGurk effect. It appears that some other information, available
only in the 3/4-view image, might also be involved. Implications for
the processing of visual, auditory and audiovisual speech are discussed.
Authors:
Tom Brøndsted, Aalborg University (Denmark)
Lars Bo Larsen, Aalborg University (Denmark)
Michael Manthey, Aalborg University (Denmark)
Paul McKevitt, Aalborg University (Denmark)
Thomas B. Moeslund, Aalborg University (Denmark)
Kristian G. Olesen, Aalborg University (Denmark)
Page (NA) Paper number 767
Abstract:
This paper presents a generic environment for intelligent multimedia
applications, denoted "The Intellimedia Work-Bench". The aim
of the workbench is to facilitate development and research within the
field of multimodal user interaction. Physically, it is a table with
various devices mounted above and around it. These include a camera and
a laser projector mounted above the workbench, a microphone array mounted
on the walls of the room, a speech recogniser and a speech synthesiser.
The camera is attached to a vision system capable of locating various
objects placed on the workbench. The paper presents two applications
utilising the workbench. One is a campus information system, allowing
the user to ask for directions within a part of the university campus.
The second application is a pool trainer, intended to provide guidance
to novice players.
Authors:
Joshua Clow, Oregon Graduate Institute (USA)
Sharon L. Oviatt, Oregon Graduate Institute (USA)
Page (NA) Paper number 50
Abstract:
In this paper we describe STAMP, a new automated suite of tools for
capturing and analyzing data on multimodal systems. STAMP is designed
to support research and development efforts for advancing next-generation
multimodal systems. STAMP permits researchers to analyze multimodal
system performance by: (1) recording data on users' multimodal input
and the system's responses, (2) supporting flexible replay of these
multimodal commands, along with n-best recognition lists for the individual
modalities and their combined multimodal interpretation, and (3) supporting
automated analysis using different metrics of multimodal system performance.
This collection of tools is currently being used to conduct basic research
on the characteristics of multimodal systems, and also to iterate on different
aspects of the QuickSet multimodal architecture.
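STAMP's actual data format is not described here; as a hypothetical illustration of the kind of record that points (1)-(3) imply, the sketch below logs one multimodal command together with per-modality n-best lists, the fused interpretation and the system response, so it can be replayed and scored later.
```python
# Hypothetical record format, not STAMP's actual one: one logged multimodal
# command with per-modality n-best lists, the fused interpretation and the
# system response, suitable for later replay and scoring.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ModalityNBest:
    modality: str                          # e.g. "speech" or "pen"
    hypotheses: List[Tuple[str, float]]    # (hypothesis, score), best first

@dataclass
class MultimodalCommand:
    timestamp: float
    inputs: List[ModalityNBest]
    fused_interpretation: str              # combined multimodal interpretation
    system_response: str

    def top_hypotheses(self):
        # Convenience view used when computing accuracy-style metrics.
        return {m.modality: m.hypotheses[0][0] for m in self.inputs if m.hypotheses}
```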
Authors:
Sumi Shigeno, Kitasato University (Japan)
Page (NA) Paper number 1057
Abstract:
Cultural similarities and differences were compared between Japanese
and North American subjects in the recognition of emotion. Seven native
Japanese and five native North American subjects (four Americans and
one Canadian) participated in the experiments. The materials were five
meaningful words or short sentences in Japanese and American English.
Japanese and American actors produced vocal and facial expressions in
order to convey six basic emotions: happiness, surprise, anger, disgust,
fear, and sadness. Three presentation conditions were used: auditory,
visual, and audio-visual. The audio-visual stimuli were made by dubbing
the auditory stimuli onto the visual stimuli. The results show that: (1)
subjects can more easily recognize the vocal expression of a speaker
who belongs to their own culture, (2) Japanese subjects are not good
at recognizing 'fear' in both the auditory-alone and visual-alone conditions,
and (3) both Japanese and American subjects identify audio-visually
incongruent stimuli more often with the visual label than with the
auditory label. These results suggest that it is difficult to identify
the emotion of a speaker from a different culture and that people
predominantly use visual information to identify emotion.
Authors:
Toshiyuki Takezawa, ATR Interpreting Telecommunications Research Laboratories (Japan)
Tsuyoshi Morimoto, Fukuoka University (Japan)
Page (NA) Paper number 958
Abstract:
We have built a multimodal-input multimedia-output guidance system
called MMGS. A user's input can be a combination of speech and
hand-written gestures. The system, in turn, outputs a response
that combines speech, three-dimensional graphics, and/or other information.
This system can interact cooperatively with the user by resolving
ellipsis/anaphora and various ambiguities, such as those caused by speech
recognition errors. It is currently implemented on an SGI workstation
and achieves nearly real-time processing.
Authors:
Oscar Vanegas, Nagoya Institute of Technology (Japan)
Akiji Tanaka, Nagoya Institute of Technology (Japan)
Keiichi Tokuda, Nagoya Institute of Technology (Japan)
Tadashi Kitamura, Nagoya Institute of Technology (Japan)
Page (NA) Paper number 789
Abstract:
This paper describes intensity and location normalization techniques
for improving the performance of visual speech recognizers used in
audio-visual speech recognition. For auditory speech recognition,
there exist many methods for dealing with channel characteristics and
speaker individualities, e.g., CMN (cepstral mean normalization), SAT
(speaker adaptive training). We present two techniques similar to
CMN and SAT, respectively, for intensity and location normalization
in visual speech recognition. For the intensity normalization, the
mean value over the image sequence is subtracted from each pixel of
the image sequence. For the location normalization, the training and
testing processes are carried out by finding, for each utterance, the
lip position with the highest likelihood with respect to the HMMs. Word
recognition experiments based on HMMs show that a significant improvement in recognition
performance is achieved by combining the two techniques.
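As a minimal sketch of the intensity normalization described above (assuming the per-pixel mean over the image sequence is the quantity being subtracted, by analogy with cepstral mean normalization):
```python
# Minimal sketch: subtract the per-pixel mean over the whole image sequence
# from every frame, analogously to cepstral mean normalization for audio.
import numpy as np

def intensity_normalize(frames):
    """frames: array of shape (num_frames, height, width) of lip-region images."""
    frames = np.asarray(frames, dtype=np.float64)
    mean_image = frames.mean(axis=0)       # per-pixel mean over the sequence
    return frames - mean_image
```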
Authors:
Yanjun Xu, Inst Acoustics, Chinese Acad Sci (China)
Limin Du, Inst Acoustics, Chinese Acad Sci (China)
Guoqiang Li, Inst Acoustics, Chinese Acad Sci (China)
Ziqiang Hou, Inst Acoustics, Chinese Acad Sci (China)
Page (NA) Paper number 187
Abstract:
Visual feature extraction has become the key technique in automatic
speechreading systems. However, it remains a difficult problem
due to large inter-person and intra-person appearance variability.
In this paper, we extend the standard active shape model to a hierarchical
probability-based framework, which can model a complex shape such
as the human face. It decomposes the complex shape into two layers:
a global shape, comprising the position, scale and rotation of local
shapes (such as the eyes, nose, mouth and chin), and the local simple
shapes in normalized form. The two layers describe the global and local
variation respectively, and are combined in a probabilistic framework.
The method can perform fully automatic facial feature location for speechreading,
or face recognition.
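As an illustrative sketch of the two-layer decomposition described above, the following code stores each local shape in normalized form and places it with a global similarity transform (position, scale, rotation). The class and field names are invented for the example.
```python
# Illustrative names only: a local shape stored in normalized form plus a
# global placement (position, scale, rotation) that maps it into the image.
from dataclasses import dataclass
import numpy as np

@dataclass
class LocalShape:
    name: str                 # e.g. "mouth", "left_eye"
    points: np.ndarray        # (N, 2) landmark points in normalized form

@dataclass
class GlobalPlacement:
    position: np.ndarray      # (2,) translation in the image
    scale: float
    rotation: float           # radians

    def apply(self, shape: LocalShape) -> np.ndarray:
        # Similarity transform: rotate and scale the normalized points,
        # then translate them to the global position.
        c, s = np.cos(self.rotation), np.sin(self.rotation)
        R = np.array([[c, -s], [s, c]])
        return self.scale * shape.points @ R.T + self.position
```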
Authors:
Jörn Ostermann, AT&T Labs - Research (USA)
Mark C. Beutnagel, AT&T Labs - Research (USA)
Ariel Fischer, Eurecom/EPFL (France)
Yao Wang, Polytechnic University, Brooklyn (USA)
Page (NA) Paper number 931
Abstract:
The integration of text-to-speech (TTS) synthesis and animation of
synthetic faces allows new applications such as visual human-computer
interfaces using agents or avatars. The TTS informs the talking head
when phonemes are spoken. The appropriate mouth shapes are animated
and rendered while the TTS produces the sound. We call this integrated
system of TTS and animation a Visual TTS (VTTS). This paper describes
the architecture of an integrated VTTS synthesizer that allows facial
expressions to be defined as bookmarks in the text that will be animated
while the model is talking. The position of a bookmark in the text defines
the start time for the facial expression. The bookmark itself names
the expression, its amplitude and the duration within which the amplitude
has to be reached by the face. A bookmark-to-face-animation-parameter
(FAP) converter creates a curve defining the amplitude of the given
FAP over time using third-order Hermite functions [http://www.research.att.com/info/osterman].
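The exact tangent choices of the bookmark-to-FAP converter are not given in the abstract; as an illustrative sketch, the curve below uses one cubic Hermite segment with zero slopes at both ends, rising from the current amplitude at the bookmark's start time to the target amplitude over the stated duration.
```python
# Illustrative sketch: a single cubic Hermite segment with zero end slopes that
# ramps a FAP amplitude from its current value to the bookmark's target value
# over the bookmark's duration. The converter's actual tangents are not stated.
import numpy as np

def fap_amplitude_curve(start_time, duration, target_amplitude, t, start_amplitude=0.0):
    u = np.clip((np.asarray(t, dtype=float) - start_time) / duration, 0.0, 1.0)
    h01 = 3.0 * u**2 - 2.0 * u**3          # Hermite basis term with zero end slopes
    return start_amplitude + (target_amplitude - start_amplitude) * h01

# Example: a "smile" bookmark at t = 1.2 s reaching amplitude 0.8 within 0.5 s.
# times = np.linspace(0.0, 3.0, 301)
# curve = fap_amplitude_curve(1.2, 0.5, 0.8, times)
```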