Integrated Multimedia Processing and Human Computer Interface

Chair: Y. Wang, Polytechnic University, USA



Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition

Authors:

Gerasimos Potamianos, AT&T Labs (U.S.A.)
Hans Peter Graf, AT&T Labs (U.S.A.)

Volume 6, Page 3733, Paper number 1802

Abstract:

We propose the use of discriminative training by means of the generalized probabilistic descent (GPD) algorithm to estimate hidden Markov model (HMM) stream exponents for audio-visual speech recognition. Synchronized audio and visual features are used to train audio-only and visual-only single-stream HMMs of identical topology, respectively, by maximum likelihood. A two-stream HMM is then obtained by combining the two single-stream HMMs and introducing exponents that weight the log-likelihood of each stream. We present the GPD algorithm for stream exponent estimation, consider a possible initialization, and apply it to the single-speaker connected-letters task of the AT&T bimodal database. We demonstrate the superior performance of the resulting multi-stream HMM over the audio-only, visual-only, and audio-visual single-stream HMMs.
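
As a point of reference, the stream exponents act as weights on the per-stream log-likelihoods of the two-stream observation density; a minimal sketch of the combined score, with the exponent symbols assumed here rather than taken from the paper:

    \log b_j(o_t) = \lambda_A \, \log b_j^A(o_t^A) + \lambda_V \, \log b_j^V(o_t^V)

where b_j^A and b_j^V are the audio and visual observation densities of state j, and GPD adjusts \lambda_A and \lambda_V to minimize a smoothed recognition-error criterion rather than the likelihood.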

ic981802.pdf (From Postscript)




A Hybrid Real-Time Face Tracking System

Authors:

Ce Wang, Harvard University (U.S.A.)
Michael S Brandstein, Harvard University (U.S.A.)

Volume 6, Page 3737, Paper number 1917

Abstract:

A hybrid real-time face tracker based on both sound and visual cues is presented. Initial talker locations are estimated acoustically from microphone array data, while precise localization and tracking are derived from image information. A computationally efficient algorithm for face detection via motion analysis is employed to track individual faces at rates up to 30 frames per second. The system is robust to nonlinear source motions, complex backgrounds, varying lighting conditions, and a variety of source-camera depths. While the direct focus of this work is automated video conferencing, the face tracking capability has utility in many multimedia and virtual reality applications.
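
To make the motion-analysis step concrete, the following is a minimal frame-differencing sketch in Python with OpenCV 4 (an assumption; the paper's actual detector and the acoustic initialization are not reproduced here):

    import cv2

    def motion_regions(prev_gray, curr_gray, thresh=25):
        # Pixel-wise difference between consecutive grayscale frames.
        diff = cv2.absdiff(curr_gray, prev_gray)
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        # Dilate to merge nearby motion pixels into coherent blobs.
        mask = cv2.dilate(mask, None, iterations=2)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        # Bounding boxes of moving regions; a face-specific test would follow.
        return [cv2.boundingRect(c) for c in contours]

A tracker would run this at frame rate and hand the candidate regions to a face verification stage, which is where the paper's hybrid audio-visual design comes in.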

ic981917.pdf (From Postscript)




A Hidden Markov Model Framework for Video Segmentation Using Audio and Image Features

Authors:

John S Boreczky, FX Palo Alto Laboratory (U.S.A.)
Lynn D Wilcox, FX Palo Alto Laboratory (U.S.A.)

Volume 6, Page 3741, Paper number 2217

Abstract:

This paper describes a technique for segmenting video using hidden Markov models (HMMs). Video is segmented into regions defined by shots, shot boundaries, and camera movement within shots. Features for segmentation include an image-based distance between adjacent video frames, an audio distance based on the acoustic difference in intervals just before and after the frames, and an estimate of motion between the two frames. Typical video segmentation algorithms classify shot boundaries by computing an image-based distance between adjacent frames and comparing this distance to fixed, manually determined thresholds; motion and audio information is used separately. In contrast, our segmentation technique allows features to be combined within the HMM framework. Further, thresholds are not required, since automatically trained HMMs take their place. This algorithm has been tested on a video database and has been shown to improve the accuracy of video segmentation over standard threshold-based systems.
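
A minimal sketch of the decoding idea in Python, using the hmmlearn library (an assumption; the paper's state topology, supervised training, and feature extraction are not shown, and hmmlearn fits the model by unsupervised EM rather than from labeled segments):

    import numpy as np
    from hmmlearn import hmm

    # Hypothetical per-frame feature matrix of shape (n_frames, 3):
    # [image distance, audio distance, motion estimate].
    X = np.load("frame_features.npy")

    model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
    model.fit(X)                 # EM training on the feature sequence
    states = model.predict(X)    # Viterbi decoding into segment labels

The decoded state sequence plays the role of the manually tuned thresholds: boundaries fall wherever the most likely state changes.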

ic982217.pdf (From Postscript)




Text-to-Visual Speech Synthesis Based on Parameter Generation from HMM

Authors:

Takashi Masuko, Tokyo Institute of Technology (Japan)
Takao Kobayashi, Tokyo Institute of Technology (Japan)
Masatsune Tamura, Tokyo Institute of Technology (Japan)
Jun Masubuchi, Tokyo Institute of Technology (Japan)
Keiichi Tokuda, Nagoya Institute of Technology (Japan)

Volume 6, Page 3745, Paper number 1706

Abstract:

This paper presents a new technique for synthesizing visual speech from arbitrarily given text. The technique is based on an algorithm for parameter generation from HMMs with dynamic features, which has been successfully applied to text-to-speech synthesis. In the training phase, syllable HMMs are trained with visual speech parameter sequences that represent lip movements. In the synthesis phase, a sentence HMM is constructed by concatenating the syllable HMMs corresponding to the phonetic transcription of the input text. An optimum visual speech parameter sequence is then generated from the sentence HMM in the ML sense. The proposed technique can generate lip movements synchronized with speech in a unified framework. Furthermore, coarticulation is implicitly incorporated into the generated mouth shapes. As a result, the synthetic lip motion becomes smooth and realistic.
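
For the generation step, the standard closed form for this family of methods (a sketch with assumed notation, not quoted from the paper) is as follows: with the static parameters c related to the static-plus-dynamic observation vector by o = W c, and \mu_q, U_q the stacked means and covariances along a chosen state sequence q, the ML solution is

    \hat{c} = (W^\top U_q^{-1} W)^{-1} W^\top U_q^{-1} \mu_q

so the dynamic-feature constraints smooth the generated lip-parameter trajectory instead of letting each state emit its mean independently.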

ic981706.pdf (From Postscript)




Digital Processing of Affective Signals

Authors:

Jennifer A Healey, MIT Media Lab (U.S.A.)
Rosalind W Picard, MIT Media Lab (U.S.A.)

Volume 6, Page 3749, Paper number 2285

Abstract:

Affective signal processing algorithms were developed to allow a digital computer to recognize the affective state of a user who is intentionally expressing that state. This paper describes the method used for collecting the training data, the feature extraction algorithms, and the results of pattern recognition using a Fisher linear discriminant with the leave-one-out test method. Four physiological signals were analyzed: skin conductivity, blood volume pressure, respiration, and an electromyogram (EMG) on the masseter muscle. It was found that anger was well differentiated from peaceful emotions (90%-100%) and that high and low arousal states were distinguished (80%-88%), but positive and negative valence states were difficult to distinguish (50%-82%). Subsets of three emotion states could be well separated (75%-87%), and characteristic patterns for single emotions were found.
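
A minimal sketch of the classification protocol in Python with scikit-learn (an assumption for illustration; the paper's features, labels, and exact discriminant implementation are not reproduced):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X = np.load("affect_features.npy")   # hypothetical (n_samples, n_features) feature matrix
    y = np.load("affect_labels.npy")     # hypothetical emotion labels, one per sample

    clf = LinearDiscriminantAnalysis()   # Fisher-style linear discriminant
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print("leave-one-out accuracy:", scores.mean())

Leave-one-out testing holds out a single sample per fold, which suits the small per-emotion sample counts typical of physiological data collection.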

ic982285.pdf (From Postscript)




Immersive Audio for the Desktop

Authors:

Chris Kyriakakis, University of Southern California (U.S.A.)
Tomlinson Holman, University of Southern California (U.S.A.)

Volume 6, Page 3753, Paper number 2337

Abstract:

Integrated media workstations are increasingly being used for creating, editing, and monitoring sound that is associated with video or computer-generated images. While the requirements for high-quality reproduction in large-scale systems are well understood, they have not yet been adequately translated to the workstation environment. In this paper we discuss several factors that pertain to high-quality sound reproduction at the desktop, including acoustical considerations, signal processing requirements, and listener location issues. We also present a novel desktop system design with integrated listener-tracking capability that circumvents several of the problems faced by current digital audio and video workstations.

ic982337.pdf (From Postscript)




Speech Interaction in Virtual Reality

Authors:

Johannes Mueller, Munich University of Technology (Germany)
Christian Krapichler, GSF, Neuherberg (Germany)
Lam Son Nguyen, Munich University of Technology (Germany)
Karl-Hans Englmeier, GSF, Neuherberg (Germany)
Manfred Lang, Munich University of Technology (Germany)

Volume 6, Page 3757, Paper number 1076

Abstract:

A system for the visualization of three-dimensional anatomical data, derived from Magnetic Resonance Imaging (MRI) or Computed Tomography (CT), enables the physician to navigate through and interact with the patient's 3D scans in a virtual environment. This paper presents the multimodal human-machine interaction, focusing on speech input. For this task, a speech understanding front-end using a special kind of semantic decoder was successfully adopted. Navigation, as well as certain parameters and functions, can now be accessed directly by spoken commands. With the implemented interaction modalities, the speed and efficiency of diagnosis were considerably improved.

ic981076.pdf (From Postscript)




Word Learning in a Multimodal Environment

Authors:

Deb K Roy, MIT Media Lab (U.S.A.)
Alex Pentland, MIT Media Lab (U.S.A.)

Volume 6, Page 3761, Paper number 2153

Abstract:

We are creating human-machine interfaces that let people communicate with machines using natural modalities, including speech and gesture. A problem with current multimodal interfaces is that users are forced to learn the set of words and gestures which the interface understands. We report on a trainable interface that lets the user teach the system words of their choice through natural multimodal interactions.

ic982153.pdf (From Postscript)
