ICASSP '98 Abstracts - MMSP2


 
MMSP2.1

   
Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition
G. Potamianos, H. Graf  (AT&T Labs, USA)
We propose the use of discriminative training by means of the generalized probabilistic descent (GPD) algorithm to estimate hidden Markov model (HMM) stream exponents for audio-visual speech recognition. Synchronized audio and visual features are used to train audio-only and visual-only single-stream HMMs of identical topology, respectively, by maximum likelihood. A two-stream HMM is then obtained by combining the two single-stream HMMs and introducing exponents that weight the log-likelihood of each stream. We present the GPD algorithm for stream exponent estimation, consider a possible initialization, and apply it to the single-speaker connected-letters task of the AT&T bimodal database. We demonstrate the superior performance of the resulting multi-stream HMM over the audio-only, visual-only, and audio-visual single-stream HMMs.
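The stream-exponent idea can be written down compactly. The following is a minimal illustrative sketch, not the authors' code: the combined observation score weights the audio and visual stream log-likelihoods by exponents, and a single GPD step moves an exponent against the gradient of a sigmoid-smoothed misclassification measure. The tied parameterization lam / (1 - lam), the step sizes, and all function and variable names are assumptions made for the example.

    import numpy as np

    def two_stream_log_likelihood(log_b_audio, log_b_visual, lam):
        """Observation log-score of a two-stream HMM state: the audio and
        visual stream log-likelihoods are weighted by the stream exponents
        lam and (1 - lam).  Tying the exponents to sum to one is one common
        choice; the paper's exact parameterization may differ."""
        return lam * log_b_audio + (1.0 - lam) * log_b_visual

    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gpd_step(lam, d_correct, d_competitor, grad_correct, grad_competitor,
                 eta=0.1, alpha=1.0):
        """One generalized probabilistic descent (GPD) update of the stream
        exponent.  d_correct / d_competitor are discriminant scores (e.g.
        utterance log-likelihoods) of the correct and best competing models;
        grad_* are their derivatives with respect to lam.  The misclassification
        measure d = d_competitor - d_correct is smoothed with a sigmoid to give
        a differentiable loss, and lam is moved downhill."""
        d = d_competitor - d_correct
        s = _sigmoid(alpha * d)
        loss_grad = alpha * s * (1.0 - s) * (grad_competitor - grad_correct)
        return lam - eta * loss_grad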
 
MMSP2.2

   
A Hybrid Real-Time Face Tracking System
C. Wang, M. Brandstein  (Harvard University, USA)
A hybrid real-time face tracker based on both sound and visual cues is presented. Initial talker locations are estimated acoustically from microphone array data, while precise localization and tracking are derived from image information. A computationally efficient algorithm for face detection via motion analysis is employed to track individual faces at rates up to 30 frames per second. The system is robust to nonlinear source motions, complex backgrounds, varying lighting conditions, and a variety of source-camera depths. While the direct focus of this work is automated video conferencing, the face tracking capability is useful for many multimedia and virtual reality applications.
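The acoustic step above, estimating initial talker locations from microphone-array data, is commonly built on time-difference-of-arrival (TDOA) estimation between microphone pairs. The sketch below shows one standard estimator, GCC-PHAT, purely as an illustration; the abstract does not state which estimator the system uses, and the function name and interface are hypothetical.

    import numpy as np

    def gccphat_tdoa(x1, x2, fs):
        """Estimate the time difference of arrival (seconds) between two
        microphone signals using the generalized cross-correlation with
        phase transform (GCC-PHAT).  A positive value means x1 lags x2."""
        n = len(x1) + len(x2)
        X1 = np.fft.rfft(x1, n=n)
        X2 = np.fft.rfft(x2, n=n)
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12        # phase transform: keep phase only
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        delay = np.argmax(np.abs(cc)) - max_shift
        return delay / fs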
 
MMSP2.3

   
A Hidden Markov Model Framework for Video Segmentation Using Audio and Image Features
J. Boreczky, L. Wilcox  (FX Palo Alto Lab, USA)
This paper describes a technique for segmenting video using hidden Markov models (HMMs). Video is segmented into regions defined by shots, shot boundaries, and camera movement within shots. Features for segmentation include an image-based distance between adjacent video frames, an audio distance based on the acoustic difference in intervals just before and after the frames, and an estimate of motion between the two frames. Typical video segmentation algorithms classify shot boundaries by computing an image-based distance between adjacent frames and comparing this distance to fixed, manually determined thresholds; motion and audio information is used separately. In contrast, our segmentation technique allows features to be combined within the HMM framework. Further, thresholds are not required, since automatically trained HMMs take their place. This algorithm has been tested on a video database and has been shown to improve the accuracy of video segmentation over standard threshold-based systems.
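Once the HMM is trained, segmentation amounts to decoding the most likely state sequence over the per-frame feature vectors. A minimal Viterbi decoder is sketched below to illustrate that step; the state inventory, feature dimensionality, and all names are assumptions for the example, not the authors' implementation.

    import numpy as np

    def viterbi(log_trans, log_obs, log_init):
        """Most likely HMM state sequence given per-frame observation
        log-likelihoods.  States could correspond to labels such as "shot",
        "cut", or "camera motion"; log_obs[t, s] is the log-likelihood of the
        frame-t feature vector (image distance, audio distance, motion
        estimate) under state s."""
        T, S = log_obs.shape
        delta = np.full((T, S), -np.inf)      # best score ending in each state
        psi = np.zeros((T, S), dtype=int)     # backpointers
        delta[0] = log_init + log_obs[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_trans   # S x S transition scores
            psi[t] = np.argmax(scores, axis=0)
            delta[t] = scores[psi[t], np.arange(S)] + log_obs[t]
        path = np.zeros(T, dtype=int)
        path[-1] = np.argmax(delta[-1])
        for t in range(T - 2, -1, -1):
            path[t] = psi[t + 1, path[t + 1]]
        return path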
 
MMSP2.4

   
Text-to-Visual Speech Synthesis Based on Parameter Generation from HMM
T. Masuko, T. Kobayashi, M. Tamura, J. Masubuchi  (Tokyo Institute of Technology, Japan);   K. Tokuda  (Nagoya Institute of Technology, Japan)
This paper presents a new technique for synthesizing visual speech from arbitrarily given text. The technique is based on an algorithm for parameter generation from HMM with dynamic features, which has been successfully applied to text-to-speech synthesis. In the training phase, syllable HMMs are trained with visual speech parameter sequences that represent lip movements. In the synthesis phase, a sentence HMM is constructed by concatenating syllable HMMs corresponding to the phonetic transcription of the input text. An optimum visual speech parameter sequence is then generated from the sentence HMM in the maximum likelihood (ML) sense. The proposed technique can generate lip movements synchronized with speech in a unified framework. Furthermore, coarticulation is implicitly incorporated into the generated mouth shapes. As a result, the synthetic lip motion becomes smooth and realistic.
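For a fixed state alignment, parameter generation from an HMM with dynamic features reduces to a weighted least-squares problem: the ML static parameter trajectory c maximizes the Gaussian likelihood of W c, where W is the matrix that computes static and delta features from c. The sketch below solves the resulting normal equations; it is only an illustration of that closed-form step (dense solve for clarity, whereas practical implementations exploit the band structure of W), and the shapes and names are assumptions.

    import numpy as np

    def generate_trajectory(W, mean, var):
        """ML static parameter sequence c for a given state alignment.
        Solves (W' U^-1 W) c = W' U^-1 mean with U = diag(var), where mean
        and var concatenate the state means/variances of the static and
        dynamic features over all frames, and W stacks the delta windows."""
        U_inv = 1.0 / var                 # inverse of the diagonal covariance
        WtU = W.T * U_inv                 # W' U^-1 via broadcasting
        c = np.linalg.solve(WtU @ W, WtU @ mean)
        return c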
 
MMSP2.5

   
Digital Processing of Affective Signals
J. Healey, R. Picard  (MIT Media Lab, USA)
Affective signal processing algorithms were developed to allow a digital computer to recognize the affective state of a user who is intentionally expressing that state. This paper describes the method used for collecting the training data, the feature extraction algorithms used, and the results of pattern recognition using a Fisher linear discriminant and the leave-one-out test method. Four physiological signals were analyzed: skin conductivity, blood volume pressure, respiration, and an electromyogram (EMG) on the masseter muscle. It was found that anger was well differentiated from peaceful emotions (90%-100%) and that high and low arousal states were distinguished (80%-88%), but positive and negative valence states were difficult to distinguish (50%-82%). Subsets of three emotion states could be well separated (75%-87%), and characteristic patterns for single emotions were found.
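The evaluation protocol named above, a linear discriminant scored by leave-one-out cross-validation, can be illustrated with modern tooling. The sketch below uses scikit-learn as a stand-in for the original implementation; the array names and feature layout are assumptions for the example.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import LeaveOneOut

    def loo_accuracy(features, labels):
        """Leave-one-out evaluation of a linear discriminant classifier.
        `features` is an (n_samples, n_features) array of per-trial
        physiological features; `labels` holds the emotion class of each
        trial.  Each sample is held out once, the classifier is trained on
        the rest, and the held-out prediction is scored."""
        loo = LeaveOneOut()
        correct = 0
        for train_idx, test_idx in loo.split(features):
            clf = LinearDiscriminantAnalysis()
            clf.fit(features[train_idx], labels[train_idx])
            correct += int(clf.predict(features[test_idx])[0] == labels[test_idx][0])
        return correct / len(features)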
 
MMSP2.6

   
Immersive Audio for the Desktop
C. Kyriakakis, T. Holman  (USC IMSC, USA)
Integrated media workstations are increasingly being used for creating, editing, and monitoring sound that is associated with video or computer-generated images. While the requirements for high quality reproduction in large-scale systems are well understood, these have not yet been adequately translated to the workstation environment. In this paper we discuss several factors that pertain to high quality sound reproduction at the desktop, including acoustical considerations, signal processing requirements, and listener location issues. We also present a novel desktop system design with integrated listener-tracking capability that circumvents several of the problems faced by current digital audio and video workstations.
 
MMSP2.7

   
Speech Interaction in Virtual Reality
J. Mueller  (Munich University of Technology, Germany);   C. Krapichler  (GSF, Neuherberg, Germany);   L. Nguyen  (Munich University of Technology, Germany);   K. Englmeier  (GSF, Neuherberg, Germany);   M. Lang  (Munich University of Technology, Germany)
A system for the visualization of three-dimensional anatomical data, derived from Magnetic Resonance Imaging (MRI) or Computed Tomography (CT), enables the physician to navigate through and interact with the patient's 3D scans in a virtual environment. This paper presents the multimodal human-machine interaction, focusing on speech input. For this task, a speech understanding front-end using a special kind of semantic decoder was successfully adopted. Navigation, as well as certain parameters and functions, can now be accessed directly by spoken commands. With the implemented interaction modalities, the speed and efficiency of diagnosis could be considerably improved.
 
MMSP2.8

   
Word Learning in a Multimodal Environment
D. Roy, A. Pentland  (MIT Media Lab, USA)
We are creating human-machine interfaces that let people communicate with machines using natural modalities, including speech and gesture. A problem with current multimodal interfaces is that users are forced to learn the set of words and gestures that the interface understands. We report on a trainable interface that lets users teach the system words of their choice through natural multimodal interactions.
 
