James L. Flanagan, Rutgers University (U.S.A.)
Ivan Marsic, Rutgers University (U.S.A.)
Multimedia interfaces are rapidly evolving to facilitate human/machine communication. Most of the technologies on which they are based are, as yet, imperfect. But, the interfaces do begin to allow information exchange in ways familiar and comfortable to the human--principally through natural actions in the sensory dimensions of sight, sound and touch. Further, as digital networking becomes ubiquitous, the opportunity grows for collaborative work through conferenced computing. In this context the machine takes on the role of mediator in human/machine/human communication--the ideal being to extend the intellectual abilities of humans through access to distributed information resources and collective decision making. The challenge is how to design machine mediation so that it extends, not impedes, human abilities. This report describes evolving work to incorporate multimodal interfaces into a networked system for collaborative distributed computing. It also addresses strategies for quantifying the synergies that may be gained.
Alex Waibel, Carnegie Mellon University (U.S.A.)
Bernhard Suhm, Carnegie Mellon University (U.S.A.)
Minh Tu Vo, Carnegie Mellon University (U.S.A.)
Jie Yang, Carnegie Mellon University (U.S.A.)
When humans communicate, they take advantage of a rich spectrum of cues. Some are verbal and acoustic; some are non-verbal and non-acoustic. Signal processing technology has devoted much attention to the recognition of speech as a single human communication signal. Most other complementary communication cues, however, remain unexplored and unused in human-computer interaction. In this paper we show that the addition of non-acoustic or non-verbal cues can significantly enhance the robustness, flexibility, naturalness, and performance of human-computer interaction. We demonstrate computer agents that use speech, gesture, handwriting, pointing, and spelling jointly for more robust, natural, and flexible human-computer interaction in the various tasks of an information worker: information creation, access, manipulation, and dissemination.
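As an illustration of how jointly using several modalities can improve robustness, the sketch below shows a generic late-fusion scheme: each recognizer produces scored candidate interpretations, and a weighted combination of the scores picks the overall winner. The candidates, scores, and weights are invented for illustration and are not the authors' actual fusion method.

```python
# Hypothetical late fusion of multimodal recognizer outputs: each modality
# returns candidate interpretations with confidence scores, and a weighted
# sum of the scores selects the most plausible joint interpretation.
def fuse(hypotheses_per_modality, weights):
    """Combine per-modality candidate scores into one ranked list."""
    combined = {}
    for modality, hypotheses in hypotheses_per_modality.items():
        for candidate, score in hypotheses.items():
            combined[candidate] = combined.get(candidate, 0.0) + weights[modality] * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# The speech recognizer is unsure between two commands; a pointing gesture
# at a file icon tips the decision toward the file-related interpretation.
speech = {"open file": 0.6, "open mail": 0.4}
gesture = {"open file": 0.9, "open mail": 0.1}
print(fuse({"speech": speech, "gesture": gesture},
           {"speech": 0.7, "gesture": 0.3}))
```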
Alex Pentland, MIT Media Lab (U.S.A.)
We are working to develop smart networked environments that can help people in their homes, offices, cars, and when walking about. Our research is aimed at giving rooms, desks, and clothes the perceptual and cognitive intelligence needed to become active helpers.
Nikil Jayant, Bell Laboratories (U.S.A.)
Voice and gesture represent fundamental and universal modalities in interhuman communication. With recent advances in automatic methods of speech recognition and synthesis, human-machine interaction by voice is rapidly becoming a technological and commercial reality. Although less mature and less widely deployed, gesture recognition by machine is becoming reliable enough to be considered a serious supplement to the voice interface between humans and machines.
Tsuhan Chen, AT&T Labs - Research (U.S.A.)
Ram R. Rao, Georgia Institute of Technology (U.S.A.)
To many people, the word "multimedia" simply means the combination of various forms of information: text, speech, music, images, graphics and video. What is often overlooked is the interaction among these forms. In this paper, we will present our recent results in exploiting the audio-visual interaction that is very significant in multimedia communication. The applications include lip synchronization, joint audio-video coding, and person verification. We will present the enabling technologies, including audio-to-visual mapping and facial image analysis, for these applications. Our results show that the joint processing of audio and video provides advantages that are not available when audio and video are studied separately.
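As a rough illustration of what an audio-to-visual mapping involves, the sketch below fits a simple linear regression from acoustic feature vectors to mouth-shape parameters on synthetic data. The data, dimensions, and the linear model are assumptions made for illustration; the systems described above may use quite different mappings.

```python
import numpy as np

# Hypothetical paired training data: acoustic feature vectors (e.g. cepstra)
# and corresponding visual mouth-shape parameters for the same frames.
rng = np.random.default_rng(0)
acoustic = rng.standard_normal((1000, 12))      # 1000 frames, 12 acoustic features
visual = acoustic @ rng.standard_normal((12, 4)) + 0.1 * rng.standard_normal((1000, 4))

# Audio-to-visual mapping learned as a least-squares linear regression; real
# systems often use more powerful mappings (neural networks, HMMs, VQ).
A = np.hstack([acoustic, np.ones((len(acoustic), 1))])     # append bias term
W, *_ = np.linalg.lstsq(A, visual, rcond=None)

def audio_to_visual(frame):
    """Predict mouth-shape parameters for one acoustic feature vector."""
    return np.append(frame, 1.0) @ W

print(audio_to_visual(acoustic[0]), visual[0])
```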
F. Lavagetto, University of Genova (Italy)
S. Lepsoy, University of Genova (Italy)
C. Braccini, University of Genova (Italy)
S. Curinga, University of Genova (Italy)
Recent advances in joint acoustical/visual analysis for model-based lip motion synthesis are presented. The 2D lip motion field is modeled as a linear combination of low-dimensional motion basis vectors computed through Principal Component Analysis (PCA). The vector of PCA coefficients is expressed as a function of a limited set of articulatory parameters which describe the external appearance of the mouth. The acoustical processing estimates these articulatory parameters directly from the speech waveform using a neural processing stage, namely a bank of Time Delay Neural Networks (TDNNs). The results have been subjectively evaluated by visualizing the estimated motion on a wire-frame mouth template presented in synchronization with speech. The experiments carried out so far use single-speaker-trained TDNNs and single-speaker PCA, but suitable algorithms for generalizing the techniques are currently under investigation.
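The following sketch illustrates the core representational idea with NumPy: a set of 2D lip motion fields is reduced to a low-dimensional PCA basis, and a motion field is then approximated as the mean plus a linear combination of basis vectors. The data shapes and number of components are invented, and the TDNN stage that would predict the coefficients from speech is not shown.

```python
import numpy as np

# Hypothetical training data: each row is a flattened 2D lip motion field
# (x/y displacements of mouth contour points) for one video frame.
rng = np.random.default_rng(0)
motions = rng.standard_normal((500, 40))        # 500 frames, 20 points x 2 coords

# Principal Component Analysis of the motion fields via SVD.
mean_motion = motions.mean(axis=0)
_, _, Vt = np.linalg.svd(motions - mean_motion, full_matrices=False)
basis = Vt[:8]                                  # low-dimensional motion basis (8 components)

def synthesize(coefficients):
    """Reconstruct a lip motion field as mean + linear combination of basis vectors."""
    return mean_motion + coefficients @ basis

# In the system described above, the coefficients would be derived from
# articulatory parameters estimated from the speech waveform; here we simply
# project one training frame onto the basis for illustration.
coeffs = (motions[0] - mean_motion) @ basis.T
print(np.abs(synthesize(coeffs) - motions[0]).mean())
```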
Hong Wang, PictureTel (U.S.A.)
Peter Chu, PictureTel (U.S.A.)
This paper describes the voice source localization algorithm used in the PictureTel automatic camera pointing system (LimeLight(TM), Dynamic Speech Locating Technology). The system uses a 46 cm wide by 30 cm high array containing four microphones, mounted on top of the monitor. The three-dimensional position of a sound source is calculated from the time delays measured across four microphone pairs. For time delay estimation, averaging of signal onsets in each frequency band is combined with phase correlation to reduce the influence of noise and reverberation. With this approach, reliable three-dimensional voice source localization is possible with a small microphone array. Post-processing based on a priori knowledge is also introduced to eliminate the influence of reflections from furniture such as tables. Results of speech source localization under real conference room conditions will be given, and some system-related issues will also be discussed.
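A minimal sketch of time-delay estimation by phase correlation (often called GCC-PHAT) for a single microphone pair is given below. It is not PictureTel's algorithm and omits both the onset averaging and the geometric solution for the 3D source position; signal lengths and the sampling rate are assumptions.

```python
import numpy as np

def phase_correlation_delay(x, y, fs):
    """Estimate the delay (in seconds) by which signal y lags signal x, using a
    phase-transform-weighted cross-correlation that whitens the spectrum to
    reduce the influence of reverberation on the correlation peak."""
    n = len(x) + len(y)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    cross = np.conj(X) * Y
    cross /= np.abs(cross) + 1e-12              # keep phase, discard magnitude
    corr = np.fft.irfft(cross, n)
    max_lag = n // 2
    corr = np.concatenate((corr[-max_lag:], corr[:max_lag + 1]))
    return (np.argmax(np.abs(corr)) - max_lag) / fs

# Simulated example: y is x delayed by 5 samples, so the estimate is 5 / fs.
# With delays from several microphone pairs and the known array geometry, the
# 3D source position could then be found by triangulation.
fs = 16000
x = np.random.default_rng(1).standard_normal(1024)
y = np.concatenate((np.zeros(5), x[:-5]))
print(phase_correlation_delay(x, y, fs) * fs)   # approximately 5 samples
```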
Akihito Akutsu, NTT Human Interface Laboratories (Japan)
Yoshinobu Tonomura, NTT Human Interface Laboratories (Japan)
Hiroshi Hamada, NTT Human Interface Laboratories (Japan)
Because digital video is becoming increasingly important to the networked multimedia society, the audio-visual access environment should allow us to do more than just passively watch. We propose a new video user interface concept made possible by multi-dimensional video computing, which offers a framework for analyzing a video, creating new structures, and restyling and visualizing the video according to the user's demands. The video interface visualizes a video's content and context structure in a comprehensible way, allowing intuitive access to the spatiotemporal information in the video. In this paper, we introduce our research activities toward a video interface based on information extracted from the video. New video interfaces called VideoBrowser, PanoramaVideo, and VideoJigsaw are described.
Alexander G. Hauptmann, Carnegie Mellon University (U.S.A.)
Howard D. Wactlar, Carnegie Mellon University (U.S.A.)
The Informedia Digital Library Project allows full content indexing and retrieval of text, audio and video material. The integration of speech recognition, image processing, natural language processing and information retrieval overcomes limits in each technology to create a useful system. To answer the question of how accurate speech recognition must be for recognizer-generated transcripts to be useful and usable for indexing and retrieval, we present empirical evidence illustrating how information retrieval degrades at different levels of speech recognition accuracy. In our experiments, word error rates up to 25% did not significantly impact information retrieval, and error rates of 50% still provided 85 to 95% of the recall and precision achieved with fully accurate transcripts in the same retrieval system.
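A small sketch of the kind of measurement reported above: retrieval precision and recall obtained with recognizer-generated transcripts, expressed relative to the same retrieval system run on fully accurate transcripts. The document identifiers and result sets below are invented purely to show the computation.

```python
def precision_recall(retrieved, relevant):
    """Standard precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def relative_performance(results_asr, results_perfect, relevant):
    """Precision and recall on ASR transcripts as fractions of the values
    achieved with fully accurate transcripts (the measure quoted above)."""
    p_asr, r_asr = precision_recall(results_asr, relevant)
    p_ref, r_ref = precision_recall(results_perfect, relevant)
    return (p_asr / p_ref if p_ref else 0.0, r_asr / r_ref if r_ref else 0.0)

# Invented example: documents retrieved from errorful vs. perfect transcripts
# for one query whose relevant documents are known.
asr_hits = ["d1", "d2", "d5"]
perfect_hits = ["d1", "d2", "d3", "d5"]
relevant = ["d1", "d2", "d3", "d4"]
print(relative_performance(asr_hits, perfect_hits, relevant))
```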
Steve J. Young, Cambridge University Engineering Dept (U.K.)
Jonathan T. Foote, Cambridge University Engineering Dept (U.K.)
Gareth J.F. Jones, Cambridge University Engineering Dept (U.K.)
Karen Spärck Jones, Cambridge University Computer Lab (U.K.)
Martin G. Brown, ORL Ltd (U.K.)
This paper reviews the Video Mail Retrieval (VMR) project at Cambridge University and ORL. The VMR project began in September 1993 with the aim of developing methods for retrieving video documents by scanning the audio soundtrack for keywords. The project has shown, both experimentally and through the construction of a working prototype, that speech recognition can be combined with information retrieval methods to locate multimedia documents by content. The final version of the VMR system uses pre-computed phone lattices to allow extremely rapid word spotting and audio indexing, and statistical information retrieval (IR) methods to mitigate the effects of spotting errors. The net result is a retrieval system that is open-vocabulary and speaker-independent, and which can search audio orders of magnitude faster than real time.
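The sketch below illustrates the general idea of indexing spotted keywords with a statistical (tf-idf style) weighting, so that keywords spotted in nearly every message contribute little and occasional false alarms are diluted. The messages, keyword hits, and scoring formula are invented for illustration and do not reproduce the VMR system's phone-lattice word spotting or its IR model.

```python
import math
from collections import Counter

# Hypothetical word-spotter output: keywords detected in the soundtrack of
# each video mail message (possibly including false alarms and misses).
spotted = {
    "msg1": ["meeting", "budget", "budget", "demo"],
    "msg2": ["meeting", "holiday"],
    "msg3": ["budget", "demo", "demo"],
}

def retrieve(query_keywords, spotted_docs):
    """Rank messages by a tf-idf style score over spotted keyword hits."""
    n_docs = len(spotted_docs)
    doc_freq = Counter()
    for hits in spotted_docs.values():
        doc_freq.update(set(hits))          # how many messages contain each keyword
    scores = {}
    for doc, hits in spotted_docs.items():
        term_freq = Counter(hits)
        scores[doc] = sum(
            (1 + math.log(term_freq[kw])) * math.log(n_docs / doc_freq[kw])
            for kw in query_keywords if term_freq[kw])
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(retrieve(["budget", "demo"], spotted))
```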
Francis Kubala, BBN (U.S.A.)
Hubert Jin, BBN (U.S.A.)
Long Nguyen, BBN (U.S.A.)
Richard Schwartz, BBN (U.S.A.)
Spyros Matsoukas, Northeastern University (U.S.A.)
In this paper we describe our recent work on automatic transcription of radio and television news broadcasts. This problem is very challenging for large-vocabulary speech recognition because of the frequent and unpredictable changes that occur in speaker, speaking style, topic, channel, and background conditions. Faced with such a problem, one is strongly tempted to carve the input into separable classes and deal with each one independently. In our early work on this problem, however, we are finding that the rewards for condition-specific techniques are disappointingly small. This is forcing us to look for general, robust, and adaptive algorithms for dealing with extremely variable data. Herein, we describe the BBN BYBLOS recognition system configured for off-line transcription and characterize the speech contained in the 1996 DARPA Hub-4 testbed. On the partitioned development test set, we achieved a 29.4% overall word error rate.
Ryohei Nakatsu, ATR-MIC (Japan)
In the areas of image/speech processing, researchers have long dreamed of producing computer agents that can communicate with people in a human-like way. Although the non-verbal aspects of communication, such as emotion-based communication, play very important roles in our daily lives, most research so far has concentrated on the verbal aspects of communication and has neglected the non-verbal aspects. To achieve human-like agents we have adopted a two-pronged approach: (1) to provide agents with non-verbal communication capability, engineers have started research on emotion recognition and facial expression recognition; (2) artists have begun to design and generate the reactions and behaviors of agents, to close the gap between real human behaviors and those of computer agents.