Authors:
Andrew P. Breen, BT Labs, Ecole Nationale Superieure des Telecommunications de Bretagne (France)
O. Gloaguen, BT Labs, Ecole Nationale Superieure des Telecommunications de Bretagne (France)
P. Stern, BT Labs, Ecole Nationale Superieure des Telecommunications de Bretagne (France)
Page (NA) Paper number 390
Abstract:
The subject of computer generated virtual characters is a diverse and
rapidly developing field, with a wide variety of applications in industries
as varied as entertainment, education and advertising. Many of these
applications require or would be greatly enhanced by having the virtual
characters speak with the recorded voice of a real person. Such an
ability is particularly useful in applications where users are interacting
via avatars in real time in a virtual world. There are three basic
problems which need to be addressed when developing an interface with
this functionality:
*) The process must be capable of animating mouth shapes in real time.
*) The process should not mouth extraneous sounds such as music or doors
slamming; to do so would diminish the effectiveness of the illusion.
*) The mouth shapes produced by the avatar should approximate those of
the speaker.
This paper describes a series of experiments which attempt to address
each of the points outlined above. The experimental procedures are based
around a real-time, low-computation approach which relies on a particular
variety of neural network known as the Single Layer Look Up Perceptron (SLLUP).
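The abstract does not spell out the SLLUP's formulation; purely as an illustrative stand-in, the sketch below uses a plain single-layer perceptron that maps a per-frame audio feature vector to a small set of mouth-shape parameters, keeping the per-frame cost to one matrix-vector product and thus compatible with the real-time requirement above.
```python
# Illustrative stand-in only: the SLLUP's exact look-up structure is not given
# in the abstract, so a plain single-layer perceptron is used here to map a
# per-frame audio feature vector to a small set of mouth-shape parameters.
import numpy as np

class SingleLayerMouthMapper:
    def __init__(self, n_features, n_mouth_params, learning_rate=0.01):
        self.W = np.zeros((n_mouth_params, n_features))
        self.b = np.zeros(n_mouth_params)
        self.lr = learning_rate

    def predict(self, features):
        # One matrix-vector product per audio frame: cheap enough for real time.
        return self.W @ features + self.b

    def train_step(self, features, target_mouth_params):
        # Delta-rule update towards hand-labelled mouth shapes.
        error = target_mouth_params - self.predict(features)
        self.W += self.lr * np.outer(error, features)
        self.b += self.lr * error
        return float(np.mean(error ** 2))
```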
Authors:
Phil R. Cohen, Oregon Graduate Institute (USA)
Michael Johnston, Oregon Graduate Institute (USA)
David McGee, Oregon Graduate Institute (USA)
Sharon L. Oviatt, Oregon Graduate Institute (USA)
Joshua Clow, Oregon Graduate Institute (USA)
Ira Smith, Oregon Graduate Institute (USA)
Page (NA) Paper number 571
Abstract:
This paper reports on a case study comparison of a direct-manipulation-based
graphical user interface (GUI) with the QuickSet pen/voice multimodal
interface for supporting the task of military force "laydown." In this
task, a user places military units and "control measures," such as
various types of lines, obstacles, objectives, etc., on a map. A military
expert designed his own scenario and entered it via both interfaces.
Use of QuickSet led to a 3.2- to 8.7-fold speed improvement, depending
on the kind of object being created. These results suggest that there
may be substantial efficiency advantages to using multimodal interaction
over GUIs for map-based tasks.
Authors:
László Czap, University of Miskolc, Department of Automation (Hungary)
Page (NA) Paper number 445
Abstract:
Some research questions regarding speech perception can only be
answered with natural speech stimuli, especially in noisy environments.
In this paper we address several questions about the visual support
of the audio signal in speech recognition. How much support can the
video signal give to the audio signal? What is the impact of the nature
of the noise? How does visual information help to identify the place
of articulation? Do voices with different classes of excitation receive
the same visual support? To answer these questions we performed an
intelligibility study on consonants placed between identical vowels,
presented with or without the speaker's image at different signal-to-noise
ratios. The noise was either white noise or a mixture of other speakers'
voices.
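The abstract does not detail how the noisy stimuli were prepared; as a general illustration of presenting speech at a chosen signal-to-noise ratio, the sketch below mixes a noise recording (white noise or competing speech) into a speech signal at a requested SNR. All names are illustrative.
```python
# Illustrative only: mixes a noise recording into a speech signal at a
# requested signal-to-noise ratio. The paper's actual stimulus preparation is
# not described in the abstract.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    n = min(len(speech), len(noise))
    speech, noise = np.asarray(speech[:n], float), np.asarray(noise[:n], float)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```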
Authors:
Simon Downey, BT Labs (U.K.)
Andrew P. Breen, BT Labs (U.K.)
Maria Fernandez, BT Labs (U.K.)
Edward Kaneen, BT Labs (U.K.)
Page (NA) Paper number 391
Abstract:
Recent developments in distributed system processing have opened the
door to running highly complex systems across a number of networked
computers. This enables the complexity of a system to be hidden behind
a small, lightweight user interface - for example a downloadable web
page. The Maya system makes use of such interfaces to combine the functionality
of speech recognition, synthesis, robust parsing, text generation and
dialogue management into a highly flexible multimodal architecture,
working in real time. This paper describes the development of the
architecture and interfaces to each system component. The configuration
of the system to particular tasks is discussed, making use of an email
secretary task as an example. Once configured, the system is able to
provide all the functionality of a conventional email system and extend
these capabilities by allowing complex queries to be made about mail
messages.
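The Maya message formats and endpoints are not given in the abstract; purely as a hypothetical illustration of the "lightweight client, heavy server" idea, the sketch below sends a user utterance from a small client to a server-side dialogue manager and reads back one reply. Host, port and field names are invented for the example.
```python
# Hypothetical illustration: none of these names (host, port, message fields)
# come from the paper; they only show a thin client handing a user turn to a
# server-side dialogue manager over a socket and reading back one JSON reply.
import json
import socket

def send_user_turn(text, host="localhost", port=9000):
    request = json.dumps({"type": "user_turn", "text": text}).encode("utf-8")
    with socket.create_connection((host, port)) as conn:
        conn.sendall(request + b"\n")
        reply = conn.makefile().readline()
    return json.loads(reply)

# Example of the kind of complex query an email secretary task might accept:
# send_user_turn("Do I have any messages from Maria about the project review?")
```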
Authors:
Mauro Cettolo, ITC-IRST (Italy)
Daniele Falavigna, ITC-IRST (Italy)
Page (NA) Paper number 332
Abstract:
The work reported in this paper concerns the assessment of a set
of modifications applied to the continuous speech recognizer developed
at IRST for dictation tasks. The objective of the proposed modifications
is to improve the recognizer's performance on a corpus of spontaneously
uttered human-human dialogues. Solutions are given to increase
Automatic Speech Recognition (ASR) robustness with respect to typical
spontaneous speech phenomena such as breaths, coughs, filled and silent
pauses, and speaking rate variations. Both gender-independent and
gender-dependent models are used. Specific models of extra-linguistic
phenomena are trained, and a method for coping with speaking rate
variations is proposed. Different recognizers, corresponding to male
and female speakers and to various speaking rate factors, are combined
so that a single search space is defined. The best performance, obtained
on a corpus of hundreds of spontaneously uttered person-to-person
dialogues, is a 26.1% Word Error Rate (WER).
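The paper combines the gender- and speaking-rate-specific recognizers into a single search space; as a simpler stand-in for the same idea of letting the system choose among specialized model sets, this hypothetical sketch decodes an utterance with each recognizer separately and keeps the best-scoring hypothesis. The recognizer objects and their decode() method are assumed, not part of the described system.
```python
# Simplified stand-in (not the paper's single-search-space construction):
# decode an utterance with each gender/speaking-rate-specific recognizer and
# keep the best-scoring hypothesis. Recognizer objects and their
# decode(utterance) -> (hypothesis, log_likelihood) method are assumed.
def decode_with_model_pool(utterance, recognizers):
    best_label, best_hyp, best_score = None, None, float("-inf")
    for label, recognizer in recognizers.items():
        hyp, score = recognizer.decode(utterance)
        if score > best_score:
            best_label, best_hyp, best_score = label, hyp, score
    return best_label, best_hyp

# recognizers might map labels such as "female_fast" or "male_slow" to models.
```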
Authors:
Georg Fries, Deutsche Telekom Berkom GmbH (Germany)
Stefan Feldes, Deutsche Telekom Berkom GmbH (Germany)
Alfred Corbet, Deutsche Telekom Berkom GmbH (Germany)
Page (NA) Paper number 1024
Abstract:
Taking into account that the user acceptance of an animated agent is
influenced by different criteria, including appropriateness of the
application domain, quality of application design, and quality of character
design, we have developed a demo application in a domain where we found
the agent substantially helpful - a web-based city guide. The animated
character interactively guides the user through some sights of the
city of Darmstadt. It can display and explain how to get to different
places of interest by moving around and pointing at locations on a
map. The system allows input via mouse clicks, speech and typed text.
Output modalities of the agent are speech, gesture, text (cartoon word
balloons) and some facial expressions. Furthermore, we have designed two
new 3D characters. We describe experiences gained during system development
and discuss design aspects concerning the application as well as character
animation.
Authors:
Rika Kanzaki, ATR Human Information Processing Research Laboratories (Japan)
Takashi Kato, Kansai University (Japan)
Page (NA) Paper number 243
Abstract:
To investigate the nature of facial information involved in the integration
of audiovisual speech perception, we examined the influence of facial
views on the McGurk effect under two auditory conditions. While the
speech perception of most of the audiovisual syllables used was little
affected by the facial views and the auditory noise, a stronger McGurk
effect was obtained for the 3/4-view image uttering labial sounds when
presented with auditory alveolar syllables under auditory noise. However,
the facial view did not affect the visual identification of labials
in the same way. These results suggest that information about whether a
sound is labial or nonlabial is not the only facial information involved
in the McGurk effect. It appears that some other information, available
only in the 3/4-view image, might also be involved. Implications for
the processing of visual, auditory and audiovisual speech are discussed.
Authors:
Tom Brøndsted, Aalborg University (Denmark)
Lars Bo Larsen, Aalborg University (Denmark)
Michael Manthey, Aalborg University (Denmark)
Paul McKevitt, Aalborg University (Denmark)
Thomas B. Moeslund, Aalborg University (Denmark)
Kristian G. Olesen, Aalborg University (Denmark)
Page (NA) Paper number 767
Abstract:
This paper presents a generic environment for intelligent multimedia
applications, denoted "The Intellimedia Work-Bench". The aim
of the workbench is to facilitate development and research within the
field of multimodal user interaction. Physically, it is a table with
various devices mounted above and around it. These include a camera and
a laser projector mounted above the workbench, a microphone array mounted
on the walls of the room, a speech recogniser and a speech synthesiser.
The camera is attached to a vision system capable of locating various
objects placed on the workbench. The paper presents two applications
utilising the workbench. One is a campus information system, allowing
the user to ask for directions within a part of the university campus.
The second application is a pool trainer, intended to provide guidance
to novice players.
Authors:
Joshua Clow, Oregon Graduate Institute (USA)
Sharon L. Oviatt, Oregon Graduate Institute (USA)
Page (NA) Paper number 50
Abstract:
In this paper we describe STAMP, a new automated suite of tools for
capturing and analyzing data on multimodal systems. STAMP is designed
to support research and development efforts for advancing next-generation
multimodal systems. STAMP permits researchers to analyze multimodal
system performance by: (1) recording data on users' multimodal input
and the system's responses, (2) supporting flexible replay of these
multimodal commands, along with n-best recognition lists for the individual
modalities and their combined multimodal interpretation, and (3) supporting
automated analysis using different metrics of multimodal system performance.
This collection of tools is currently being used to conduct basic research
on the characteristics of multimodal systems, and also to iterate on different
aspects of the QuickSet multimodal architecture.
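STAMP's actual data format is not described here; as a hypothetical illustration of the kind of record that points (1)-(3) imply, the sketch below logs one multimodal command together with per-modality n-best lists, the fused interpretation and the system response, so it can be replayed and scored later.
```python
# Hypothetical record format, not STAMP's actual one: one logged multimodal
# command with per-modality n-best lists, the fused interpretation and the
# system response, suitable for later replay and scoring.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ModalityNBest:
    modality: str                          # e.g. "speech" or "pen"
    hypotheses: List[Tuple[str, float]]    # (hypothesis, score), best first

@dataclass
class MultimodalCommand:
    timestamp: float
    inputs: List[ModalityNBest]
    fused_interpretation: str              # combined multimodal interpretation
    system_response: str

    def top_hypotheses(self):
        # Convenience view used when computing accuracy-style metrics.
        return {m.modality: m.hypotheses[0][0] for m in self.inputs if m.hypotheses}
```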
Authors:
Sumi Shigeno, Kitasato University (Japan)
Page (NA) Paper number 1057
Abstract:
Cultural similarities and differences were compared between Japanese
and North American subjects in the recognition of emotion. Seven native
Japanese and five native North American subjects (four Americans and
one Canadian) participated in the experiments. The materials were five
meaningful words or short sentences in Japanese and American English.
Japanese and American actors produced vocal and facial expressions in
order to convey six basic emotions: happiness, surprise, anger, disgust,
fear, and sadness. Three presentation conditions were used: auditory,
visual, and audio-visual. The audio-visual stimuli were made by dubbing
the auditory stimuli onto the visual stimuli. The results show that: (1)
subjects can more easily recognize the vocal expression of a speaker
who belongs to their own culture, (2) Japanese subjects are not good
at recognizing 'fear' in both the auditory-alone and visual-alone conditions,
and (3) both Japanese and American subjects identify audio-visually
incongruent stimuli more often with the visual label than with the
auditory label. These results suggest that it is difficult to identify
the emotion of a speaker from a different culture and that people
predominantly use visual information to identify emotion.
Authors:
Toshiyuki Takezawa, ATR Interpreting Telecommunications Research Laboratories (Japan)
Tsuyoshi Morimoto, Fukuoka University (Japan)
Page (NA) Paper number 958
Abstract:
We have built a multimodal-input multimedia-output guidance system
called MMGS. A user's input can be a combination of speech and
hand-written gestures. The system, in turn, outputs a response
that combines speech, three-dimensional graphics, and/or other information.
This system can interact cooperatively with the user by resolving
ellipsis/anaphora and various ambiguities, such as those caused by speech
recognition errors. It is currently implemented on an SGI workstation
and achieves nearly real-time processing.
Authors:
Oscar Vanegas, Nagoya Institute of Technology (Japan)
Akiji Tanaka, Nagoya Institute of Technology (Japan)
Keiichi Tokuda, Nagoya Institute of Technology (Japan)
Tadashi Kitamura, Nagoya Institute of Technology (Japan)
Page (NA) Paper number 789
Abstract:
This paper describes intensity and location normalization techniques
for improving the performance of visual speech recognizers used in
audio-visual speech recognition. For auditory speech recognition,
there exist many methods for dealing with channel characteristics and
speaker individualities, e.g., CMN (cepstral mean normalization), SAT
(speaker adaptive training). We present two techniques similar to
CMN and SAT, respectively, for intensity and location normalization
in visual speech recognition. For the intensity normalization, the
mean value over the image sequence is subtracted from each pixel of
the image sequence. For the location normalization, the training and
testing processes are carried out by finding, for each utterance, the
lip position with the highest likelihood with respect to the HMMs. Word
recognition experiments based on HMMs show that a significant improvement in recognition
performance is achieved by combining the two techniques.
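As a minimal sketch of the intensity normalization described above (assuming the per-pixel mean over the image sequence is the quantity being subtracted, by analogy with cepstral mean normalization):
```python
# Minimal sketch: subtract the per-pixel mean over the whole image sequence
# from every frame, analogously to cepstral mean normalization for audio.
import numpy as np

def intensity_normalize(frames):
    """frames: array of shape (num_frames, height, width) of lip-region images."""
    frames = np.asarray(frames, dtype=np.float64)
    mean_image = frames.mean(axis=0)       # per-pixel mean over the sequence
    return frames - mean_image
```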
Authors:
Yanjun Xu, Inst Acoustics, Chinese Acad Sci (China)
Limin Du, Inst Acoustics, Chinese Acad Sci (China)
Guoqiang Li, Inst Acoustics, Chinese Acad Sci (China)
Ziqiang Hou, Inst Acoustics, Chinese Acad Sci (China)
Page (NA) Paper number 187
Abstract:
Visual feature extraction has become the key technique in automatic
speechreading systems. However, it remains a difficult problem
due to large inter-person and intra-person appearance variability.
In this paper, we extend the standard active shape model to a hierarchical
probability-based framework, which can model a complex shape such
as the human face. It decomposes the complex shape into two layers:
a global shape, comprising the position, scale and rotation of local
shapes (such as the eyes, nose, mouth and chin), and the local simple
shapes in normalized form. The two layers describe the global and local
variation respectively, and are combined in a probabilistic framework.
The method can perform fully automatic facial feature location for speechreading,
or face recognition.
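As an illustrative sketch of the two-layer decomposition described above, the following code stores each local shape in normalized form and places it with a global similarity transform (position, scale, rotation). The class and field names are invented for the example.
```python
# Illustrative names only: a local shape stored in normalized form plus a
# global placement (position, scale, rotation) that maps it into the image.
from dataclasses import dataclass
import numpy as np

@dataclass
class LocalShape:
    name: str                 # e.g. "mouth", "left_eye"
    points: np.ndarray        # (N, 2) landmark points in normalized form

@dataclass
class GlobalPlacement:
    position: np.ndarray      # (2,) translation in the image
    scale: float
    rotation: float           # radians

    def apply(self, shape: LocalShape) -> np.ndarray:
        # Similarity transform: rotate and scale the normalized points,
        # then translate them to the global position.
        c, s = np.cos(self.rotation), np.sin(self.rotation)
        R = np.array([[c, -s], [s, c]])
        return self.scale * shape.points @ R.T + self.position
```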
Authors:
Jörn Ostermann, AT&T Labs - Research (USA)
Mark C. Beutnagel, AT&T Labs - Research (USA)
Ariel Fischer, Eurecom/EPFL (France)
Yao Wang, Polytechnic University, Brooklyn (USA)
Page (NA) Paper number 931
Abstract:
The integration of text-to-speech (TTS) synthesis and animation of
synthetic faces allows new applications such as visual human-computer
interfaces using agents or avatars. The TTS informs the talking head
when phonemes are spoken. The appropriate mouth shapes are animated
and rendered while the TTS produces the sound. We call this integrated
system of TTS and animation a Visual TTS (VTTS). This paper describes
the architecture of an integrated VTTS synthesizer that allows facial
expressions to be defined as bookmarks in the text that will be animated
while the model is talking. The position of a bookmark in the text defines
the start time for the facial expression. The bookmark itself names
the expression, its amplitude and the duration within which the amplitude
has to be reached by the face. A bookmark-to-face-animation-parameter
(FAP) converter creates a curve defining the amplitude of the given
FAP over time using third-order Hermite functions [http://www.research.att.com/info/osterman].
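The exact tangent choices of the bookmark-to-FAP converter are not given in the abstract; as an illustrative sketch, the curve below uses one cubic Hermite segment with zero slopes at both ends, rising from the current amplitude at the bookmark's start time to the target amplitude over the stated duration.
```python
# Illustrative sketch: a single cubic Hermite segment with zero end slopes that
# ramps a FAP amplitude from its current value to the bookmark's target value
# over the bookmark's duration. The converter's actual tangents are not stated.
import numpy as np

def fap_amplitude_curve(start_time, duration, target_amplitude, t, start_amplitude=0.0):
    u = np.clip((np.asarray(t, dtype=float) - start_time) / duration, 0.0, 1.0)
    h01 = 3.0 * u**2 - 2.0 * u**3          # Hermite basis term with zero end slopes
    return start_amplitude + (target_amplitude - start_amplitude) * h01

# Example: a "smile" bookmark at t = 1.2 s reaching amplitude 0.8 within 0.5 s.
# times = np.linspace(0.0, 3.0, 301)
# curve = fap_amplitude_curve(1.2, 0.5, 0.8, times)
```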