Speech Technology Applications and Human-Machine Interface 1


Steps Toward The Integration Of Speaker Recognition In Real-World Telecom Applications

Authors:

Axel Glaeser, Ascom AG (Switzerland)
Frédéric Bimbot, ENST - Dept. SIGNAL (France)

Page (NA) Paper number 878

Abstract:

The current market shows significant interest in speaker recognition functionality for telecommunication systems (e.g. phone banking). This paper presents a field-test assessment of a speaker recognition algorithm in a realistic context. Such field tests are particularly useful because the requirements of real-world systems can differ significantly from those addressed by research laboratories. The results presented in this paper are therefore divided into two groups. The quantitative results describe the performance achieved, in terms of Equal Error Rate (EER), as a function of the field-test conditions and of different limitations on enrollment and test duration. The qualitative results are innovative outcomes based mainly on the non-technical, subjective impressions reported by the participants.
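
A note for readers unfamiliar with the metric: the Equal Error Rate is the operating point at which the false-acceptance and false-rejection rates coincide. A minimal Python sketch (illustrative only, with made-up scores; not the authors' evaluation code):

```python
# Minimal EER sketch: sweep a decision threshold over two score lists and
# report the point where false acceptance and false rejection balance.
# All score values below are made up for illustration.

def equal_error_rate(genuine, impostor):
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if best is None or abs(far - frr) < abs(best[1] - best[2]):
            best = (t, far, frr)
    t, far, frr = best
    return (far + frr) / 2, t

eer, thr = equal_error_rate(genuine=[0.9, 0.8, 0.75, 0.6],
                            impostor=[0.7, 0.4, 0.3, 0.2])
print(f"EER ~ {eer:.1%} at threshold {thr}")
```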

SL980878.PDF (From Author) SL980878.PDF (Rasterized)

A Bimodal Korean Address Entry/Retrieval System

Authors:

Hyun-Yeol Chung, Department of Information and Communication Engineering, Yeungnam University (Korea)
Cheol-Jun Hwang, Department of Information and Communication Engineering, Yeungnam University (Korea)
Shi-Wook Lee, Department of Information and Communication Engineering, Yeungnam University (Korea)

Page (NA) Paper number 460

Abstract:

This paper describes the development of a Korean address entry/retrieval system with bimodal input: speech recognition and a touch-sensitive display. The system runs on a personal computer and employs automatic speech recognition and a touch-sensitive display as the user interface for entering Korean addresses, drawn from a vocabulary of about 40,000 words. To meet the requirement that a practical speech recognition system work in real time without any loss of recognition accuracy, (1) speaker and environmental adaptation by maximum a posteriori (MAP) estimation was adopted for higher recognition accuracy, and (2) a fast search using a tree-structured lexicon and a frame-synchronous beam search was employed for real-time response. To offer a more convenient user interface, a touch-sensitive display was also implemented. As a result, the system responds within 3 seconds of the completion of an address utterance, with sentence recognition accuracy above 96%.
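
The fast-search idea in (2) can be illustrated with a toy tree-structured lexicon: words sharing phone prefixes share trie nodes, so a frame-synchronous beam search expands each prefix only once. A minimal Python sketch with invented phone strings (not the system's 40,000-word vocabulary):

```python
# Toy tree-structured lexicon: words sharing phone prefixes share trie
# nodes, so a frame-synchronous search expands each prefix only once.
# The phone transcriptions are illustrative placeholders.

def build_trie(lexicon):
    root = {}
    for word, phones in lexicon.items():
        node = root
        for p in phones:
            node = node.setdefault(p, {})
        node["#word"] = word  # word-end marker at the leaf
    return root

lexicon = {
    "seoul":   ["s", "eo", "u", "l"],
    "suwon":   ["s", "u", "w", "o", "n"],
    "sunchon": ["s", "u", "n", "ch", "o", "n"],
}
trie = build_trie(lexicon)
print(sorted(trie))        # ['s']        -- one shared root arc
print(sorted(trie["s"]))   # ['eo', 'u']  -- 'u' shared by two words
```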

SL980460.PDF (From Author) SL980460.PDF (Rasterized)

Usability Evaluation of IVR Systems With DTMF and ASR

Authors:

Cristina Delogu, Fondazione Ugo Bordoni (Italy)
Andrea Di Carlo, Fondazione Ugo Bordoni (Italy)
Paolo Rotundi, Fondazione Ugo Bordoni (Italy)
Danilo Sartori, Fondazione Ugo Bordoni (Italy)

Page (NA) Paper number 320

Abstract:

The paper presents a usability evaluation of four different prototypes of the same IVR application: three based on automatic speech recognition (ASR) and one based on Dual Tone Multi Frequency (DTMF) input. Our work automates a service providing information about new facilities offered by Telecom Italia. The usability of the different prototypes was evaluated through objective and subjective measures. Objective measures such as task completion and correctness, number of calls per task, transaction time, number of turns, and recognition accuracy were obtained from the system's logfiles and the recorded speech utterances. To gather subjective measures, the users were asked to fill in a questionnaire about their perception of the quality of the overall interaction, their effort in interacting with the system, and their satisfaction with different features of the system. In general, a good correspondence between objective and subjective measures was found.
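
For illustration, objective measures of this kind can be derived directly from per-call log records. A minimal Python sketch; the log fields and values are hypothetical, not Fondazione Ugo Bordoni's actual logfile format:

```python
# Hypothetical per-call log records and the objective measures derived
# from them; field names and values are invented for illustration.

calls = [
    {"task": "tariff_info",  "completed": True,  "turns": 6, "seconds": 74},
    {"task": "tariff_info",  "completed": False, "turns": 9, "seconds": 121},
    {"task": "new_services", "completed": True,  "turns": 4, "seconds": 52},
]

n = len(calls)
completion = sum(c["completed"] for c in calls) / n
mean_turns = sum(c["turns"] for c in calls) / n
mean_time  = sum(c["seconds"] for c in calls) / n
print(f"task completion {completion:.0%}, {mean_turns:.1f} turns/call, "
      f"{mean_time:.0f} s/transaction")
```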

SL980320.PDF (From Author) SL980320.PDF (Rasterized)

SALSA Version 1.0: A Speech-Based Web Browser for Hong Kong English

Authors:

Pascale Fung, Human Language Technology Center (HLTC), University of Science and Technology (HKUST) (Hong Kong)
Chi Shun Cheung, Human Language Technology Center (HLTC), University of Science and Technology (HKUST) (Hong Kong)
Kwok Leung Lam, Human Language Technology Center (HLTC), University of Science and Technology (HKUST) (Hong Kong)
Wai Kat Liu, Human Language Technology Center (HLTC), University of Science and Technology (HKUST) (Hong Kong)
Yuen Yee Lo, Human Language Technology Center (HLTC), University of Science and Technology (HKUST) (Hong Kong)

Page (NA) Paper number 942

Abstract:

In this paper, we present a prototype speech-based Web browser, SALSA 1.0, and describe some of the research issues we needed to address while building this system for Hong Kong users. SALSA 1.0 allows the user to speak English command words as well as partial or complete link names on any page. The main research issues in building SALSA 1.0 are (1) how to handle large accent variations and mixed-language speech, and (2) how to handle unknown words, especially proper names, in Web links. The recognition engine for SALSA 1.0 is trained on WSJ data and then retrained on a small amount of Hong Kong-accented WSJ data to handle accent variations. An edit-distance algorithm is used to replace each unknown word with the closest known word in the word network for recognition. With these methods, the link-name recognition rate is 91.20% for links without unknown words and 82.40% for links with unknown words. SALSA is currently being developed into a multilingual, natural language-based Intranet service provider for HKUST campus information access.
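
The unknown-word substitution step can be illustrated in a few lines: compute the Levenshtein edit distance from an out-of-vocabulary word to every in-vocabulary word and keep the nearest. A minimal Python sketch with an invented vocabulary (the real system uses the WSJ lexicon and operates on the recognition word network):

```python
# Map an out-of-vocabulary word to the closest in-vocabulary word by
# Levenshtein edit distance. The vocabulary here is invented.

def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

vocab = ["kowloon", "campus", "news", "weather"]

def closest_known(word):
    return min(vocab, key=lambda v: edit_distance(word, v))

print(closest_known("kawloon"))  # -> 'kowloon'
```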

SL980942.PDF (From Author) SL980942.PDF (Rasterized)

A Language for Creating Speech Applications

Authors:

Andrew Pargellis, Bell Labs, Lucent Technologies (USA)
Qiru Zhou, Bell Labs, Lucent Technologies (USA)
Antoine Saad, Bell Labs, Lucent Technologies (USA)
Chin-Hui Lee, Bell Labs, Lucent Technologies (USA)

Page (NA) Paper number 388

Abstract:

This paper describes an embedded Voice Interface Language (VIL) that enables the rapid prototyping and creation of applications requiring a voice interface. It can be integrated into popular script languages such as Perl or Tcl/Tk. Three levels of single-word commands enable application designers to access basic speech processing technologies, such as automatic speech recognition and text-to-speech functions, without knowing the details of the underlying technology. VIL is a platform- and domain-independent speech application programming interface (API) that enables users to add a speech interface to their applications. The domain-dependent components are defined by including a set of application-specific arguments with each VIL command. Since the platform is an open architecture system, third-party speech processing components may also be integrated into the platform and accessed by VIL.
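
To convey the flavour of embedding speech commands in a host script (the paper targets Perl and Tcl/Tk), here is a hedged Python mock-up; the class and method names are invented for illustration and are not the actual VIL commands:

```python
# A mock of embedding single-word speech commands in a host script.
# Class and method names are invented for illustration; they are not
# the actual VIL command set.

class MockVoiceInterface:
    def speak(self, text):
        """Stand-in for a text-to-speech command."""
        print(f"[TTS] {text}")

    def listen(self, grammar):
        """Stand-in for a recognition command constrained by a grammar."""
        return grammar[0]  # stub: pretend the first phrase was heard

vi = MockVoiceInterface()
vi.speak("Which service do you want?")
choice = vi.listen(["banking", "weather", "quit"])
vi.speak(f"You selected {choice}.")
```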

SL980388.PDF (From Author) SL980388.PDF (Rasterized)

The Use of Automatic Speech Recognition to Reduce the Interference Between Concurrent Tasks of Driving and Phoning

Authors:

Robert Graham, HUSAT Research Institute, Loughborough University (U.K.)
Chris Carter, HUSAT Research Institute, Loughborough University (U.K.)
Brian Mellor, Speech Research Unit, The Defence Evaluation and Research Agency (DERA) Malvern (U.K.)

Page (NA) Paper number 516

Abstract:

Previous research has found that using manually operated mobile phones while driving significantly increases the risk of a collision. It has been suggested that automatic speech recognition (ASR) interfaces may reduce the interference between the tasks of phoning and driving. A laboratory experiment was designed to examine this hypothesis and to investigate the optimal design for in-car ASR systems. Forty-eight participants dialled phone numbers from memory while carrying out a concurrent tracking task. Tracking performance was adversely affected while using a manual phone. This effect was significantly reduced, although not eliminated, with a speech phone. Participants also perceived the mental workload of manual dialling while driving to be greater than that of speech dialling. Audio feedback was found to be marginally preferable to combined audio-plus-visual feedback. The recognition accuracy of the ASR device did not appear to have any significant bearing on driving performance or acceptance. The results are encouraging for the use of speech interfaces in the car for phone and other functions.

SL980516.PDF (From Author) SL980516.PDF (Rasterized)

Interactive Listening to Structured Speech Content on the Internet

Authors:

Makoto J. Hirayama, Hewlett-Packard Laboratories Japan (Japan)
Taro Sugahara, Hewlett-Packard Laboratories Japan (Japan)
Zhiyong Peng, Hewlett-Packard Laboratories Japan (Japan)
Junichi Yamazaki, Hewlett-Packard Laboratories Japan (Japan)

Page (NA) Paper number 1058

Abstract:

Interactive information browsing on the World Wide Web is a key application of the Internet, and visual Web browsers are widely used to access information. However, visual Web browsing is not suitable in some circumstances, such as mobile environments. We therefore propose interactive listening to structured speech content as a way of accessing information. In our proposed model of interactive information listening services, structured audio content (HyperAudio) on a HyperAudio server is listened to using a HyperAudio player whose appearance is similar to a portable radio. Unlike radio broadcast programs, HyperAudio content has logical structure and hyperlinks, so listeners can reach the desired information interactively. To put this logical structure into audio, a simple markup language was used. A prototype HyperAudio server and player was implemented to test and evaluate the feasibility and usability of the HyperAudio architecture.
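
To illustrate what a simple markup language for structured audio might look like, here is a hedged Python sketch; the tag names are invented, since the paper's actual HyperAudio markup is not reproduced here:

```python
# A toy structured-audio document and a walk over its logical structure.
# The tag names are invented; the paper's actual markup is not shown here.

import xml.etree.ElementTree as ET

doc = """\
<hyperaudio>
  <section title="Headlines">
    <clip src="headline1.wav"/>
    <link href="story1.ha">More on this story</link>
  </section>
</hyperaudio>
"""

root = ET.fromstring(doc)
for sec in root.iter("section"):
    print("section:", sec.get("title"))
    for clip in sec.iter("clip"):
        print("  play   ->", clip.get("src"))
    for link in sec.iter("link"):
        print("  follow ->", link.get("href"), f"({link.text})")
```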

SL981058.PDF (From Author) SL981058.PDF (Rasterized)

MSF Format For The Representation Of Speech Synchronized Moving Image

Authors:

Cheol-Woo Jo, Changwon National University (Korea)

Page (NA) Paper number 692

Abstract:

This paper describes the structure of a new multimedia file format and the procedures for implementing its encoder and player. The Multimedia Sound File (MSF) format reduces file size, and the display software improves on similar current programs in that it requires only a small image database where they require a huge one. The tool can effectively display animated facial images together with speech sounds, in synchronized form, even on a PC. The implemented tool can be used as a plug-in or in stand-alone form. Encoder software was implemented to facilitate the production of MSF files; files obtained by segmenting the speech signal into phonemic units are used as input to the encoder.
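
As a hedged illustration of the general idea of a synchronized sound-plus-image container (the actual MSF byte layout is defined in the paper and not reproduced here), a toy frame could pack an image index together with its audio chunk:

```python
# Toy container frame: an image index to display plus the 16-bit audio
# chunk to play while it is shown. The byte layout is invented.

import struct

def pack_frame(image_id, audio_samples):
    header = struct.pack("<HI", image_id, len(audio_samples))
    body = struct.pack(f"<{len(audio_samples)}h", *audio_samples)
    return header + body

frame = pack_frame(image_id=3, audio_samples=[0, 120, -80, 64])
image_id, n_samples = struct.unpack_from("<HI", frame)
print(image_id, n_samples)  # 3 4
```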

SL980692.PDF (From Author) SL980692.PDF (Rasterized)

0692_XTR.ZIP
(was: WinMSF.exe)
MSF player; reads and displays multimedia speech files with a lip-synchronized animated face
File type: Executable Program File
Format: Executable Program File: MS-Windows 32-bit
Tech. description: None
Creating Application: Unknown
Creating OS: Unknown
0692_XTR.ZIP
(was: Man_1106.img)
Image data library of a man's facial images
File type: Image File
Format: OTHER
Tech. description: Resolution 141 by 141, 8 bits per pixel
Creating Application: Unknown
Creating OS: MS-Windows95
0692_XTR.ZIP
(was: Man.pal)
Palette data for images
File type: OTHER
Format: OTHER
Tech. description: 256-color palette used for demo images
Creating Application: Unknown
Creating OS: MS-Windows95
0692_XTR.ZIP
(was: changwon.msf)
MSF file for Korean 'Changwon University' /chang won dae hak kyo/
File type: OTHER
Format: OTHER
Tech. description: MSF multimedia file, 141 by 141; sound: sampling rate 16 kHz, 16-bit, mono
Creating Application: Unknown
Creating OS: MS-Windows95
0692_XTR.ZIP
(was: Spring1.msf)
MSF file for a Korean lyric song
File type: OTHER
Format: OTHER
Tech. description: MSF multimedia file, 141 by 141; sound: sampling rate 16 kHz, 16-bit, mono
Creating Application: Unknown
Creating OS: MS-Windows95

Effects of Using Speech in Timetable Information Systems for WWW

Authors:

Pernilla Qvarfordt, Department of Computer and Information Science, Linköping University (Sweden)
Arne Jönsson, Department of Computer and Information Science, Linköping University (Sweden)

Page (NA) Paper number 477

Abstract:

Designing information systems in which spatial and temporal information is merged, and can be accessed through various modalities, requires careful examination of how to combine the communication modalities to achieve efficient interaction. In this paper we present ongoing work on designing a multimodal interface to timetable information for local buses, where the same database information can be accessed by different user categories with varying information needs. The prototype interface was evaluated to investigate how speech contributes to the interaction. The results showed that subjects used a more nearly optimal sequence of actions when using speech, and made fewer errors. We also present suggestions for the future design of multimodal interfaces.

SL980477.PDF (From Author) SL980477.PDF (Rasterized)

The Interactive Systems Labs View4You Video Indexing System

Authors:

Thomas Kemp, ISL, University of Karlsruhe (Germany)
Petra Geutner, ISL, University of Karlsruhe (Germany)
Michael Schmidt, ISL, University of Karlsruhe (Germany)
Borislav Tomaz, ISL, University of Karlsruhe (Germany)
Manfred Weber, ISL, University of Karlsruhe (Germany)
Martin Westphal, ISL, University of Karlsruhe (Germany)
Alex Waibel, ISL, University of Karlsruhe (Germany)

Page (NA) Paper number 759

Abstract:

The recognition of broadcast news is a challenging problem in speech recognition. To achieve the long-term goal of robust, real-time news transcription, several problems have to be overcome, e.g. the variety of acoustic conditions and the unlimited vocabulary. Recently, a number of sites have been working on content-addressable multimedia information sources. In this paper, we focus on extending this work towards a multilingual environment, in which queries and multimedia documents may appear in multiple languages. In cooperation with the Informedia project at CMU, we attempt to provide cross-lingual access to German and Serbo-Croatian newscasts.

SL980759.PDF (From Author) SL980759.PDF (Rasterized)

SEMOLE: A Robust Framework For Gathering Information From The World Wide Web

Authors:

Hyung-Jin Kim, Spoken Language Systems Group - MIT Laboratory for Computer Science (USA)
Lee Hetherington, Spoken Language Systems Group - MIT Laboratory for Computer Science (USA)

Page (NA) Paper number 1076

Abstract:

This paper describes seMole (se-mantic Mole), a robust framework for harvesting information from the World Wide Web. Unlike commercially available harvesting programs that use absolute addressing, seMole uses a semantic addressing scheme to gather information from HTML pages. Instead of relying on the HTML structure to locate data, semantic addressing relies on the relative position of key/value pairs. This scheme abstracts away from the underlying HTML structure of Web pages, allowing information gathering to depend only on the content of pages, which in large part does not change over time. We use this framework to gather information from various data sources, including Boston Sidewalk and the CNN weather site. Through these experiments we find that seMole is more robust to changes in the Web sites, and simpler to use and maintain, than systems that use absolute addressing.
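
The semantic-addressing idea can be sketched in a few lines: find a value by the key label that precedes it in the rendered text, ignoring the HTML tree. A minimal Python illustration with an invented page snippet (seMole's real matching is more general):

```python
# Locate a value by the key label that precedes it in the page text,
# not by its position in the HTML tree. The snippet is invented.

import re

page_text = "Boston  Conditions: Partly cloudy  Temperature: 57 F  Wind: NW 10"

def value_after(key, text):
    m = re.search(re.escape(key) + r"\s*:?\s*(\S+(?:\s\S+)?)", text)
    return m.group(1) if m else None

print(value_after("Temperature", page_text))  # '57 F'
# A site redesign that keeps the label leaves this lookup intact.
```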

SL981076.PDF (From Author) SL981076.PDF (Rasterized)

Enhancing a WIMP Based Interface With Speech, Gaze Tracking and Agents

Authors:

Lau Bakman, Aalborg University (Denmark)
Mads Blidegn, Aalborg University (Denmark)
Martin Wittrup, Aalborg University (Denmark)
Lars Bo Larsen, Aalborg University (Denmark)
Thomas B. Moeslund, Aalborg University (Denmark)

Page (NA) Paper number 766

Abstract:

This paper describes an attempt to enhance a windows-based (WIMP: Windows, Icons, Menus, Pointer) environment. The goal is to establish whether user interaction on the common desktop PC can be augmented by adding new modalities to the WIMP interface, thus bridging the gap between today's interaction patterns and future interfaces comprising, e.g., advanced conversational capabilities and VR technology. A user survey was carried out to establish the trouble spots of the WIMP interface on the most common desktop workstation, the Windows 95 PC. On this basis, a number of new modalities were considered. Spoken input and output and gaze tracking were selected, together with the concept of an interface agent, for further investigation. A system was developed to control the interaction of the input and output modalities, and a set of five scenarios was constructed to test the proposed ideas. In these, a number of test subjects used the existing and added modalities in various configurations.

SL980766.PDF (From Author) SL980766.PDF (Rasterized)

Now You Hear It, Now You Don't: Empirical Studies of Audio Browsing Behavior

Authors:

Christine H. Nakatani, AT&T Labs (USA)
Steve Whittaker, AT&T Labs (USA)
Julia Hirschberg, AT&T Labs (USA)

Page (NA) Paper number 1003

Abstract:

We present several studies that investigate how people use audio documents and uncover new principles for designing audio navigation technology. In particular, we report on an ethnographic study of voicemail users, exploring the behaviors and needs of users of current voicemail technology. To constrain design choices for better technology, we then study how people navigate through audio and how they perform basic information processing tasks on a voicemail corpus. Observations and analyses from the user experiments lead to new principles of design for audio document interfaces, which are embodied in a prototype structural audio browser. Specifically, we conclude that the reinforcement of audio memory and appropriate definition of content-based playback units are important properties of interfaces suited to human audio processing behaviors.

SL981003.PDF (From Author) SL981003.PDF (Rasterized)

A Voice Verifier for Face/Voice Based Person Verification System

Authors:

Rongyu Qiao, CSIRO (Australia)
Youngkyu Choi, CSIRO (Australia)
Johnson I. Agbinya, CSIRO (Australia)

Page (NA) Paper number 307

Abstract:

A person verification system based on voice and facial images has been developed within CSIRO Telecommunications and Industrial Physics, Australia, for use in low-to-medium security systems. It provides a unique ID that is non-intrusive and fast, with no need to memorise passwords. A stand-alone version of the voice verifier has an error rate of less than 8%, while the face verifier has an error rate of less than 5%. By combining the two modules, an error rate of less than 1% is achieved. This paper describes in detail the method and some of the important practical issues in the implementation of the voice verifier. It also addresses the issue of decision making when the two sub-systems produce contradictory results.
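
One simple way to combine two verifiers, shown here purely as an illustration (the weights, threshold, and fusion rule are assumptions, not CSIRO's published method), is a weighted score sum, which resolves contradictory sub-system outputs by letting the more confident one dominate:

```python
# Weighted-sum fusion of two verifier scores. Weights and threshold are
# assumptions for illustration, not the system's tuned values.

def accept(voice_score, face_score, w_voice=0.4, w_face=0.6, threshold=0.5):
    """Scores assumed normalised to [0, 1]; higher = more likely genuine."""
    return w_voice * voice_score + w_face * face_score >= threshold

print(accept(voice_score=0.35, face_score=0.80))  # contradictory -> accept
print(accept(voice_score=0.35, face_score=0.40))  # both weak     -> reject
```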

SL980307.PDF (From Author) SL980307.PDF (Rasterized)

On The Use Of Automatic Speech Recognition For TV Captioning

Authors:

Jordi Robert-Ribes, CSIRO-MIS (Australia)

Page (NA) Paper number 621

Abstract:

This study analyses the possible use of automatic speech recognition (ASR) for the automatic captioning of TV programs. Captioning requires (1) transcribing the spoken words and (2) determining the times at which each caption has to appear and disappear on the screen; these times have to match the corresponding times in the audio signal as closely as possible. Automatic speech recognition can be used to determine both: the spoken words and their times. This paper focuses on the question: would a perfect automatic speech recognition system be able to automate the captioning process? We present quantitative data on the discrepancy between the audio signal and the manually generated captions, show how ASR alone can even lower the efficiency of captioning, and present the techniques needed to automate the captioning process.
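
The discrepancy measurement can be illustrated as follows: compare each caption's on/off times with the start and end times of the same words in the audio (as a forced alignment or an ASR system would supply). All times below are invented:

```python
# Compare manually produced caption times with the times the same words
# occupy in the audio. All times are invented for illustration.

captions = [  # (on_screen, off_screen, text), seconds
    (1.0, 4.0, "good evening and welcome"),
    (4.0, 7.5, "our top story tonight"),
]
spoken = [  # (start, end) of the same word strings in the audio
    (1.4, 3.2),
    (4.9, 7.1),
]

for (on, off, text), (start, end) in zip(captions, spoken):
    print(f"{text!r}: shown {start - on:.1f} s before the speech starts, "
          f"held {off - end:.1f} s after it ends")
```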

SL980621.PDF (From Author) SL980621.PDF (Rasterized)

An Undergraduate Course on Speech Recognition Based on the CSLU Toolkit

Authors:

Ben Serridge, Universidad de las Americas (Mexico)

Page (NA) Paper number 925

Abstract:

This paper describes an undergraduate course in speech recognition, based on the CSLU Toolkit, which was taught at the Universidad de las Américas in Puebla, México. Throughout the course, laboratory assignments based on the toolkit guided students through the process of creating a recognizer, while in-class lectures consistently referred to the architecture of the toolkit as a concrete example of an existing system. The class was organized so that lectures and laboratory assignments followed the steps taken in the creation of a new recognizer. The students first recorded and labeled their own corpus, then proceeded to design and train neural network based recognizers, before finally testing for performance and creating sample applications. As a final project, students performed simple, well-defined experiments using the recognizers they had constructed. The CSLU Toolkit is freely available for non-commercial use from http://cslu.cse.ogi.edu/. In the future, similar courses based on the toolkit could be created and shared by many researchers in the speech community via the world-wide web.

SL980925.PDF (From Author) SL980925.PDF (Rasterized)

Real Time Voice Alteration Based on Linear Prediction

Authors:

Ping-Fai Yang, AT&T Laboratories -- Research (USA)
Yannis Stylianou, AT&T Laboratories -- Research (USA)

Page (NA) Paper number 814

Abstract:

In this paper we present the application of a set of voice alteration algorithms based on Linear Prediction (LP). We survey some potential application areas of voice alteration technology and argue that near-real-time performance is a critical requirement for many of them. One benefit of our algorithms is their simplicity, and therefore the feasibility of implementing them in a real-time system. To this end, we built an experimental platform on a personal computer. We also present our implementation and user experience from this effort.
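
For readers who want the basic mechanics, here is a minimal LP pipeline in Python with numpy. It is a sketch of the standard textbook approach, not the authors' algorithms: estimate an all-pole filter, inverse-filter to obtain the residual, alter the excitation, and resynthesize:

```python
# Textbook LP pipeline on a toy signal: estimate an all-pole filter,
# inverse-filter to get the residual, alter the excitation, resynthesize.
# Not the authors' algorithms; numpy only.

import numpy as np

def lpc(x, order):
    """LP coefficients a[1..order] via the autocorrelation method."""
    r = np.correlate(x, x, "full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.3 * np.random.randn(fs)  # toy "speech"
a = lpc(x, order=10)

# Inverse filter: e[n] = x[n] - sum_k a[k] * x[n-k]
e = x.copy()
for k, ak in enumerate(a, 1):
    e[k:] -= ak * x[:-k]

e *= 0.5  # the alteration step: here, simply a softer excitation

# Resynthesis through the same all-pole filter.
y = np.zeros_like(x)
for n in range(len(x)):
    y[n] = e[n] + sum(ak * y[n - k] for k, ak in enumerate(a, 1) if n >= k)
print("altered signal rms:", float(np.sqrt(np.mean(y ** 2))))
```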

SL980814.PDF (From Author) SL980814.PDF (Rasterized)

Evaluation and Implementation of a Voice-Activated Dialing System with Utterance Verification

Authors:

Beng Tiong Tan, Vocalis Ltd. (U.K.)
Yong Gu, Vocalis Ltd. (U.K.)
Trevor Thomas, Vocalis Ltd. (U.K.)

Page (NA) Paper number 480

Abstract:

This study investigates utterance verification (UV) algorithms for a voice-activated dialing (VAD) system. UV techniques help to improve the accuracy of a VAD system and the efficiency of its user interface by reducing the need for confirmation. In this paper, we examine various UV methods, namely the all-phone garbage model (GM), the N-best likelihood ratio (NBLR), and combined methods. The performance of a VAD system with UV is studied at various vocabulary sizes. By rejecting 9.5% of correctly recognized names, the system error rate becomes less than 0.3%, which represents a 91% reduction in error rate over the baseline system. The UV technique can reduce the number of confirmations by at least 88% at a system error rate of 0.28%.
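
The core of such UV methods is a likelihood-ratio test: accept the recognised name only if its model score beats a garbage/background model by a margin. A minimal sketch with made-up log-likelihoods (the paper's GM and NBLR variants differ in how the alternative score is obtained):

```python
# Accept a recognised name only if its model score beats a garbage model
# by a per-frame margin. Log-likelihood values are made up.

def verify(loglik_name, loglik_garbage, n_frames, threshold=0.5):
    llr = (loglik_name - loglik_garbage) / n_frames
    return llr >= threshold, round(llr, 2)

print(verify(-420.0, -505.0, n_frames=120))  # (True, 0.71) -> dial directly
print(verify(-480.0, -505.0, n_frames=120))  # (False, 0.21) -> confirm first
```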

SL980480.PDF (From Author) SL980480.PDF (Rasterized)

Towards a Mandarin Voice Memo System

Authors:

Hsin-Min Wang, Academia Sinica (Taiwan)
Bor-Shen Lin, National Taiwan University (Taiwan)
Berlin Chen, Academia Sinica (Taiwan)
Bo-Ren Bai, National Taiwan University (Taiwan)

Page (NA) Paper number 192

Abstract:

Using voice memos instead of text memos is believed to be more natural, convenient, and attractive. This paper presents a working Mandarin voice memo system that provides automatic notification and voice retrieval functions. The main techniques are a content-based spoken document retrieval approach and a date-time expression detection and understanding approach. Extensive preliminary experiments were performed, and encouraging results are demonstrated.
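
Date-time expression detection can be illustrated with a toy pattern matcher over the recognised memo text. The sketch below uses invented English patterns; the actual system handles Mandarin date-time expressions:

```python
# Toy date-time expression detector over recognised memo text.
# English patterns invented for illustration; the system's rules are
# for Mandarin expressions.

import re
from datetime import datetime

def find_datetime(text, year=1998):
    m = re.search(r"(\w+) (\d{1,2})(?:st|nd|rd|th)? at (\d{1,2})(am|pm)", text)
    if not m:
        return None
    month, day, hour, half = m.groups()
    hour = int(hour) % 12 + (12 if half == "pm" else 0)
    return datetime.strptime(f"{year} {month} {day} {hour}", "%Y %B %d %H")

memo = "remind me about the lab meeting on December 3rd at 2pm"
print(find_datetime(memo))  # 1998-12-03 14:00:00
```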

SL980192.PDF (From Author) SL980192.PDF (Rasterized)
