Development of CAI System Employing Synthesized Speech Responses
Authors:
Tsubasa Shinozaki, NTT Human Interface Labs. (Japan)
Page (NA) Paper number 409
Abstract: This paper proposes a Computer Assisted Instruction (CAI) system that teaches students how to write Japanese characters. The most important feature of the system is its use of synthesized speech to interact with users. The CAI system has a video display tablet interface: a user traces the pattern of a character with the tablet pen, and the trace is shown on the display as it is drawn. When the trace line strays outside the pattern, the system immediately outputs synthesized speech to correct the error. To design strategies for generating instructions, the behavior and instruction messages of a human teacher were recorded and analyzed. One of the most interesting features of the system is a function that changes the "personality" of the teacher, for example a strict teacher, a friendly teacher, or a short-tempered teacher. The experimental results confirmed that the proposed system can convey a particular impression using synthesized speech.
0409_01.WAV (was: 0409_01.wav) | 0409_02.WAV (was: 0409_02.wav) | 0409_03.WAV (was: 0409_03.wav)
The instruction messages (fourteen in total) of each set are synthesized speech whose prosodic parameters were manually determined to express a good, bad, or neutral mood, respectively (good mood [SOUND 0409_1.WAV], bad mood [SOUND 0409_2.WAV], neutral mood [SOUND 0409_3.WAV]).
File type: Sound | File format: WAV | Tech. description: 11025 Hz, 16-bit, mono, PCM | Creating application: Sound Forge 4.0 | Creating OS: Windows 95
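The tracing check described in the abstract (spoken correction when the pen strays outside the character pattern) can be sketched as a point-to-polyline distance test. The function names and the pixel tolerance below are illustrative assumptions, not the paper's actual implementation:

```python
from math import hypot

# Hypothetical sketch of the tracing check: a pen sample is "off the
# pattern" when its distance to the nearest stroke segment exceeds a
# tolerance, which would trigger a spoken correction.

def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    # Project p onto the segment, clamping to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def off_pattern(sample, stroke, tolerance=5.0):
    """True when the pen sample is more than `tolerance` pixels away
    from every segment of the character stroke (a polyline)."""
    return all(point_segment_distance(sample, a, b) > tolerance
               for a, b in zip(stroke, stroke[1:]))

stroke = [(0, 0), (100, 0)]           # a single horizontal stroke
print(off_pattern((50, 3), stroke))   # within tolerance -> False
print(off_pattern((50, 12), stroke))  # strayed 12 px    -> True
```

In a full system this check would run on every pen sample, and a sustained off-pattern run (rather than a single noisy sample) would trigger the synthesized instruction message.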
Andreas Kellner, Philips Research Laboratories Aachen (Germany)
Bernhard Rueber, Philips Research Laboratories Aachen (Germany)
Hauke Schramm, Philips Research Laboratories Aachen (Germany)
Directory assistance systems are among the most challenging applications of speech recognition. Today, complete automation of the service fails because of the insufficient accuracy of current speech recognizers, which are simply not able to differentiate between the hundreds of thousands or even millions of different names occurring in large cities. In this paper, we show that this situation can be remedied by systematically combining all available knowledge sources (last names, first names, street names, partly including their spelled versions) in a statistically optimal way. Specially designed confidence measures for N-best lists are proposed to detect misrecognized turns. Applying these techniques in a hierarchical setup is judged to be the enabling step for automating large-scale directory assistance. In first experiments, we are able, for example, to service 72% of the inquiries for a database of 1.3 million entries with a remaining error rate of only 6% (or 62% with an error rate of 2%).
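One common way to build a confidence measure over an N-best list is to normalise the top hypothesis' recognition score against all competing entries, giving a posterior-like value, and to flag a turn as likely misrecognized when that value falls below a threshold. The sketch below illustrates this general idea; it is not the paper's exact formulation, and the score scale and threshold are assumptions:

```python
from math import exp

def nbest_confidence(log_scores, scale=1.0):
    """Posterior-style confidence of the best entry in an N-best list,
    computed as a softmax over the (log-domain) recognition scores."""
    m = max(log_scores)
    # Subtract the max before exponentiating for numerical stability.
    weights = [exp(scale * (s - m)) for s in log_scores]
    return weights[log_scores.index(m)] / sum(weights)

def accept_turn(log_scores, threshold=0.6):
    """Accept the turn only when the best hypothesis clearly dominates."""
    return nbest_confidence(log_scores) >= threshold

print(accept_turn([-10.0, -30.0, -32.0]))  # clear winner      -> True
print(accept_turn([-10.0, -10.2, -10.4]))  # close competitors -> False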
Bruce Buntschuh, AT&T Labs - Research (USA)
Candace A. Kamm, AT&T Labs - Research (USA)
Giuseppe Di Fabbrizio, AT&T Labs - Research (USA)
Alicia Abella, AT&T Labs - Research (USA)
Mehryar Mohri, AT&T Labs - Research (USA)
Shrikanth Narayanan, AT&T Labs - Research (USA)
I. Zeljkovic, AT&T Labs - Research (USA)
R.D. Sharp, AT&T Labs - Research (USA)
Jeremy H. Wright, AT&T Labs - Research (USA)
S. Marcus, International Asset Systems Ltd (USA)
J. Shaffer, Mississippi State University (USA)
R. Duncan, Mississippi State University (USA)
J.G. Wilpon, AT&T Labs - Research (USA)
VPQ (Voice Post Query) is a dialog system for spoken access to information in the AT&T personnel database (>120,000 entries). An explicit design goal is for the initial interaction with the system to be rather unconstrained, relying on tighter, prompt-constrained dialog only when absolutely necessary. The purpose of VPQ is to explore and exploit the capabilities of state-of-the-art speech recognition systems for this high-perplexity task, and to develop the natural language understanding and dialog control components necessary for effective and efficient user interactions. The VPQ task includes simple interactions, where the initial request is unambiguous and the system's response provides the desired information or completes a call to the requested person, as well as more complex interactions where ambiguities or errors require dialog-driven resolution. This paper highlights the inherent challenges of this task, the major components of the system, the rationale for their design, and how they perform.
John Choi, AT&T Labs - Research (USA)
Don Hindle, AT&T Labs - Research (USA)
Julia Hirschberg, AT&T Labs - Research (USA)
Ivan Magrin-Chagnolleau, AT&T Labs - Research (USA)
Christine H. Nakatani, AT&T Labs - Research (USA)
Fernando Pereira, AT&T Labs - Research (USA)
Amit Singhal, AT&T Labs - Research (USA)
Steve Whittaker, AT&T Labs - Research (USA)
SCAN (Speech Content based Audio Navigator) is a spoken document retrieval system that integrates speaker-independent, large-vocabulary speech recognition with information retrieval to support query-based retrieval of information from speech archives. Initial development focused on applying SCAN to the broadcast news domain. This paper provides an overview of the system, including a description of its graphical user interface, which incorporates machine-generated speech transcripts to provide local contextual navigation and random access for browsing large speech databases.
Javier Ferreiros, GTH-IEL-UPM (Spain)
José Colás, GTH-IEL-UPM (Spain)
Javier Macías-Guarasa, GTH-IEL-UPM (Spain)
Alejandro Ruiz, GTH-IEL-UPM (Spain)
José Manuel Pardo, GTH-IEL-UPM (Spain)
In this paper we present a speech understanding system that accepts continuous speech sentences as input to command a HIFI set. The string of words obtained from the recogniser is sent to the understanding system, which tries to fill in a set of frames specifying the triplet (SUBSYSTEM, PARAMETER, VALUE). All circumstances (understanding incompleteness, HIFI set status, result of the command execution) are confirmed back to the user via a text-to-speech system with messages generated from patterns of substitutable concepts. The understanding engine is based on semantic-like tagging, including a "garbage" tag, and on context-dependent rules for meaning extraction. The system allows the application developer to follow the reasoning process (as every understanding rule has an associated concept pattern), spoken aloud by the speech generation module. The concepts used for speech generation are randomly substituted with alternative expressions having the same meaning, giving the response speech a certain degree of naturalness.
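The tag-and-fill scheme in the abstract can be sketched as a lexicon lookup that assigns semantic-like tags (with a "garbage" tag for unknown words) and fills the (SUBSYSTEM, PARAMETER, VALUE) frame. The tag inventory and lexicon below are assumptions for illustration, not the paper's actual rules:

```python
# Hypothetical lexicon mapping recognised words to frame slots.
LEXICON = {
    "cd": ("SUBSYSTEM", "CD_PLAYER"),
    "tuner": ("SUBSYSTEM", "TUNER"),
    "volume": ("PARAMETER", "VOLUME"),
    "up": ("VALUE", "INCREASE"),
    "down": ("VALUE", "DECREASE"),
}

def understand(words):
    """Tag each word and fill the command frame.
    Words absent from the lexicon receive the "garbage" tag and are
    simply ignored when filling slots."""
    frame = {"SUBSYSTEM": None, "PARAMETER": None, "VALUE": None}
    for w in words:
        slot, value = LEXICON.get(w.lower(), ("GARBAGE", w))
        if slot in frame:
            frame[slot] = value
    return frame

print(understand("please turn the cd volume up".split()))
# -> {'SUBSYSTEM': 'CD_PLAYER', 'PARAMETER': 'VOLUME', 'VALUE': 'INCREASE'}
```

A slot left at `None` (understanding incompleteness) is exactly the situation the paper's system confirms back to the user through the text-to-speech module.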
Lori F. Lamel, LIMSI/CNRS (France)
Samir Bennacef, LIMSI/CNRS (France)
Jean-Luc Gauvain, LIMSI/CNRS (France)
Herve Dartigues, SNCF (France)
Jean-Noel Temem, SNCF (France)
In this paper we report on a series of user trials carried out to assess the performance and usability of the MASK prototype kiosk. The aim of the ESPRIT Multimodal Multimedia Service Kiosk (MASK) project was to pave the way for more advanced public service applications with user interfaces employing multimodal, multimedia input and output. The prototype kiosk was developed, in close collaboration with the French Railways (SNCF) and the Ergonomics group at UCL, after an analysis of the technological requirements in the context of users and the tasks they perform in carrying out travel enquiries. The time to complete a transaction with the MASK kiosk is reduced by about 30% compared to that required for the standard kiosk, and the success rate is 85% for novices and 94% for users familiar with the system. In addition to meeting or exceeding the performance goals set at the project outset in terms of success rate, transaction time, and user satisfaction, the MASK kiosk was judged to be user-friendly and simple to use.