Steps Toward The Integration Of Speaker Recognition In Real-World Telecom Applications

Authors:
Axel Glaeser, Ascom AG (Switzerland)
Page (NA) Paper number 878

Abstract: The current market situation is characterized by significant interest in speaker recognition functionality in telecommunication systems (e.g. phone banking). This paper presents a field-test assessment of a speaker recognition algorithm in a realistic context. Such field tests are particularly useful because the requirements of real-world systems can differ significantly from those addressed in research laboratories. The results presented in this paper are therefore divided into two groups: quantitative results describe the performance achieved, in terms of Equal Error Rate (EER), as a function of the field-test conditions and of different limits on enrollment and test duration; qualitative results report novel outcomes based mainly on the subjective, non-technical impressions of the participants.
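The abstract reports performance as Equal Error Rates without reproducing the computation; as a reminder, the EER is the operating point at which the false-rejection and false-acceptance rates coincide. A minimal sketch in Python, with hypothetical score distributions (the scoring scale and trial counts are assumptions, not the paper's):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep the decision threshold over all observed scores and return
    the error rate where false rejection and false acceptance meet."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, None
    for t in thresholds:
        frr = np.mean(genuine < t)     # genuine trials rejected
        far = np.mean(impostor >= t)   # impostor trials accepted
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2.0
    return eer

# Hypothetical verification scores: higher means "more like the claimant".
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 0.5, 1000)
impostor = rng.normal(0.0, 0.5, 1000)
print(f"EER = {equal_error_rate(genuine, impostor):.3f}")
```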
0692_XTR.ZIP (was: WinMSF.exe) | MSF player; reads and displays a multimedia speech file with a lip-synchronized animated face. File type: Executable Program. Format: MS-Windows 32-bit executable. Tech. description: None. Creating Application: Unknown. Creating OS: Unknown.
0692_XTR.ZIP (was: Man_1106.img) | Image data library of a man's facial images. File type: Image. Format: OTHER. Tech. description: Resolution 141 by 141, 8 bits per pixel. Creating Application: Unknown. Creating OS: MS-Windows 95.
0692_XTR.ZIP (was: Man.pal) | Palette data for images. File type: OTHER. Format: OTHER. Tech. description: 256-color palette used for the demo images. Creating Application: Unknown. Creating OS: MS-Windows 95.
0692_XTR.ZIP (was: changwon.msf) | MSF file for Korean 'Changwon University' /chang won dae hak kyo/. File type: OTHER. Format: OTHER. Tech. description: MSF multimedia file, 141 by 141; sound: 16 kHz sampling rate, 16-bit, mono. Creating Application: Unknown. Creating OS: MS-Windows 95.
0692_XTR.ZIP (was: Spring1.msf) | MSF file for a Korean lyric song. File type: OTHER. Format: OTHER. Tech. description: MSF multimedia file, 141 by 141; sound: 16 kHz sampling rate, 16-bit, mono. Creating Application: Unknown. Creating OS: MS-Windows 95.
Pernilla Qvarfordt, Department of Computer and Information Science, Linköping University (Sweden)
Arne Jönsson, Department of Computer and Information Science, Linköping University (Sweden)
Design of information systems where spatial and temporal information is merged, and can be accessed using various modalities, requires careful examination of how to combine the communication modalities to achieve efficient interaction. In this paper we present ongoing work on designing a multimodal interface to timetable information for local buses, where the same database information can be accessed by different user categories with varying information needs. The prototype interface was evaluated to investigate how speech contributes to the interaction. The results showed that subjects used a more efficient sequence of actions when using speech, and made fewer errors. We also present suggestions for the future design of multimodal interfaces.
Thomas Kemp, ISL, University of Karlsruhe (Germany)
Petra Geutner, ISL, University of Karlsruhe (Germany)
Michael Schmidt, ISL, University of Karlsruhe (Germany)
Borislav Tomaz, ISL, University of Karlsruhe (Germany)
Manfred Weber, ISL, University of Karlsruhe (Germany)
Martin Westphal, ISL, University of Karlsruhe (Germany)
Alex Waibel, ISL, University of Karlsruhe (Germany)
The recognition of broadcast news is a challenging problem in speech recognition. To achieve the long-term goal of robust, real-time news transcription, several problems have to be overcome, e.g. the variety of acoustic conditions and the unlimited vocabulary. Recently, a number of sites have been working on content-addressable multimedia information sources. In this paper, we focus on extending this work towards a multilingual environment, where queries and multimedia documents may appear in multiple languages. In cooperation with the Informedia project at CMU, we attempt to provide cross-lingual access to German and Serbo-Croatian newscasts.
Hyung-Jin Kim, Spoken Language Systems Group - MIT Laboratory for Computer Science (USA)
Lee Hetherington, Spoken Language Systems Group - MIT Laboratory for Computer Science (USA)
This paper describes seMole (semantic Mole), a robust framework for harvesting information from the World Wide Web. Unlike commercially available harvesting programs that use absolute addressing, seMole uses a semantic addressing scheme to gather information from HTML pages. Instead of relying on the HTML structure to locate data, semantic addressing relies on the relative position of key/value pairs. This scheme abstracts away from the underlying HTML structure of Web pages, allowing information gathering to depend only on the content of the pages, which in large part does not change over time. We use this framework to gather information from various data sources, including Boston Sidewalk and the CNN Weather Site. Through these experiments we find that seMole is more robust to changes in the Web sites, and simpler to use and maintain, than systems that use absolute addressing.
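The abstract does not give seMole's internals; the following toy sketch only illustrates the general idea of semantic addressing: anchor on a known key string in the rendered text and take the next text chunk as its value, so the extraction survives layout changes. The class name, key, and pages are hypothetical, not part of the paper.

```python
from html.parser import HTMLParser

class KeyValueHarvester(HTMLParser):
    """Locate a known key string in the rendered text and treat the next
    non-empty text chunk as its value, ignoring the HTML structure."""
    def __init__(self, key):
        super().__init__()
        self.key = key
        self.value = None
        self._key_seen = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self.value is not None:
            return
        if self._key_seen:
            self.value = text
        elif self.key in text:
            self._key_seen = True

# The same key/value pair survives a layout change from <table> to <div>s.
for page in ('<table><tr><td>High:</td><td>72 F</td></tr></table>',
             '<div><span>High:</span> <b>72 F</b></div>'):
    h = KeyValueHarvester('High:')
    h.feed(page)
    print(h.value)   # -> 72 F in both layouts
```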
Lau Bakman, Aalborg University (Denmark)
Mads Blidegn, Aalborg University (Denmark)
Martin Wittrup, Aalborg University (Denmark)
Lars Bo Larsen, Aalborg University (Denmark)
Thomas B. Moeslund, Aalborg University (Denmark)
This paper describes an attempt to enhance a windows-based (WIMP: Windows, Icons, Menus, Pointer) environment. The goal is to establish whether user interaction on the common desktop PC can be augmented by adding new modalities to the WIMP interface, thus bridging the gap between today's interaction patterns and future interfaces comprising e.g. advanced conversational capabilities, VR technology, etc. A user survey was carried out to establish the trouble spots of the WIMP interface on the most common desktop workstation, the Windows 95 PC. On this basis, a number of new modalities were considered. Spoken input and output and gaze tracking were selected, together with the concept of an interface agent, for further investigation. A system was developed to control the interaction of the input and output modalities, and a set of five scenarios was constructed to test the proposed ideas. In these, a number of test subjects used the existing and added modalities in various configurations.
Christine H. Nakatani, AT&T Labs (USA)
Steve Whittaker, AT&T Labs (USA)
Julia Hirschberg, AT&T Labs (USA)
We present several studies that investigate how people use audio documents and uncover new principles for designing audio navigation technology. In particular, we report on an ethnographic study of voicemail users, exploring the behaviors and needs of users of current voicemail technology. To constrain design choices for better technology, we then study how people navigate through audio and how they perform basic information processing tasks on a voicemail corpus. Observations and analyses from the user experiments lead to new principles of design for audio document interfaces, which are embodied in a prototype structural audio browser. Specifically, we conclude that the reinforcement of audio memory and appropriate definition of content-based playback units are important properties of interfaces suited to human audio processing behaviors.
Rongyu Qiao, CSIRO (Australia)
Youngkyu Choi, CSIRO (Australia)
Johnson I. Agbinya, CSIRO (Australia)
A person verification system based on voice and facial images has been developed within CSIRO Telecommunications and Industrial Physics, Australia, for use in low-to-medium security systems. It provides unique identification that is non-intrusive and fast, with no need to memorise passwords. A stand-alone version of the voice verifier has an error rate of less than 8%, while the face verifier has an error rate of less than 5%. By combining the two modules, an error rate of less than 1% is achieved. This paper describes in detail the method and some of the important practical issues in the implementation of the voice verifier. It also addresses the issue of decision making when the two sub-systems produce contradictory results.
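The abstract leaves the fusion rule unspecified; one common scheme consistent with its description is to keep the sub-systems' common decision when they agree and fall back on a weighted score sum when they contradict. The weights and thresholds below are illustrative assumptions, not the authors' values.

```python
def fuse(voice_score, face_score, t_voice=0.5, t_face=0.5,
         w_voice=0.4, w_face=0.6, t_fused=0.5):
    """Combine two verifier scores (each normalised to [0, 1]).

    When the two sub-systems agree, return their common decision; when
    they contradict, decide on a weighted sum of the raw scores.
    All thresholds and weights are illustrative, not the paper's.
    """
    voice_ok = voice_score >= t_voice
    face_ok = face_score >= t_face
    if voice_ok == face_ok:
        return voice_ok
    fused = w_voice * voice_score + w_face * face_score
    return fused >= t_fused

print(fuse(0.8, 0.9))   # both accept -> True
print(fuse(0.3, 0.9))   # contradictory -> resolved by the fused score
```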
Jordi Robert-Ribes, CSIRO-MIS (Australia)
This study analyses the possible use of automatic speech recognition (ASR) for the automatic captioning of TV programs. Captioning requires (1) transcribing the spoken words and (2) determining the times at which each caption has to appear and disappear on the screen. These times have to match the corresponding events in the audio signal as closely as possible. Automatic speech recognition can be used to determine both: the spoken words and their times. This paper focuses on the question: would a perfect automatic speech recognition system be able to automate the captioning process? We present quantitative data on the discrepancy between the audio signal and manually generated captions, show how ASR alone can even lower the efficiency of captioning, and present the techniques needed to automate the captioning process.
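For the timing half of the task, one plausible sketch is to group a recogniser's word-level alignment into caption units, starting a new caption at long pauses or when a line fills up. The length and gap limits below are illustrative assumptions; in practice broadcasters apply house style rules, and the paper shows ASR timings alone are not sufficient.

```python
def words_to_captions(words, max_chars=32, max_gap=1.0):
    """Group time-aligned ASR words into caption units.

    `words` is a list of (text, start_sec, end_sec) tuples, e.g. from a
    recogniser's word-level alignment. A new caption starts when the line
    would exceed `max_chars` or after a silence longer than `max_gap`.
    """
    captions, line, start, end = [], [], None, None
    for text, w_start, w_end in words:
        new_len = len(' '.join(line + [text]))
        if line and (new_len > max_chars or w_start - end > max_gap):
            captions.append((' '.join(line), start, end))
            line, start = [], None
        if start is None:
            start = w_start
        line.append(text)
        end = w_end
    if line:
        captions.append((' '.join(line), start, end))
    return captions

words = [('the', 0.0, 0.2), ('weather', 0.2, 0.6), ('today', 0.7, 1.1),
         ('is', 2.5, 2.6), ('mostly', 2.6, 3.0), ('sunny', 3.0, 3.5)]
for text, t0, t1 in words_to_captions(words):
    print(f'{t0:5.2f}-{t1:5.2f}  {text}')
```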
Ben Serridge, Universidad de las Americas (Mexico)
This paper describes an undergraduate course in speech recognition, based on the CSLU Toolkit, which was taught at the Universidad de las Américas in Puebla, México. Throughout the course, laboratory assignments based on the toolkit guided students through the process of creating a recognizer, while in-class lectures consistently referred to the architecture of the toolkit as a concrete example of an existing system. The class was organized so that lectures and laboratory assignments followed the steps taken in the creation of a new recognizer. The students first recorded and labeled their own corpus, then proceeded to design and train neural network based recognizers, before finally testing for performance and creating sample applications. As a final project, students performed simple, well-defined experiments using the recognizers they had constructed. The CSLU Toolkit is freely available for non-commercial use from http://cslu.cse.ogi.edu/. In the future, similar courses based on the toolkit could be created and shared by many researchers in the speech community via the world-wide web.
Ping-Fai Yang, AT&T Laboratories -- Research (USA)
Yannis Stylianou, AT&T Laboratories -- Research (USA)
In this paper we present the application of a set of voice alteration algorithms based on Linear Prediction (LP). We survey some potential application areas of voice alteration technology and argue that near-real-time performance is a critical requirement for many of them. One benefit of our algorithms is their simplicity, and therefore the feasibility of implementing them in a real-time system. To this end, we built an experimental platform on a personal computer. We also present our implementation and user experience from this effort.
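The specific LP-based alteration algorithms are not detailed in the abstract; the sketch below shows only the generic analysis/modify/resynthesis pattern such algorithms share: inverse-filter a frame with its LP polynomial, modify the poles (here, compressing their angles to lower the formants), and refilter the residual. The pole-warping rule and all parameters are assumptions for illustration, not the authors' method.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order):
    """LP coefficients by the autocorrelation method (a direct Toeplitz
    solve stands in for the usual Levinson-Durbin recursion)."""
    n = len(frame)
    r = np.correlate(frame, frame, 'full')[n - 1:n + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))   # A(z) = 1 - sum(a_k z^-k)

def alter_frame(frame, order=16, warp=0.9):
    """One analysis/synthesis step: inverse-filter the frame to estimate
    the excitation, shrink the pole angles (lowering the formants), and
    resynthesise through the modified all-pole filter."""
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)          # excitation estimate
    poles = np.roots(a)
    poles = np.abs(poles) * np.exp(1j * np.angle(poles) * warp)
    a_mod = np.real(np.poly(poles))
    return lfilter([1.0], a_mod, residual)       # resynthesis

frame = np.random.randn(400)                     # stand-in for a speech frame
out = alter_frame(frame)
```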
Beng Tiong Tan, Vocalis Ltd. (U.K.)
Yong Gu, Vocalis Ltd. (U.K.)
Trevor Thomas, Vocalis Ltd. (U.K.)
This study investigates utterance verification (UV) algorithms for a voice-activated dialing (VAD) system. UV techniques improve both the accuracy of a VAD system and the efficiency of its user interface, by reducing the need for confirmation. In this paper, we examine various UV methods, namely the all-phone garbage model (GM), the N-best likelihood ratio (NBLR), and combined methods. The performance of a VAD system with UV is studied at various vocabulary sizes. By rejecting 9.5% of correctly recognized names, the system error rate becomes less than 0.3%, which represents a 91% reduction in error rate over the baseline system. The UV technique can reduce the number of confirmations by at least 88% at a system error rate of 0.28%.
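A minimal sketch of the two confidence measures named in the abstract: a garbage-model log-likelihood ratio (GM) and an N-best likelihood ratio (NBLR), both frame-normalised and compared against thresholds. The log-likelihood values and thresholds below are hypothetical; in practice the thresholds are tuned on held-out data.

```python
def verify(name_loglik, garbage_loglik, nbest_logliks, n_frames,
           t_gm=0.5, t_nblr=0.2):
    """Accept the recognized name only if both confidence measures pass.

    GM   - log-likelihood ratio of the best name hypothesis against an
           all-phone garbage (filler) model;
    NBLR - ratio of the best hypothesis against its N-best competitors.
    """
    gm = (name_loglik - garbage_loglik) / n_frames
    competitor = max(nbest_logliks)
    nblr = (name_loglik - competitor) / n_frames
    accept = gm >= t_gm and nblr >= t_nblr
    return accept, gm, nblr

# Hypothetical log-likelihoods for a 120-frame utterance.
print(verify(-5200.0, -5400.0, [-5260.0, -5310.0], 120))
```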
Hsin-Min Wang, Academia Sinica (Taiwan)
Bor-Shen Lin, National Taiwan University (Taiwan)
Berlin Chen, Academia Sinica (Taiwan)
Bo-Ren Bai, National Taiwan University (Taiwan)
Using voice memos instead of text memos is believed to be more natural, convenient, and attractive. This paper presents a working Mandarin voice memo system that provides automatic notification and voice retrieval. The main techniques are a content-based spoken document retrieval approach and a date-time expression detection and understanding approach. Extensive preliminary experiments were performed, with encouraging results.
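The date-time detection-and-understanding step can be sketched as a small rule set that spots a relative-day word and a clock time in the transcript and grounds them against the current time. The rules below are in English for illustration (the paper's system handles Mandarin) and are assumptions, not the authors' grammar.

```python
import re
from datetime import datetime, timedelta

RELATIVE_DAYS = {'today': 0, 'tomorrow': 1, 'the day after tomorrow': 2}
TIME_RE = re.compile(r'\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b', re.I)

def parse_datetime(text, now=None):
    """Return a concrete datetime for the first date-time expression
    found, or None: match a relative-day phrase and a clock time, then
    ground them against `now`."""
    now = now or datetime.now()
    day_offset = 0
    # Try longer phrases first so 'the day after tomorrow' wins over 'tomorrow'.
    for phrase, offset in sorted(RELATIVE_DAYS.items(),
                                 key=lambda kv: -len(kv[0])):
        if phrase in text.lower():
            day_offset = offset
            break
    m = TIME_RE.search(text)
    if not m:
        return None
    hour = int(m.group(1)) % 12 + (12 if m.group(3).lower() == 'pm' else 0)
    minute = int(m.group(2) or 0)
    base = now + timedelta(days=day_offset)
    return base.replace(hour=hour, minute=minute, second=0, microsecond=0)

memo = 'remind me to call the lab tomorrow at 9:30 am'
print(parse_datetime(memo, now=datetime(1998, 11, 30, 12, 0)))
# -> 1998-12-01 09:30:00
```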