Session WMB: Multimodal Speech Processing, Emerging Techniques and Applications

Chairperson: Giorgio Micca, CSELT, Italy



FUZZY LOGIC FOR RULE-BASED FORMANT SPEECH SYNTHESIS

Authors: S. Raptis and G. Carayannis

Speech Synthesis Team, Institute for Language and Speech Processing, 22 Margari St., 115 25 Athens, Greece. Tel. +30 1 6712250, Fax: +30 1 6741262, E-mail: spy@ilsp.gr

Volume 3 pages 1599 - 1602

ABSTRACT

Fuzzy set theory and fuzzy logic were introduced by Zadeh back in 1965 to permit the treatment of vague, imprecise, and ill-defined knowledge in a concise manner. One of the unique advantages of fuzzy logic is that it can directly incorporate and utilize qualitative and heuristic knowledge, in the form of causal if-then production rules, for reasoning and inference. On the other hand, rule-based formant speech synthesis makes considerable use of rules for many of the tasks it involves, e.g. grapheme-to-phoneme transcription, coarticulation, concatenation, and duration rules. These rules also take the if-then form, with their antecedent (condition) part describing the context of the rule and their consequent an appropriate action to be taken. The main motivation for introducing fuzzy logic into the synthesis-by-rule paradigm is its ability to host and treat uncertainty and imprecision both in the condition part of a rule and in its consequent. This may be argued to significantly reduce the number of required rules while rendering them more meaningful and human-like.
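A minimal sketch of the kind of fuzzy coarticulation rule discussed above, in Python (the membership function, its breakpoints, and the F2 target value are illustrative assumptions, not the authors' actual rule set):

    # Hypothetical fuzzy rule: "IF the right context is back-rounded THEN lower F2".
    def tri(x, a, b, c):
        # Triangular membership function peaking at b.
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x < b else (c - x) / (c - b)

    def apply_rule(f2_target_hz, context_backness):
        # Antecedent: degree to which "context is back-rounded" holds (assumed curve).
        degree = tri(context_backness, 0.5, 1.0, 1.5)
        # Consequent: pull the F2 target towards 900 Hz, weighted by the firing degree.
        return (1.0 - degree) * f2_target_hz + degree * 900.0

    print(apply_rule(1800.0, 0.9))  # partial firing -> F2 pulled part-way down (1080 Hz)

Because such a rule fires to a degree rather than all-or-none, a single rule covers a continuum of contexts that would otherwise require several crisp rules, which is the reduction in rule count argued for above.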

A0029.pdf



INTEGRATING ACOUSTIC AND LABIAL INFORMATION FOR SPEAKER IDENTIFICATION AND VERIFICATION

Authors: Pierre Jourlin (1),(2), Juergen Luettin (1), Dominique Genoud (1), Hubert Wassner (1)

(1) IDIAP, rue du Simplon 4, CP 592, CH-1920 Martigny, Switzerland (2) LIA, 339 chemin des Meinajaries, BP 1228, 84911 Avignon Cedex 9, France. E-mail: jourlin@univ-avignon.fr, {luettin, genoud}@idiap.ch, wassner@ensta.fr

Volume 3 pages 1603 - 1606

ABSTRACT

This paper describes a multimodal approach for speaker verification. The system consists of two classifiers, one using visual features and the other using acoustic features. A lip tracker is used to extract visual information from the speaking face which provides shape and intensity features. We describe an approach for normalizing and mapping different modalities onto a common confidence interval. We also describe a novel method for integrating the scores of multiple classifiers. Verification experiments are reported for the individual modalities and for the combined classifier. The integrated system outperformed each sub-system and reduced the false acceptance rate of the acoustic sub-system from 2.3% to 0.5%.
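A rough sketch of the score normalization and fusion step described above (the sigmoid mapping, its parameters, and the fixed weights are assumptions for illustration, not the paper's actual method):

    import math

    def normalize(score, threshold, spread):
        # Map a raw classifier score onto (0, 1), centred on that classifier's threshold.
        return 1.0 / (1.0 + math.exp(-(score - threshold) / spread))

    def fuse(acoustic_score, labial_score, w_acoustic=0.7):
        a = normalize(acoustic_score, threshold=0.0, spread=1.0)  # hypothetical parameters
        v = normalize(labial_score, threshold=0.0, spread=1.0)
        return w_acoustic * a + (1.0 - w_acoustic) * v

    accept = fuse(1.3, -0.2) > 0.5  # accept or reject the claimed identity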

A0083.pdf



SUBWORD UNIT REPRESENTATIONS FOR SPOKEN DOCUMENT RETRIEVAL

Authors: Kenney Ng and Victor W. Zue

Spoken Language Systems Group, MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139, USA. E-mail: {kng, zue}@mit.edu

Volume 3 pages 1607 - 1610

ABSTRACT

This paper investigates the feasibility of using subword unit representations for spoken document retrieval as an alternative to using words generated by either keyword spotting or word recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. In this study, we examine a range of subword units of varying complexity derived from phonetic transcriptions. The basic underlying unit is the phone; more and less complex units are derived by varying the level of detail and the length of sequences of the phonetic units. We measure the ability of the different subword units to effectively index and retrieve a large collection of recorded speech messages. We also compare their performance when the underlying phonetic transcriptions are perfect and when they contain phonetic recognition errors.
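One of the simpler subword representations of the kind examined here can be pictured as overlapping phone n-grams used as indexing terms (the transcription and the choice n = 3 below are purely illustrative):

    def phone_ngrams(phones, n=3):
        # Overlapping length-n phone sequences serve as the indexing units.
        return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

    transcription = ["s", "p", "ow", "k", "ax", "n"]  # e.g. "spoken"
    print(phone_ngrams(transcription, n=3))
    # [('s', 'p', 'ow'), ('p', 'ow', 'k'), ('ow', 'k', 'ax'), ('k', 'ax', 'n')]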

A0102.pdf



NON-LINEAR REPRESENTATIONS, SENSOR RELIABILITY ESTIMATION AND CONTEXT-DEPENDENT FUSION IN THE AUDIOVISUAL RECOGNITION OF SPEECH IN NOISE

Authors: Pascal Teissier (1) (2) , Jean-Luc Schwartz (1) and Anne Guérin-Dugué (2)

(1) Institut de la Communication Parlée CNRS UPRESA 5009 / INPG - U. Stendhal ICP, INPG, 46 Av. Félix-Viallet, 38031 Grenoble Cedex 1 / {teissier, schwartz}@icp.grenet.fr (2) Laboratoire de Traitement d'Images et de Reconnaissance des Formes LTIRF, INPG, 46 Av. Félix-Viallet, 38031 Grenoble Cedex 1 / guerin@tirf.inpg.fr

Volume 3 pages 1611 - 1614

ABSTRACT

The paper addresses the recognition of French audiovisual vowels at various signal-to-noise ratios (SNRs). It presents a new non-linear preprocessing of the audio data which enables an estimation of the reliability of the audio sensor in relation to the SNR, and a significant increase in recognition performance at the output of the fusion process.

A0110.pdf



SECURIZED FLEXIBLE VOCABULARY VOICE MESSAGING SYSTEM ON UNIX WORKSTATION WITH ISDN CONNECTION

Authors: Philippe Renevey and Andrzej Drygajlo

Signal Processing Laboratory, Swiss Federal Institute of Technology, CH-1015 Lausanne, Switzerland. E-mail: Philippe.Renevey@lts.de.epfl.ch, Andrzej.Drygajlo@lts.de.epfl.ch

Volume 3 pages 1615 - 1618

ABSTRACT

This paper considers applications of automatic speech recognition and speaker verification techniques in developing efficient Voice Messaging Systems for Integrated Services Digital Network (ISDN) based communication systems. The prototype demonstrator presented was developed in the framework of a cooperative project involving two research institutes, the Signal Processing Laboratory (LTS) of the Swiss Federal Institute of Technology Lausanne (EPFL) and the Dalle Molle Institute for Perceptive Artificial Intelligence, Martigny (IDIAP), and three industrial partners: Advanced Communication Services (aComm), Sun Microsystems (Switzerland) and Swiss Telecom PTT [1]. The project is supported by the Commission for Technology and Innovation (CTI). Its goal is to make basic technologies for automatic speech recognition and speaker verification on a multi-processor SunSPARC workstation and SwissNet (ISDN) platform available to the industrial partners. The developed algorithms provide the necessary tools to design and implement workstation-oriented voice messaging demonstrators for telephone-quality Swiss French. The speech recognition algorithms are based on speaker-independent flexible vocabulary technology, and speaker verification is performed by a number of techniques executed in parallel and combined for an optimal decision. The recognition results obtained validate the flexible vocabulary approach, which offers the potential to build word models for any application vocabulary from a single set of phonetic sub-word units trained on the Swiss French Polyphone database.

A0120.pdf



AUTOMATIC DERIVATION OF MULTIPLE VARIANTS OF PHONETIC TRANSCRIPTIONS FROM ACOUSTIC SIGNALS

Authors: Houda Mokbel and Denis Jouvet

France Telecom - CNET/DIH/RCP, 2 avenue Pierre Marzin, 22307 Lannion cedex, France. E-mail: {mokbelh, jouvet}@lannion.cnet.fr

Volume 3 pages 1619 - 1622

ABSTRACT

This paper deals with two methods for automatically finding multiple phonetic transcriptions of words, given sample utterances of the words and an inventory of context-dependent subword units. The two approaches investigated are based on an analysis of the N-best phonetic decodings of the available utterances. In the set of transcriptions resulting from the N-best decoding of all the utterances, the first method selects the K most frequent variants (Frequency Criterion), while the second method selects the K most likely ones (Maximum Likelihood Criterion). Experiments carried out on speaker-independent recognition showed that the performance obtained with the Maximum Likelihood Criterion is not much different from that obtained with manual transcriptions. In the case of speaker-dependent speech recognition, estimating the 3 most likely transcription variants of each word yields promising results.
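A compact sketch of the two selection criteria, as one plausible reading of the abstract (variable names are assumptions, and the exact likelihood computation in the paper may differ): each utterance contributes an N-best list of (transcription, log-likelihood) pairs, and the K variants are chosen either by how often they appear or by their accumulated likelihood.

    from collections import Counter, defaultdict

    def select_variants(nbest_lists, k):
        counts = Counter()
        loglik = defaultdict(float)
        for nbest in nbest_lists:              # one N-best list per sample utterance
            for trans, ll in nbest:
                counts[trans] += 1             # Frequency Criterion statistic
                loglik[trans] += ll            # Maximum Likelihood Criterion statistic
        by_frequency = [t for t, _ in counts.most_common(k)]
        by_likelihood = sorted(loglik, key=loglik.get, reverse=True)[:k]
        return by_frequency, by_likelihood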

A0128.pdf



IMPROVED BIMODAL SPEECH RECOGNITION USING TIED-MIXTURE HMMS AND 5000 WORD AUDIO-VISUAL SYNCHRONOUS DATABASE

Authors: Satoshi NAKAMURA, Ron NAGAI, Kiyohiro SHIKANO

Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-01, JAPAN. E-mail: nakamura@is.aist-nara.ac.jp

Volume 3 pages 1623 - 1626

ABSTRACT

This paper presents methods to improve speech recognition accuracy by incorporating automatic lip reading. Lip reading accuracy is improved through the following approaches: 1) collection of an image and speech synchronous database of 5240 words, 2) feature extraction of 2-dimensional power spectra around the mouth, and 3) sub-word unit HMMs with tied-mixture distributions (tied-mixture HMMs). Experiments on a 100-word test show a performance of 85% by lip reading alone. It is also shown that tied-mixture HMMs improve lip reading accuracy. Speech recognition experiments integrating audio-visual information are carried out over various SNRs. The results show that the integration always achieves better performance than using either audio or visual information alone.

A0147.pdf



ON THE USE OF PHONE DURATION AND SEGMENTAL PROCESSING TO LABEL SPEECH SIGNAL

Authors: Philippe Depambour*, Régine André-Obrecht*, Bernard Delyon**

depambou@irit.fr, obrecht@irit.fr, delyon@irisa.fr *IRIT - UMR CNRS 5505, 118 route de Narbonne, F-31062 Toulouse Cedex, FRANCE ** IRISA, Campus universitaire de Beaulieu, F-35042 Rennes Cedex, FRANCE

Volume 3 pages 1627 - 1630

ABSTRACT

This paper presents recent work on continuous speech labelling. We propose an original automatic labelling system in which elementary phone models take a segmental analysis and phone duration into account. These models are initialized by a short speaker-independent training stage in order to constitute a model database. From the standard phonetic transcription, phonological rules are gathered to handle the various pronunciations. For each new corpus or speaker, a quick unsupervised adaptation stage is performed to re-estimate the models, followed by the actual labelling. We assess this system by labelling a difficult corpus (sequences of connected spelled letters) and sentences from one speaker of the BREF80 corpus. The results are quite promising: in both experiments, less than 9% of phonetic boundaries are incorrectly located.

A0272.pdf



Automatic Detection of Disturbing Robot Voice- and Ping Pong-Effects in GSM Transmitted Speech

Authors: Martin Paping and Thomas Fahnle

Ascom Systec AG Gewerbepark CH-5506 Magenwil, Switzerland E-Mail: Martin.Paping@ascom.ch

Volume 3 pages 1631 - 1634

ABSTRACT

This contribution reports on a method to automatically detect the disturbing Robot Voice and Ping Pong effects which occur in GSM-transmitted speech. Both effects are caused by the frame substitution technique recommended by the GSM standard: in these cases the transmitted speech may be modulated by a disturbing 50 Hz component. These modulations can be detected very easily in the frequency domain. By a framewise comparison of the modulation amplitude of an undisturbed clean speech signal with a test signal, it is possible to locate the occurrence of Robot Voice and Ping Pong very precisely. Comparing human perception to the outcome of the proposed algorithm shows a high degree of correspondence.
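As a rough illustration of the detection principle (frame length, envelope estimate, and the comparison step are assumptions, not the authors' exact procedure), the strength of a 50 Hz modulation can be read from the spectrum of each frame's amplitude envelope and compared framewise against the clean reference signal:

    import numpy as np

    def modulation_amplitude_50hz(signal, fs, frame_len_s=0.5):
        n = int(frame_len_s * fs)
        amps = []
        for start in range(0, len(signal) - n, n):
            envelope = np.abs(signal[start:start + n])          # crude amplitude envelope
            spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
            freqs = np.fft.rfftfreq(n, 1.0 / fs)
            amps.append(spectrum[np.argmin(np.abs(freqs - 50.0))])  # bin nearest 50 Hz
        return np.array(amps)  # frames much larger than the clean reference are flagged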

A0297.pdf



SPEECH SYNTHESIS USING PHASE VOCODER TECHNIQUES

Authors: JOSEPH DI MARTINO

Loria Laboratory Université Henri Poincaré Nancy I BP. 239; 54506 VANDOEUVRE FRANCE Tel. (+33) 03 83 59 20 36 Fax: (+33) 03 83 41 30 79, E-mail: jdm@loria.fr

Volume 3 pages 1635 - 1638

ABSTRACT

New light is thrown on the Portnoff [1] speech signal time-scale modification algorithm. It is shown in particular that the Portnoff algorithm easily accommodates expansion factors greater than 2 without causing reverberation or chorusing. The modified Portnoff algorithm, which draws on spectral modification techniques due to Seneff [2], has been tested on several speech signals. The quality of the synthesized signal is fully satisfactory even for large expansion factors. The article gives a brief summary of the Portnoff algorithm and spells out the modifications introduced. It is shown that the phase unwrapping procedure constitutes a crucial point of the algorithm.

A0442.pdf

Recordings



INTEGRATION OF EYE FIXATION INFORMATION WITH SPEECH RECOGNITION SYSTEMS

Authors: Ramesh R. Sarukkai Craig Hunter

Dept. of Computer Science University of Rochester Rochester, NY-14627 e-mail: ramesh@kurzweil.com ; craigh@isn.com

Volume 3 pages 1639 - 1643

ABSTRACT

In this paper, a semi-tight coupling between the visual and auditory modalities is proposed: in particular, eye fixation information is used to enhance the output of speech recognition systems. This is achieved by treating natural human eye fixations as deictic references to symbolic objects and passing this information on to the speech recognizer. The speech recognizer biases its search towards this set of symbols/words during the best word sequence search. As an illustrative example, the TRAINS interactive planning assistant system has been used as a test-bed; eye fixations provide important cues to city names which the user sees on the map. Experimental results indicate that eye fixations help reduce speech recognition errors. This work suggests that integrating information from different interfaces to bootstrap each other would enable the development of reliable and robust interactive multi-modal human-computer systems.
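The coupling can be pictured as a small language-model bias applied during the search (the bonus value and the interface below are assumptions, not the system's actual implementation):

    FIXATION_BONUS = 2.0  # hypothetical additive bonus on the log score

    def biased_lm_score(word, base_log_prob, fixated_words):
        # Words corresponding to recently fixated map objects get a small boost.
        return base_log_prob + (FIXATION_BONUS if word in fixated_words else 0.0)

    fixated = {"avon", "bath", "corning"}           # city names the user looked at
    print(biased_lm_score("avon", -8.5, fixated))   # -6.5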

A0710.pdf



GENERATION OF BROADBAND SPEECH FROM NARROWBAND SPEECH USING PIECEWISE LINEAR MAPPING

Authors: Y. Nakatoh, M. Tsushima, and T. Norimatsu

Multimedia Development Center Matsushita Electric Industrial Co., Ltd. 1006 Kadoma, Kadoma-shi, Osaka, 571 JAPAN. Tel. +81 6 906 4552, FAX: +81 6 908 6802, E-mail: nakatoh@arl.drl.mei.co.jp

Volume 3 pages 1643 - 1646

ABSTRACT

This paper proposes a method for recovering broadband speech from narrowband speech based on piecewise linear mapping. In this method, the narrowband spectrum envelope of the input speech is transformed to a broadband spectrum envelope using linear transformation matrices associated with several spectrum spaces. These matrices were estimated from speech training data so as to minimize the mean square error between the transformed and the original spectra. The algorithm is compared with the following other methods: (1) codebook mapping, and (2) a neural network. Through evaluation with a spectral distance measure, it was found that the proposed method achieves lower spectral distortion than the other methods. Perceptual experiments indicate good performance for the reconstructed broadband speech.
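The mapping itself can be sketched as follows (a toy illustration under assumed dimensions; the class assignment rule and the matrix training are simplified here): the narrowband envelope vector x is assigned to one of several spectral classes and multiplied by the matrix trained for that class to give the broadband estimate y = A_k x.

    import numpy as np

    def map_envelope(x, centroids, matrices):
        # Pick the spectral class whose centroid is nearest to x, then apply its matrix.
        k = int(np.argmin([np.linalg.norm(x - c) for c in centroids]))
        return matrices[k] @ x   # A_k was trained to minimize the mean square spectral error

    # Toy dimensions: 10-dim narrowband envelope -> 16-dim broadband envelope, 2 classes.
    centroids = [np.zeros(10), np.ones(10)]
    matrices = [np.random.randn(16, 10), np.random.randn(16, 10)]
    y = map_envelope(np.random.randn(10), centroids, matrices)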

A0726.pdf



AN ASSESSMENT OF THE BENEFITS ACTIVE NOISE REDUCTION SYSTEMS PROVIDE TO SPEECH INTELLIGIBILITY IN AIRCRAFT NOISE ENVIRONMENTS

Authors: I. E.C. Rogers

Systems Integration Department, Air Systems Sector Defence Evaluation and Research Agency Farnborough, Hampshire GU14 0LX, United Kingdom. Tel. +44 1252 39 23 48, Fax: +44 1252 39 30 91

Volume 3 pages 1647 - 1650

ABSTRACT

The high noise levels experienced in some military fast jet aircraft and helicopters generally result in a reduction in the intelligibility of speech communications. A study has been conducted to assess the effect of reducing noise levels at the ear, by the use of current Active Noise Reduction (ANR) systems, on speech intelligibility in aircraft noise environments. The results of this study indicate that ANR would improve speech intelligibility in both types of aircraft. The assessment has been conducted using Diagnostic Rhyme Test (DRT) and Articulation Index (AI) techniques. The study has also allowed the correlation between DRT and AI test results to be investigated. A more detailed account of the work reported in this paper is provided in [1].

A0805.pdf



OLGA - A Dialogue System with an Animated Talking Agent

Authors: Jonas Beskow (1), Kjell Elenius (1) & Scott McGlashan (2)

(1) Department of Speech, Music and Hearing, KTH, Stockholm (2) Swedish Institute for Computer Science, Stockholm; now at Ericsson Radio Systems, Stockholm. E-mail: beskow@speech.kth.se, kjell@speech.kth.se, scott.mcglashan@era-t.ericsson.se

Volume 3 pages 1651 - 1654

ABSTRACT

The object of the Olga project is to develop an interactive 3D animated talking agent. A futuristic application scenario is interactive digital TV, where the Olga agent would guide naive users through the various services available on the network. The current application is a consumer information service for microwave ovens. Olga required the development of a system with components from many different fields: multimodal interfaces, dialogue management, speech recognition, speech synthesis, graphics, animation, facilities for direct manipulation and database handling. To integrate all knowledge sources Olga is implemented with separate modules communicating with a central dialogue interaction manager. In this paper we mainly describe the talking animated agent and the dialogue manager. There is also a short description of the preliminary speech recogniser used in the project.

A0912.pdf



TOWARDS USABLE MULTIMODAL COMMAND LANGUAGES: DEFINITION AND ERGONOMIC ASSESSMENT OF CONSTRAINTS ON USERS' SPONTANEOUS SPEECH AND GESTURES

Authors: S. Robbe*, N. Carbonell*, C. Valot**

*CRIN-CNRS & INRIA-Lorraine, BP. 239, F54506 Vandoeuvre les Nancy Cedex **IMASSA-CERMA, BP 73, F91223 Brétigny sur Orge Cedex *Tel. 33 3 83 59 20 47 FAX: 33 3 83 41 30 79, E-mail: {robbe, carbo}@loria.fr **Tel. 33 1 69 88 33 70 FAX: 33 1 69 88 33 75, E-mail: claude@cerma.fr

Volume 3 pages 1655 - 1658

ABSTRACT

Within the framework of a prospective ergonomic approach, we simulated two multimodal user interfaces in order to study the usability of constrained vs. spontaneous speech in a multimodal environment. The first experiment, which served as a reference, gave subjects the opportunity to use speech and gestures freely, while subjects in the second experiment had to comply with multimodal constraints. We first describe the experimental setup and the approach we adopted for designing the artificial command language used in the second experiment. We then present the results of our analysis of the subjects' utterances and gestures, laying emphasis on their implementation of the linguistic constraints. The conclusions of the empirical assessment of the usability of this multimodal command language, built from a restricted subset of natural language and simple designation gestures, are associated with recommendations which may prove useful for improving the usability of oral human-computer interaction in a multimodal environment.

A0938.pdf



EXPLOITING REPAIR CONTEXT IN INTERACTIVE ERROR RECOVERY

Authors: Bernhard Suhm and Alex Waibel

Interactive Systems Laboratories Carnegie Mellon University, Pittsburgh PA, USA University of Karlsruhe, Karlsruhe, Germany Email: {bsuhm,ahw}@cs.cmu.edu

Volume 3 pages 1659 - 1662

ABSTRACT

In current speech applications, facilities to correct recognition errors are limited to either choosing among alternative hypotheses (by voice or by mouse click) or respeaking. Information from the context of a repair is ignored. We developed a method which improves the accuracy of correcting speech recognition errors interactively by taking into account the context of the repair interaction. The basic idea is to use the same language modeling information used in the initial decoding of continuous speech input for decoding the (isolated word) repair input. The repair is not limited to speech; the user can choose to switch modality, for instance spelling or handwriting a word. We implemented this idea by rescoring N-best lists obtained from decoding the repair input, using language model scores for trigrams which include the corrected word. We evaluated the method on a set of repairs by respeaking, spelling and handwriting, collected with our prototype continuous speech dictation interface. The method can significantly increase the accuracy of repair, compared to recognizing the repair input as an independent event.
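A compressed sketch of the rescoring step described above (the score combination, weight, and interfaces are assumptions, not the exact implementation): each hypothesis for the repair input is rescored with the trigram spanning the words around the error position in the original utterance.

    def rescore_repair(nbest, left_word, right_word, trigram_logprob, lm_weight=0.6):
        # nbest: list of (candidate_word, acoustic_score) from decoding the repair input
        rescored = []
        for word, acoustic_score in nbest:
            lm_score = trigram_logprob((left_word, word, right_word))
            rescored.append((word, acoustic_score + lm_weight * lm_score))
        return max(rescored, key=lambda pair: pair[1])[0]  # best correction in context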

A1049.pdf



AN HYBRID IMAGE PROCESSING APPROACH TO LIPTRACKING INDEPENDENT OF HEAD ORIENTATION

Authors: L. Revéret (1), F. Garcia (2), C. Benoît (1), E. Vatikiotis-Bateson (2)

(1) Institut de la Communication Parlée, INPG/ENSERG/Université Stendhal, BP 25, 38040 Grenoble Cedex 9, France. Tel. +33 4 76 82 41 28, FAX: +33 4 76 82 43 35 (2) HIP-ATR Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan. Tel. +81 774 95 1011, FAX: +81 774 95 1008. E-mail: {reveret, benoit}@icp.grenet.fr, {garcia, bateson}@hip.atr.co.jp

Volume 3 pages 1663 - 1666

ABSTRACT

This paper examines the influence of head orientation on liptracking. There are two main conclusions: first, lip gesture analysis and head movement correction should be processed independently; second, the measurement of articulatory parameters may be corrupted by head movement if it is performed directly at the pixel level. We thus propose an innovative liptracking technique which relies on a "3D active contour" model of the lips controlled by articulatory parameters. The 3D model is projected onto the image of a speaking face through a camera model, thus allowing spatial re-orientation of the head. Liptracking is then performed by automatic adjustment of the control parameters, independently of head orientation. The final objective of our study is to apply a pixel-based method to detect head orientation. Nevertheless, we consider that head motion and lip gestures are detected by different processes, whether cognitive (in humans) or computational (in machines). For this reason, we decided to first develop and evaluate orientation-free liptracking using a non-video-based head motion detection technique, which is presented here.

A1051.pdf



AUTOMATIC MODELING OF COARTICULATION IN TEXT-TO-VISUAL SPEECH SYNTHESIS

Authors: Bertrand Le Goff

Institut de la Communication Parlée UPRESA5009/ INPG/Université Stendhal BP25X, 38000 GRENOBLE CEDEX Tel. +33 (0)4 76 82 41 28, FAX: +33 (0)4 76 82 43 35, E-mail: legoff@icp.grenet.fr

Volume 3 pages 1667 - 1670

ABSTRACT

We have developed a visual speech synthesizer for unlimited French text and synchronized it with an audio text-to-speech synthesizer also developed at the ICP (Le Goff & Benoît, 1996). The front end of our synthesizer is a 3-D model of the face whose speech gestures are controlled by eight parameters: five for the lips, one for the chin, and two for the tongue. In contrast to most existing systems, which are based on a limited set of prestored facial images, we have adopted the parametric approach to coarticulation first proposed by Cohen and Massaro (1993). We have thus implemented a coarticulation model based on spline-like functions, defined by three coefficients, applied to each target in a library of 16 French visemes. However, unlike Cohen & Massaro (1993), we have adopted a data-driven approach to identify the many coefficients necessary to model coarticulation. To do so, we systematically analyzed an ad-hoc corpus uttered by a French male speaker. We have then run an intelligibility test to quantify the benefit of seeing the synthetic face (in addition to hearing the synthetic voice) under several conditions of background noise.
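For reference, the Cohen & Massaro (1993) formulation referred to above can be written as follows (generic notation; the paper's own symbols and the exact form of its three spline coefficients may differ). Each segment s exerts a dominance over articulatory parameter p,

    D_{sp}(t) = \alpha_{sp} \, e^{-\theta_{sp} \lvert t - t_s \rvert^{c}},

and the synthesized trajectory is the dominance-weighted average of the segment targets T_{sp}:

    P_p(t) = \frac{\sum_s D_{sp}(t)\, T_{sp}}{\sum_s D_{sp}(t)}.

The data-driven step described in the abstract then amounts to estimating coefficients of this kind for each viseme target from the analyzed corpus.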

A1086.pdf



A MULTIMEDIA PLATFORM FOR AUDIO-VISUAL SPEECH PROCESSING

Authors: A. Adjoudani, T. Guiard-Marigny, B. Le Goff, L. Reveret & C. Benoît

Institut de la Communication Parlée UPRESA CNRS n° 5009 INPG-ENSERG / Université Stendhal, Grenoble, France Tel. +33 4 82 43 36, FAX: +33 4 82 43 35, E-mail: benoit@icp.grenet.fr

Volume 3 pages 1671 - 1674

ABSTRACT

In the framework of the European ESPRIT project MIAMI ("Multimodal Integration for Advanced Multimedia Interfaces"), a platform has been developed at the ICP to study various combinations of audio-visual speech processing, including real-time lip motion analysis, real-time synthesis of models of the lips and of the face, audio-visual speech recognition of isolated words, and text-to-audio-visual speech synthesis in French. All these facilities are implemented on a network of three SGI computers. Not only is this platform a useful research tool to study the production and perception of visible speech, as well as audio-visual integration by humans and by machines, but it is also a convenient testbed to study man-machine multimodal interaction and very low bit rate audio-visual speech communication between humans.

A1087.pdf



AN INTELLIGENT SYSTEM FOR INFORMATION RETRIEVAL OVER THE INTERNET THROUGH SPOKEN DIALOGUE

Authors: Hiroya Fujisaki (1) , Hiroyuki Kameda (2) , Sumio Ohno (1) , Takuya Ito (1) , Ken Tajima (1) , and Kenji Abe (1)

(1) Department of Applied Electronics, Science University of Tokyo 2641 Yamazaki, Noda, 278 Japan (2) Tokyo Engineering University, 1404-1 Katakura, Hachiouji,192 Japan

Volume 3 pages 1675 - 1678

ABSTRACT

To cope with the abundance of information available over the Internet, an efficient, accurate and user-friendly system for information retrieval is mandatory. This paper presents an intelligent system based on the use of spoken dialogue as the main channel of the user-system interface, the use of key concepts, the processing of unknown words, the automatic acquisition of various kinds of knowledge for improving performance, and agent technologies for system realization. Details of the functions required of the agents are also described.

A1098.pdf



DATA HIDING IN SPEECH USING PHASE CODING

Authors: Yasemin Yardimci (1) , A. Enis Cetin (1) and Rashid Ansari (2)

(1) Department of Electrical and Electronic Engineering, Bilkent University, Ankara 06533, Turkey (2) Department of Electrical Engineering and Computer Science, University of Illinois, Chicago, Illinois, USA

Volume 3 pages 1679 - 1682

ABSTRACT

The human auditory system is insensitive to the phase information of the speech signal. By taking advantage of this fact, data such as the transcript, some keywords, and copyright information can be embedded into the speech signal by altering the phase in a predefined manner. In this paper, an all-pass filtering based data embedding scheme is developed for speech signals. Since all-pass filters modify only the phase without affecting the magnitude response, they are employed to diffuse data into the speech signal by filtering different portions of the speech signal with different all-pass filters. The embedded data can be retrieved by tracking the zeros of the all-pass filters.
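A minimal illustrative sketch of phase-only embedding with first-order all-pass filters (the coefficient values and the one-bit-per-segment framing are assumptions; the paper's actual scheme and its zero-tracking decoder are not reproduced here):

    import numpy as np
    from scipy.signal import lfilter

    # H_a(z) = (a + z^-1) / (1 + a z^-1) has unit magnitude response for real |a| < 1,
    # so filtering a segment changes only its phase.
    COEFFS = {0: 0.3, 1: -0.5}   # hypothetical all-pass coefficient per embedded bit

    def embed_bit(segment, bit):
        a = COEFFS[bit]
        return lfilter([a, 1.0], [1.0, a], segment)

    watermarked = embed_bit(np.random.randn(1600), bit=1)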

A1102.pdf



CAVE: An on-line procedure for creating and running auditory-visual speech perception experiments - hardware, software, and advantages

Authors: Denis Burnham, John Fowler, and Michelle Nicol

School of Psychology, University of NSW, Sydney, 2052, Australia. Tel. +61 2 9385 30 25, Fax: +61 2 9385 36 41, email: d.burnham@unsw.edu.au

Volume 3 pages 1683 - 1686

ABSTRACT

The McGurk effect or fusion illusion, in which mismatched auditory and visual speech sound components are perceived as an emergent phone, is extensively used in auditory-visual speech perception research. The usual method of running experiments involves time-consuming preparation of dubbed videotapes. This paper describes an alternative, the Computerised Auditory-Visual Experiment (CAVE), in which audio dubbing occurs on-line. Its advantages include reduced preparation time, greater flexibility, and on-line collection of response type and latency data.

A1192.pdf
