Session Th2C: Lipreading

Chairperson: Christian Benoît, ICP, Université Stendhal, France



TOWARDS SPEAKER INDEPENDENT CONTINUOUS SPEECHREADING

Authors: Juergen Luettin

IDIAP, CP 592, 1920 Martigny, Switzerland. E-mail: luettin@idiap.ch

Volume 4 pages 1991 - 1994

ABSTRACT

This paper describes recent speechreading experiments for a speaker-independent continuous digit recognition task. Visual feature extraction is performed by a lip tracker which recovers information about the lip shape and about the grey-level intensity around the mouth. These features are used to train visual word models using continuous density HMMs. Results show that the method generalises well to new speakers and that the recognition rate is highly variable across digits, as expected given the high visual confusability of certain words.
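As a rough illustration of the modelling step, the sketch below trains one continuous-density HMM per digit word on per-frame visual feature vectors. It uses the hmmlearn library rather than the author's implementation; the feature layout, state count and isolated-word scoring (in place of connected-digit decoding) are assumptions made here.

```python
# Sketch: whole-word visual HMMs for speechreading (not the paper's code).
# Assumes each utterance is a (T, D) array of per-frame lip-shape and
# grey-level intensity features.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_word_models(train_data, n_states=5):
    """train_data: dict mapping digit word -> list of (T, D) feature arrays."""
    models = {}
    for word, sequences in train_data.items():
        X = np.concatenate(sequences)               # stack frames of all examples
        lengths = [len(seq) for seq in sequences]   # per-utterance frame counts
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)                         # Baum-Welch re-estimation
        models[word] = hmm
    return models

def recognise(models, sequence):
    """Return the word whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(sequence))
```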

A0039.pdf



Driving Synthetic Mouth Gestures: Phonetic Recognition for FaceMe!

Authors: William Goldenthal, Keith Waters, Jean-Manuel Van Thong, and Oren Glickman

E-mail: {thal, waters, jmvt, oren}@crl.dec.com, Digital Equipment Corporation, Cambridge Research Laboratory, One Kendall Sq., Building 700, Cambridge, Massachusetts 02139, USA

Volume 4 pages 1995 - 1998

ABSTRACT

The goal of this work is to use phonetic recognition to drive a synthetic image with speech. Phonetic units are identified by the phonetic recognition engine and mapped to mouth gestures, known as visemes, the visual counterpart of phonemes. The acoustic waveform and visemes are then sent to a synthetic image player, called FaceMe!, where they are rendered synchronously. This paper provides background for the core technologies involved in this process and describes asynchronous and synchronous prototypes of a combined phonetic recognition/FaceMe! system which we use to render mouth gestures on an animated face.
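The phoneme-to-viseme step can be pictured as a many-to-one lookup applied to the recogniser's time-aligned output; the mapping table and event format below are illustrative assumptions, not the actual FaceMe! interface.

```python
# Sketch of the phoneme-to-viseme step (illustrative mapping, not the
# FaceMe! system's actual table or player protocol).
from typing import List, Tuple

# Hypothetical many-to-one mapping: visually similar phonemes share a viseme.
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "iy": "spread", "ih": "spread",
    "uw": "rounded", "ow": "rounded",
    "sil": "neutral",
}

def phones_to_viseme_track(phones: List[Tuple[str, float, float]]):
    """Convert time-aligned phones (label, start, end) into viseme events
    that a synthetic face player can render in sync with the audio."""
    track = []
    for label, start, end in phones:
        viseme = PHONEME_TO_VISEME.get(label, "neutral")
        if track and track[-1][0] == viseme:
            # Merge consecutive identical visemes to avoid redundant gestures.
            track[-1] = (viseme, track[-1][1], end)
        else:
            track.append((viseme, start, end))
    return track
```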

A0040.pdf



Continuous visual speech recognition using geometric lip-shape models and neural networks

Authors: Alexandrina Rogozan and Paul Deléglise

Laboratoire d'Informatique de l'Université du Maine, 72085 Le Mans Cedex 9, France. Tel: +33 02 43 83 38 64, Fax: +33 02 43 83 38 68, E-mail: Alexandrina.Foucault@lium.univ-lemans.fr

Volume 4 pages 1999 - 2002

ABSTRACT

This paper describes a new approach to automatic speechreading. First, we use an efficient yet effective representation of visible speech: a geometric lip-shape model. Then we present an automatic, objective method for merging phonemes that appear visually similar into visemes for our speaker. To determine the visemes, we trained a self-organising map (SOM) with the Kohonen algorithm on each phoneme extracted from our visual database. We then present our visual speech recognition systems, based on heuristics and on neural networks (TDNN or JNN) trained to discriminate visual information. On a continuous spelling task, visual-alone recognition performance of about 37% was achieved using the TDNN and about 33% using the JNN.
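The viseme-derivation step can be sketched as follows: a Kohonen self-organising map is trained on lip-shape vectors, and phonemes whose frames most often win the same map unit are merged into one viseme class. This sketch uses the MiniSom library; the grid size, feature set and merging rule are assumptions, not the authors' configuration.

```python
# Sketch: deriving viseme classes by clustering phoneme lip-shape vectors
# with a Kohonen self-organising map (MiniSom; settings are assumptions).
from collections import defaultdict
import numpy as np
from minisom import MiniSom

def derive_visemes(phoneme_features, grid=(4, 4), n_iter=5000):
    """phoneme_features: dict mapping phoneme -> (N, D) array of lip-shape vectors."""
    labels = list(phoneme_features)
    data = np.concatenate([phoneme_features[p] for p in labels])
    som = MiniSom(grid[0], grid[1], data.shape[1], sigma=1.0, learning_rate=0.5)
    som.train_random(data, n_iter)              # Kohonen training

    # Map each phoneme to the SOM unit that wins most often for its frames;
    # phonemes sharing a winning unit form one viseme class.
    visemes = defaultdict(list)
    for p in labels:
        winners = [som.winner(x) for x in phoneme_features[p]]
        unit = max(set(winners), key=winners.count)
        visemes[unit].append(p)
    return dict(visemes)
```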

A0066.pdf



The Teleface project: Multi-modal speech communication for the hearing impaired

Authors: Jonas Beskow, Martin Dahlquist, Björn Granström, Magnus Lundeberg, Karl-Erik Spens & Tobias Öhman (in alphabetical order)

Department of Speech, Music and Hearing, KTH S-100 44 Stockholm, Sweden. Tel. +46 8 790 7879, FAX: +46 8 790 7854, E-mail: teleface@speech.kth.se

Volume 4 pages 2003 - 2006

ABSTRACT

The Teleface project, which aims at evaluating the possibilities for a telephone communication aid for hard-of-hearing persons, is presented together with its different parts: audio-visual speech synthesis, visual speech measurement and multimodal speech intelligibility studies. The experiments showed a noticeable intelligibility advantage from the addition of face information, for both natural and synthetic faces.

A0923.pdf



REAL-TIME LIP-TRACKING FOR LIPREADING

Authors: Rainer Stiefelhagen, Uwe Meier, Jie Yang

{stiefel|uwe}@ira.uka.de, yang+@cs.cmu.edu, Interactive Systems Laboratories, University of Karlsruhe, Germany / Carnegie Mellon University, USA

Volume 4 pages 2007 - 2010

ABSTRACT

This paper presents a new approach to lip tracking for lipreading. Instead of tracking features on the lips only, we propose to track the lips along with other facial features such as the pupils and nostrils. In the new approach, the face is first located in an image using a stochastic skin-color model; the eyes, lip corners and nostrils are then located and tracked inside the facial region. The new approach can effectively improve the robustness of lip tracking and simplify automatic detection of and recovery from tracking failures. The feasibility of the proposed approach has been demonstrated by the implementation of a lip tracking system. The system has been tested on a database that contains 900 image sequences of different speakers spelling words. It has successfully extracted lip regions from the image sequences to obtain training data for an audio-visual speech recognition system, and it has also been applied to extract the lip region in real time from live video images to provide the visual input for an audio-visual speech recognition system. On test sequences, we reduced the number of frames with tracking failures by a factor of two using detection and prediction of outliers in the set of found features.
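The face-localisation step can be illustrated with a simple stochastic skin-colour model: pixels are scored by their Mahalanobis distance to a Gaussian in normalised r-g chromaticity space, and the largest connected skin-coloured region is taken as the face. The model parameters and threshold below are illustrative assumptions, not those of the described system.

```python
# Sketch of skin-colour based face localisation (illustrative parameters).
import numpy as np
from scipy import ndimage

# Assumed Gaussian skin model: mean and inverse covariance of (r, g) chromaticities.
SKIN_MEAN = np.array([0.42, 0.31])
SKIN_COV_INV = np.linalg.inv(np.array([[0.004, 0.001], [0.001, 0.003]]))

def skin_mask(image, threshold=4.0):
    """image: (H, W, 3) uint8 RGB. Returns a boolean mask of skin-like pixels."""
    rgb = image.astype(float) + 1e-6
    chrom = rgb[..., :2] / rgb.sum(axis=-1, keepdims=True)   # r, g chromaticities
    diff = chrom - SKIN_MEAN
    mahal = np.einsum("...i,ij,...j->...", diff, SKIN_COV_INV, diff)
    return mahal < threshold

def face_bounding_box(image):
    """Locate the face as the largest connected skin-coloured region."""
    mask = skin_mask(image)
    labels, n = ndimage.label(mask)
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=range(1, n + 1))
    ys, xs = np.where(labels == np.argmax(sizes) + 1)
    return xs.min(), ys.min(), xs.max(), ys.max()
```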

A0962.pdf



FROM RAW IMAGES OF THE LIPS TO ARTICULATORY PARAMETERS: A VISEME-BASED PREDICTION

Authors: L. Revéret

Institut de la Communication Parlée, Université Stendhal / INPG, BP 25, 38040 Grenoble Cedex 9, France. Tel: +33 4 76 82 41 28, Fax: +33 4 76 82 43 35, E-mail: reveret@icp.grenet.fr

Volume 4 pages 2011 - 2014

ABSTRACT

This paper presents a method for extracting articulatory parameters by direct processing of raw images of the lips. The system architecture is made of three independent parts. First, a new grey-scale mouth image is centred and downsampled. Second, the image is aligned and projected onto a basis of artificial images; these images are the eigenvectors computed from a PCA applied to a set of 23 reference lip shapes. Then, a multilinear interpolation predicts articulatory parameters from the projection coefficients of the image onto the eigenvectors. In addition, the projection coefficients and the predicted parameters were evaluated with an HMM-based visual speech recogniser. Recognition scores obtained with our method are compared to reference scores and discussed.
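A minimal sketch of the projection-and-prediction pipeline is given below: PCA over the flattened reference lip images, projection of a new image onto the eigenvectors, and a linear least-squares mapping from projection coefficients to articulatory parameters, which stands in here for the paper's multilinear interpolation.

```python
# Sketch: PCA projection of lip images and linear prediction of articulatory
# parameters (the linear map is an assumption standing in for the paper's
# multilinear interpolation).
import numpy as np

def fit_pca(ref_images, n_components=10):
    """ref_images: (23, H*W) flattened grey-level reference lip images."""
    mean = ref_images.mean(axis=0)
    _, _, vt = np.linalg.svd(ref_images - mean, full_matrices=False)
    return mean, vt[:n_components]              # mean image and eigenvectors

def project(image, mean, eigvecs):
    """Projection coefficients of a flattened (H*W,) image onto the eigenvectors."""
    return eigvecs @ (image - mean)

def fit_predictor(ref_coeffs, ref_params):
    """Least-squares map from projection coefficients to articulatory parameters."""
    W, *_ = np.linalg.lstsq(ref_coeffs, ref_params, rcond=None)
    return W

# Usage: coeffs = project(new_image, mean, eigvecs); params = coeffs @ W
```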

A1088.pdf
