ABSTRACT
Fuzzy set theory and fuzzy logic were initiated by Zadeh back in 1965 to permit the treatment of vague, imprecise, and ill-defined knowledge in a concise manner. One of the unique advantages of fuzzy logic is that it is capable of directly incorporating and utilizing qualitative and heuristic knowledge in the form of causal if-then production rules for reasoning and inference. On the other hand, rule-based formant speech synthesis makes considerable use of rules for many of the tasks it involves, e.g. graphemic-to-phonemic transcription, coarticulation, concatenation, and duration rules. These rules also take the if-then form, with their antecedent (condition) part describing the context of the rule and their consequent part an appropriate action to be taken. The main motivation for introducing fuzzy logic into the synthesis-by-rule paradigm is its ability to host and treat uncertainty and imprecision in both the condition part of a rule and its consequent part. This may be argued to significantly reduce the number of required rules while rendering them more meaningful and human-like.
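To illustrate the kind of if-then machinery the abstract refers to, the sketch below evaluates two hypothetical fuzzy duration rules with Mamdani min-max inference and centroid defuzzification; the variables, membership functions, and rule contents are illustrative assumptions, not rules from the paper.

```python
# Minimal Mamdani-style fuzzy rule evaluation (illustrative only; the rules,
# membership functions, and variables are hypothetical, not from the paper).
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function peaking at b on support [a, c]."""
    return np.maximum(0.0, np.minimum((x - a) / (b - a), (c - x) / (c - b)))

# Universe of discourse for the output (e.g. a duration-lengthening factor).
y = np.linspace(0.8, 1.4, 601)

def infer(stress, speaking_rate):
    """Fire two illustrative if-then rules and defuzzify by centroid."""
    # Rule 1: IF stress is high AND rate is slow THEN lengthening is large.
    w1 = min(tri(stress, 0.5, 1.0, 1.5), tri(speaking_rate, 0.0, 0.2, 0.6))
    # Rule 2: IF stress is low THEN lengthening is small.
    w2 = tri(stress, -0.5, 0.0, 0.5)
    # Clip each rule's output set by its firing strength (Mamdani min),
    # then aggregate with max.
    out = np.maximum(np.minimum(w1, tri(y, 1.1, 1.3, 1.5)),
                     np.minimum(w2, tri(y, 0.8, 0.9, 1.0)))
    if out.sum() == 0:
        return 1.0                      # no rule fired: neutral factor
    return float((y * out).sum() / out.sum())   # centroid defuzzification

print(infer(stress=0.9, speaking_rate=0.3))
```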
ABSTRACT
This paper describes a multimodal approach to speaker verification. The system consists of two classifiers, one using visual features and the other using acoustic features. A lip tracker is used to extract visual information from the speaking face, providing shape and intensity features. We describe an approach for normalizing and mapping different modalities onto a common confidence interval. We also describe a novel method for integrating the scores of multiple classifiers. Verification experiments are reported for the individual modalities and for the combined classifier. The integrated system outperformed each sub-system and reduced the false acceptance rate of the acoustic sub-system from 2.3% to 0.5%.
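As a rough illustration of mapping heterogeneous classifier scores onto a common confidence interval and fusing them, the following sketch uses min-max normalization and a weighted sum; the bounds, weight, and threshold are assumptions, not the paper's actual normalization or integration method.

```python
# Illustrative score normalization and fusion for two classifiers; all
# constants are hypothetical.
import numpy as np

def min_max_normalize(score, lo, hi):
    """Map a raw classifier score into [0, 1] given score bounds observed
    on development data (lo, hi)."""
    return float(np.clip((score - lo) / (hi - lo), 0.0, 1.0))

def fuse(acoustic_score, visual_score, w_acoustic=0.7):
    """Weighted-sum fusion of normalized confidences."""
    a = min_max_normalize(acoustic_score, lo=-50.0, hi=10.0)   # hypothetical bounds
    v = min_max_normalize(visual_score, lo=0.0, hi=1.0)
    return w_acoustic * a + (1.0 - w_acoustic) * v

# Accept the claimed identity if the fused confidence exceeds a threshold.
accept = fuse(acoustic_score=-3.2, visual_score=0.81) > 0.6
print(accept)
```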
ABSTRACT
This paper investigates the feasibility of using subword unit representations for spoken document retrieval as an alternative to using words generated by either keyword spotting or word recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. In this study, we examine a range of subword units of varying complexity derived from phonetic transcriptions. The basic underlying unit is the phone; more and less complex units are derived by varying the level of detail and the length of sequences of the phonetic units. We measure the ability of the different subword units to effectively index and retrieve a large collection of recorded speech messages. We also compare their performance when the underlying phonetic transcriptions are perfect and when they contain phonetic recognition errors.
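One simple way to realize subword units of varying complexity from a phonetic transcription is to index overlapping phone n-grams of increasing length, roughly as sketched below; the transcription and the choice of n are illustrative, not the paper's exact unit inventory.

```python
# Illustrative derivation of subword indexing units (overlapping phone
# n-grams) from a phonetic transcription.
from collections import Counter

def phone_ngrams(phones, n):
    """Return overlapping phone n-grams used as indexing terms."""
    return ["_".join(phones[i:i + n]) for i in range(len(phones) - n + 1)]

# A hypothetical phonetic transcription of a spoken message fragment.
transcription = ["s", "p", "ow", "k", "ax", "n", "d", "aa", "k", "y", "ax", "m"]

# Index terms of varying complexity: unigrams (phones) up to trigrams.
index = Counter()
for n in (1, 2, 3):
    index.update(phone_ngrams(transcription, n))

print(index.most_common(5))
```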
ABSTRACT
This paper addresses the recognition of French audiovisual vowels at various signal-to-noise ratios (SNRs). It presents a new non-linear preprocessing of the audio data which enables an estimation of the reliability of the audio sensor in relation to the SNR, and a significant increase in recognition performance at the output of the fusion process.
ABSTRACT
This paper considers applications of automatic speech recognition and speaker verification techniques in developing efficient Voice Messaging Systems for Integrated Services Digital Network (ISDN) based communication systems. The prototype demonstrator presented was developed in the framework of a cooperative project involving two research institutes, the Signal Processing Laboratory (LTS) of the Swiss Federal Institute of Technology Lausanne (EPFL) and the Dalle Molle Institute for Perceptive Artificial Intelligence, Martigny (IDIAP), and three industrial partners: Advanced Communication Services (aComm), Sun Microsystems (Switzerland) and the Swiss Telecom PTT [1]. The project is supported by the Commission for Technology and Innovation (CTI). The goal of the project is to make basic technologies for automatic speech recognition and speaker verification on a multi-processor SunSPARC workstation and SwissNet (ISDN) platform available to the industrial partners. The developed algorithms provide the necessary tools to design and implement workstation-oriented voice messaging demonstrators for telephone-quality Swiss French. The speech recognition algorithms are based on speaker-independent flexible vocabulary technology, and speaker verification is performed by a number of techniques executed in parallel and combined for an optimal decision. The recognition results obtained validate the flexible vocabulary approach, which offers the potential to build word models for any application vocabulary from a single set of phonetic sub-word units trained with the Swiss French Polyphone database.
ABSTRACT
This paper deals with two methods for automatically finding multiple phonetic transcriptions of words, given sample utterances of the words and an inventory of context-dependent subword units. The two approaches investigated are based on an analysis of the N-best phonetic decodings of the available utterances. In the set of transcriptions resulting from the N-best decoding of all the utterances, the first method selects the K most frequent variants (Frequency Criterion), while the second method selects the K most likely ones (Maximum Likelihood Criterion). Experiments carried out on speaker-independent recognition showed that the performance obtained with the "Maximum Likelihood Criterion" is not much different from that obtained with manual transcriptions. In the case of speaker-dependent speech recognition, estimating the 3 most likely transcription variants of each word yields promising results.
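The two selection criteria can be pictured as follows over a pool of N-best decodings; the data, scores, and the particular reading of "most likely" (best-scoring decoding per variant) are assumptions for illustration only.

```python
# Illustrative selection of transcription variants from pooled N-best
# phonetic decodings; entries are hypothetical.
from collections import Counter

# Each entry: (phone sequence as a tuple, acoustic log-likelihood).
nbest_pool = [
    (("t", "ah", "m", "ey", "t", "ow"), -210.3),
    (("t", "ax", "m", "ey", "t", "ow"), -212.1),
    (("t", "ah", "m", "ey", "t", "ow"), -208.7),
    (("t", "ow", "m", "ey", "t", "ow"), -215.4),
]

def k_most_frequent(pool, k):
    """Frequency Criterion: keep the K variants seen most often."""
    counts = Counter(seq for seq, _ in pool)
    return [seq for seq, _ in counts.most_common(k)]

def k_most_likely(pool, k):
    """Maximum Likelihood Criterion (one plausible reading): keep the K
    variants whose best decoding has the highest log-likelihood."""
    best = {}
    for seq, ll in pool:
        best[seq] = max(ll, best.get(seq, float("-inf")))
    return sorted(best, key=best.get, reverse=True)[:k]

print(k_most_frequent(nbest_pool, 2))
print(k_most_likely(nbest_pool, 2))
```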
ABSTRACT
This paper presents methods to improve speech recognition accuracy by incorporating automatic lip reading. Lip reading accuracy is improved by the following approaches: 1) collection of synchronous image and speech data for 5240 words, 2) feature extraction of 2-dimensional power spectra around the mouth, and 3) sub-word unit HMMs with tied-mixture distributions (tied-mixture HMMs). Experiments on a 100-word test show a performance of 85% by lip reading alone. It is also shown that tied-mixture HMMs improve lip reading accuracy. Speech recognition experiments integrating audio-visual information are carried out over various SNRs. The results show that the integration always achieves better performance than using either audio or visual information alone.
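A minimal sketch of one plausible form of 2-D power-spectrum feature extraction from a mouth region is given below; the ROI size, the low-frequency truncation, and the log compression are assumptions, not the paper's exact front-end.

```python
# Illustrative 2-D power-spectrum feature extraction from a mouth ROI.
import numpy as np

def mouth_spectrum_features(mouth_roi, keep=8):
    """Compute the 2-D FFT power spectrum of a grayscale mouth image and keep
    only the low-frequency coefficients as a compact feature vector."""
    spectrum = np.fft.fft2(mouth_roi)
    power = np.abs(spectrum) ** 2
    # Low spatial frequencies sit in the top-left corner before fftshift.
    return np.log1p(power[:keep, :keep]).ravel()

# A hypothetical 32x32 grayscale mouth image.
roi = np.random.rand(32, 32)
print(mouth_spectrum_features(roi).shape)   # (64,)
```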
ABSTRACT
This paper presents recent work on continuous speech labelling. We propose an original automatic labelling system in which elementary phone models take a segmental analysis and the phone duration into account. These models are initialized by a short speaker-independent training stage in order to constitute a model database. From the standard phonetic transcription, phonological rules are applied to account for the various pronunciations. For each new corpus or speaker, a quick unsupervised adaptation stage is performed to re-estimate the models, followed by the actual labelling. We assess this system by labelling a difficult corpus (sequences of connected spelled letters) and sentences from one speaker of the BREF80 corpus. The results are quite promising: in the two experiments, less than 9% of the phonetic boundaries are incorrectly located.
ABSTRACT
This contribution reports on a method to automatically detect the disturbing Robot Voice and Ping Pong effects which occur in GSM-transmitted speech. Both effects are caused by the frame substitution technique recommended by the GSM standard: in these cases the transmitted speech may be modulated by a disturbing 50 Hz component. These modulations can be detected very easily in the frequency domain. By a framewise comparison of the modulation amplitude of an undisturbed clean speech signal with that of a test signal, it is possible to locate the occurrence of Robot Voice and Ping Pong very precisely. Comparing human perception to the outcome of the proposed algorithm shows a high degree of correspondence.
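A rough sketch of flagging frames with a strong 50 Hz modulation component is shown below; the frame length, band limits, and threshold are assumptions, not the parameters used in the paper.

```python
# Illustrative framewise measurement of 50 Hz amplitude modulation strength.
import numpy as np

def modulation_50hz_strength(signal, fs, frame_len=0.08, hop=0.02):
    """For each analysis frame, measure the energy of the amplitude envelope
    near 50 Hz relative to the total envelope energy."""
    n, h = int(frame_len * fs), int(hop * fs)
    envelope = np.abs(signal)
    strengths = []
    for start in range(0, len(envelope) - n, h):
        frame = envelope[start:start + n] * np.hanning(n)
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        band = (freqs > 40) & (freqs < 60)          # around 50 Hz
        strengths.append(spec[band].sum() / (spec[1:].sum() + 1e-12))
    return np.array(strengths)

# Frames whose relative 50 Hz modulation energy exceeds a threshold are flagged.
fs = 8000
test = np.random.randn(fs)                           # stand-in for a test signal
flags = modulation_50hz_strength(test, fs) > 0.2
```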
ABSTRACT
New light is thrown on the Portnoff [1] speech signal time-scale modification algorithm. It is shown in particular that the Portnoff algorithm easily accommodates expansion factors greater than 2 without causing reverberation or chorusing. The modified Portnoff algorithm, which draws on spectral modification techniques due to Seneff [2], has been tested on several speech signals. The quality of the synthesized signal is entirely satisfactory even for large expansion factors. The article gives a brief summary of the Portnoff algorithm and spells out the modifications introduced. It is shown that the phase unwrapping procedure constitutes a crucial point of the algorithm.
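The sketch below shows a generic short-time Fourier (phase-vocoder style) time-scale expansion with explicit phase unwrapping, to make the role of that step concrete; it is not Portnoff's exact formulation, and all parameters are illustrative.

```python
# Generic phase-vocoder time-scale expansion with explicit phase unwrapping.
import numpy as np

def time_stretch(x, rate, n_fft=1024, hop_a=256):
    """Stretch signal x by `rate` (>1 expands) using STFT phase propagation."""
    hop_s = int(round(hop_a * rate))              # synthesis hop
    win = np.hanning(n_fft)
    bin_freqs = 2 * np.pi * np.arange(n_fft // 2 + 1) / n_fft

    frames = []
    for start in range(0, len(x) - n_fft, hop_a):
        frames.append(np.fft.rfft(win * x[start:start + n_fft]))

    out = np.zeros(hop_s * len(frames) + n_fft)
    phase = np.angle(frames[0])
    prev = frames[0]
    for i, cur in enumerate(frames):
        if i > 0:
            # Phase unwrapping: deviation of the measured phase increment from
            # the increment expected for each bin's nominal frequency.
            dphi = np.angle(cur) - np.angle(prev) - hop_a * bin_freqs
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            true_freq = bin_freqs + dphi / hop_a
            phase += hop_s * true_freq            # propagate at synthesis hop
            prev = cur
        frame = np.fft.irfft(np.abs(cur) * np.exp(1j * phase))
        out[i * hop_s:i * hop_s + n_fft] += win * frame
    return out

stretched = time_stretch(np.random.randn(16000), rate=2.5)
```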
ABSTRACT
In this paper, a semi-tight coupling between the visual and auditory modalities is proposed: in particular, eye fixation information is used to enhance the output of speech recognition systems. This is achieved by treating natural human eye fixations as deictic references to symbolic objects and passing this information on to the speech recognizer. The speech recognizer biases its search towards this set of symbols/words during the best word sequence search. As an illustrative example, the TRAINS interactive planning assistant system has been used as a test-bed; eye fixations provide important cues to city names which the user sees on the map. Experimental results indicate that eye fixations help reduce speech recognition errors. This work suggests that integrating information from different interfaces so that they bootstrap each other would enable the development of reliable and robust interactive multi-modal human-computer systems.
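One simple way to let fixations bias the recognizer, sketched below, is to add a bonus to hypotheses containing fixated referents during N-best rescoring; the scoring scheme, bonus value, and data are assumptions, not the mechanism used with TRAINS.

```python
# Illustrative rescoring of recognition hypotheses using eye-fixation cues.
def rescore_with_fixations(nbest, fixated_words, bonus=2.0):
    """nbest: list of (word_sequence, base_score); higher score is better.
    Add a bonus for each hypothesis word that matches a fixated referent."""
    rescored = []
    for words, score in nbest:
        boost = bonus * sum(1 for w in words if w.lower() in fixated_words)
        rescored.append((words, score + boost))
    return max(rescored, key=lambda item: item[1])[0]

# Hypothetical N-best output and fixation-derived city names.
nbest = [(["go", "to", "austin"], -41.2), (["go", "to", "boston"], -40.8)]
fixations = {"austin", "corning"}
print(rescore_with_fixations(nbest, fixations))   # ['go', 'to', 'austin']
```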
ABSTRACT
This paper proposes a method for recovering broadband speech from narrowband speech based on piecewise linear mapping. In this method, the narrowband spectrum envelope of the input speech is transformed into a broadband spectrum envelope using linear transformation matrices associated with several spectrum spaces. These matrices are estimated from speech training data so as to minimize the mean square error between the transformed and the original spectra. The algorithm is compared with the following methods: (1) codebook mapping and (2) a neural network. In an evaluation using a spectral distance measure, the proposed method was found to achieve lower spectral distortion than the other methods. Perceptual experiments also indicate good performance for the reconstructed broadband speech.
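The piecewise linear mapping can be pictured as "pick the nearest spectrum space, then apply its matrix"; in the toy example below the centroids and matrices are random placeholders standing in for quantities that would be estimated from training data.

```python
# Illustrative piecewise linear mapping from a narrowband to a broadband
# spectral envelope.
import numpy as np

rng = np.random.default_rng(0)
n_classes, narrow_dim, broad_dim = 4, 10, 18

centroids = rng.standard_normal((n_classes, narrow_dim))             # class centers
matrices = rng.standard_normal((n_classes, broad_dim, narrow_dim))   # per-class maps

def expand_envelope(narrow_env):
    """Map a narrowband envelope vector to a broadband envelope estimate."""
    k = np.argmin(np.linalg.norm(centroids - narrow_env, axis=1))
    return matrices[k] @ narrow_env

print(expand_envelope(rng.standard_normal(narrow_dim)).shape)   # (18,)
```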
ABSTRACT
The high noise levels experienced in some military fast jet aircraft and helicopters generally result in a reduction in the intelligibility of speech communications. A study has been conducted to assess the effect of reducing noise levels at the ear, by the use of current Active Noise Reduction (ANR) systems, on speech intelligibility in aircraft noise environments. The results of this study indicate that ANR would improve speech intelligibility in both types of aircraft. The assessment was conducted using Diagnostic Rhyme Test (DRT) and Articulation Index (AI) techniques. The study has also allowed the correlation between DRT and AI test results to be investigated. A more detailed account of the work reported in this paper is provided in [1].
ABSTRACT
The objective of the Olga project is to develop an interactive 3D animated talking agent. A futuristic application scenario is interactive digital TV, where the Olga agent would guide naive users through the various services available on the network. The current application is a consumer information service for microwave ovens. Olga required the development of a system with components from many different fields: multimodal interfaces, dialogue management, speech recognition, speech synthesis, graphics, animation, facilities for direct manipulation, and database handling. To integrate all knowledge sources, Olga is implemented as separate modules communicating with a central dialogue and interaction manager. In this paper we mainly describe the talking animated agent and the dialogue manager. There is also a short description of the preliminary speech recogniser used in the project.
ABSTRACT
Within the framework of a prospective ergonomic approach, we simulated two multimodal user interfaces in order to study the usability of constrained vs. spontaneous speech in a multimodal environment. The first experiment, which served as a reference, gave subjects the opportunity to use speech and gestures freely, while subjects in the second experiment had to comply with multimodal constraints. We first describe the experimental setup and the approach we adopted for designing the artificial command language used in the second experiment. We then present the results of our analysis of the subjects' utterances and gestures, laying emphasis on their implementation of the linguistic constraints. The conclusions of the empirical assessment of the usability of this multimodal command language, built from a restricted subset of natural language and simple designation gestures, are associated with recommendations which may prove useful for improving the usability of oral human-computer interaction in a multimodal environment.
ABSTRACT
In current speech applications, facilities to correct recognition errors are limited to either choosing among alternative hypotheses (either by voice or by mouse click) or respeaking. Information from the context of a repair is ignored. We developed a method which improves the accuracy of correcting speech recognition errors interactively by taking into account the context of the repair interaction. The basic idea is to use the same language modeling information used in the initial decoding of continuous speech input for decoding (isolated word) repair input. The repair is not limited to speech: the user can choose to switch modality, for instance spelling or handwriting a word. We implemented this idea by rescoring N-best lists obtained from decoding the repair input, using language model scores for trigrams which include the corrected word. We evaluated the method on a set of repairs by respeaking, spelling and handwriting which we collected with our prototypical continuous speech dictation interface. The method can increase the accuracy of repair significantly, compared to recognizing the repair input as an independent event.
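A minimal sketch of rescoring an isolated-word repair with the trigram context of the corrected word is given below; the toy language model, interpolation weight, and word lists are assumptions, not the evaluated system.

```python
# Illustrative N-best rescoring of a repair using trigram context.
import math

def rescore_repair(nbest, left_context, trigram_logprob, lm_weight=0.5):
    """nbest: list of (candidate_word, acoustic_score) for the repair input.
    left_context: the two words preceding the word being corrected."""
    w1, w2 = left_context
    best, best_score = None, -math.inf
    for word, acoustic in nbest:
        score = acoustic + lm_weight * trigram_logprob(w1, w2, word)
        if score > best_score:
            best, best_score = word, score
    return best

# A toy trigram model: favors "meeting" after "schedule a".
def toy_trigram(w1, w2, w3):
    table = {("schedule", "a", "meeting"): -0.5, ("schedule", "a", "meal"): -4.0}
    return table.get((w1, w2, w3), -8.0)

nbest = [("meal", -10.0), ("meeting", -11.0)]
print(rescore_repair(nbest, ("schedule", "a"), toy_trigram))   # meeting
```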
ABSTRACT
This paper examines the influence of head orientation on liptracking. There are two main conclusions: first, lip gesture analysis and head movement correction should be processed independently; second, the measurement of articulatory parameters may be corrupted by head movement if it is performed directly at the pixel level. We thus propose an innovative liptracking technique which relies on a "3D active contour" model of the lips controlled by articulatory parameters. The 3D model is projected onto the image of a speaking face through a camera model, thus allowing spatial re-orientation of the head. Liptracking is then performed by automatic adjustment of the control parameters, independently of head orientation. The final objective of our study is to apply a pixel-based method to detect head orientation. Nevertheless, we consider that head motion and lip gestures are detected by different processes, whether cognitive (in humans) or computational (in machines). For this reason, we decided to first develop and evaluate orientation-free liptracking through a non-video-based head motion detection technique, which is presented here.
ABSTRACT
We have developed a visual speech synthesizer from unlimited French text and synchronized it with an audio text-to-speech synthesizer also developed at the ICP (Le Goff & Benoît, 1996). The front-end of our synthesizer is a 3-D model of the face whose speech gestures are controlled by eight parameters: five for the lips, one for the chin, and two for the tongue. In contrast to most existing systems, which are based on a limited set of prestored facial images, we have adopted the parametric approach to coarticulation first proposed by Cohen and Massaro (1993). We have thus implemented a coarticulation model based on spline-like functions, defined by three coefficients, applied to each target in a library of 16 French visemes. However, unlike Cohen & Massaro (1993), we have adopted a data-driven approach to identify the many coefficients necessary to model coarticulation. To do so, we systematically analyzed an ad-hoc corpus uttered by a French male speaker. We then ran an intelligibility test to quantify the benefit of seeing the synthetic face (in addition to hearing the synthetic voice) under several conditions of background noise.
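To make the parametric coarticulation idea concrete, the sketch below computes a lip-opening trajectory as a dominance-weighted average of viseme targets in the spirit of Cohen & Massaro (1993); the three coefficients per target and the target values are placeholders, not the data-driven values identified in the paper.

```python
# Illustrative dominance-based coarticulation: each viseme target exerts a
# dominance function over time, and the parameter trajectory is the
# dominance-weighted average of the targets.
import numpy as np

def dominance(t, center, alpha, theta, c):
    """Negative-exponential dominance of a target centered at `center`."""
    return alpha * np.exp(-theta * np.abs(t - center) ** c)

def lip_trajectory(t, targets):
    """targets: list of (center_time, target_value, alpha, theta, c)."""
    num = np.zeros_like(t)
    den = np.zeros_like(t)
    for center, value, alpha, theta, c in targets:
        d = dominance(t, center, alpha, theta, c)
        num += d * value
        den += d
    return num / np.maximum(den, 1e-9)

t = np.linspace(0.0, 0.6, 300)                      # seconds
# Hypothetical lip-opening targets for three successive visemes.
visemes = [(0.10, 0.2, 1.0, 12.0, 1.0),
           (0.30, 0.9, 1.0, 10.0, 1.0),
           (0.50, 0.3, 1.0, 12.0, 1.0)]
opening = lip_trajectory(t, visemes)
```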
ABSTRACT
In the framework of the European ESPRIT project MIAMI ("Multimodal Integration for Advanced Multimedia Interfaces"), a platform has been developed at the ICP to study various combinations of audio-visual speech processing, including real-time lip motion analysis, real-time synthesis of models of the lips and of the face, audio-visual speech recognition of isolated words, and text-to-audio-visual speech synthesis in French. All these facilities are implemented on a network of three SGI computers. Not only is this platform a useful research tool to study the production and perception of visible speech as well as audio-visual integration by humans and by machines, but it is also a convenient testbed to study man-machine multimodal interaction and very low bit rate audio-visual speech communication between humans.
ABSTRACT
To cope with the abundance of information available over the Internet, an efficient, accurate and user-friendly system for information retrieval is mandatory. This paper presents an intelligent system based on the use of spoken dialogue as the main channel for the user-system interface, the use of key concepts, the processing of unknown words, the automatic acquisition of various kinds of knowledge for improving performance, and agent technologies for system realization. Details of the functions required for the agents are also described.
ABSTRACT
The human auditory system is insensitive to the phase information of the speech signal. By taking advantage of this fact, data such as the transcript, some keywords, and copyright information can be embedded into the speech signal by altering the phase in a predefined manner. In this paper, an all-pass filtering based data embedding scheme is developed for speech signals. Since all-pass filters modify only the phase without affecting the magnitude response, they are employed to diffuse data into the speech signal by filtering different portions of the speech signal with different all-pass filters. The embedded data can be retrieved by tracking the zeros of the all-pass filters.
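A rough sketch of the embedding side is given below: successive speech segments are filtered with one of two first-order all-pass filters according to the bit to embed. The pole values, segment length, and the omission of the zero-tracking retrieval step are assumptions, not the paper's scheme.

```python
# Illustrative bit embedding by segment-wise all-pass filtering.
import numpy as np
from scipy.signal import lfilter

def allpass(x, a):
    """First-order all-pass filter H(z) = (a + z^-1) / (1 + a z^-1):
    unit magnitude response, phase shaped by the real pole at -a."""
    return lfilter([a, 1.0], [1.0, a], x)

def embed_bits(speech, bits, seg_len=1024, a0=0.3, a1=0.7):
    """Filter successive segments with the all-pass filter chosen by each bit."""
    out = speech.astype(float).copy()
    for i, bit in enumerate(bits):
        seg = slice(i * seg_len, (i + 1) * seg_len)
        out[seg] = allpass(out[seg], a1 if bit else a0)
    return out

marked = embed_bits(np.random.randn(8192), bits=[1, 0, 1, 1])
```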
ABSTRACT
The McGurk effect or fusion illusion, in which mismatched auditory and visual speech sound components are perceived as an emergent phone, is extensively used in auditory-visual speech perception research. The usual method of running experiments involves time-consuming preparation of dubbed videotapes. This paper describes an alternative, the Computerised Auditory-Visual Experiment (CAVE), in which audio dubbing occurs on-line. Its advantages include reduced preparation time, greater flexibility, and on-line collection of response type and latency data.