Utterance Verification and Word Spotting 1 / Speaker Adaptation 1

Word Verification Using Confidence Measures in Speech Recognition

Authors:

M. Carmen Benítez, Universidad de Granada (Spain)
Antonio Rubio, Universidad de Granada (Spain)
Pedro García, Universidad de Granada (Spain)
Jesus Diaz-Verdejo, Universidad de Granada (Spain)

Page (NA) Paper number 1082

Abstract:

In this work we propose a novel way of classifying the words recognized by a speech recognition system as correctly or incorrectly detected. The procedure consists of extracting a set of characteristics for each word. Using these characteristics, we have built two classifiers: the first is a vector quantizer, while the second, though also a vector quantizer, was trained using an adaptive learning technique (LVQ). The results obtained show an improvement in the performance of the recognizer, achieved by reducing the number of insertions with no significant reduction in the number of correctly detected words.
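
The second classifier above is a vector quantizer trained with LVQ. As a rough illustration of that idea (not the authors' implementation), the Python sketch below shows an LVQ1-style update for two word classes, "correct" and "insertion"; the feature set, learning rate and schedule are assumptions.

import numpy as np

def train_lvq1(features, labels, codebook, codebook_labels, lr=0.05, epochs=20):
    """LVQ1-style training: move the nearest codebook vector towards the
    sample if its class matches, away from it otherwise.

    features:        (N, D) word-level feature vectors
    labels:          (N,) class per word, e.g. 1 = correct, 0 = insertion
    codebook:        (K, D) initial codebook (e.g. from plain VQ / k-means)
    codebook_labels: (K,) class assigned to each codebook vector
    """
    codebook = codebook.copy()
    for _ in range(epochs):
        for x, y in zip(features, labels):
            k = np.argmin(np.linalg.norm(codebook - x, axis=1))  # nearest prototype
            sign = 1.0 if codebook_labels[k] == y else -1.0      # attract or repel
            codebook[k] += sign * lr * (x - codebook[k])
        lr *= 0.9  # shrink the step size after each epoch
    return codebook

def classify_word(x, codebook, codebook_labels):
    """Label a word-level feature vector with the class of its nearest prototype."""
    return codebook_labels[np.argmin(np.linalg.norm(codebook - x, axis=1))]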

SL981082.PDF (From Author) SL981082.PDF (Rasterized)



Improving Posterior Based Confidence Measures in Hybrid HMM/ANN Speech Recognition Systems

Authors:

Giulia Bernardis, Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) (Switzerland)
Hervé Bourlard, Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) - Swiss Federal Institute of Technology (EPFL) (Switzerland)

Page (NA) Paper number 318

Abstract:

In this paper we define and investigate a set of confidence measures based on hybrid Hidden Markov Model/Artificial Neural Network acoustic models. These measures use the neural network to estimate the local phone posterior probabilities, which are then combined and normalized in different ways. Experimental results show that the use of an appropriate duration normalization is very important for obtaining good estimates of the phone and word confidences. The different measures are evaluated at the phone and word levels on both isolated word (PHONEBOOK) and continuous speech (BREF) recognition tasks. It is shown that one of these confidence measures is well suited for utterance verification, and that (as one could expect) confidence measures at the word level perform better than those at the phone level. Finally, using the resulting approach on PHONEBOOK to rescore the N-best list is shown to yield a 34% decrease in word error rate.
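
One simple member of this family of measures, sketched below under the assumption that the ANN outputs per-frame phone posteriors and that a forced alignment provides phone boundaries, averages log posteriors over each phone's duration (the duration normalization stressed above) and then averages phone confidences into a word confidence; the variable names are illustrative, not the paper's notation.

import numpy as np

def phone_confidence(frame_posteriors):
    """Duration-normalized confidence for one phone segment.

    frame_posteriors: 1-D array of the phone's posterior probability for
    every frame the decoder aligned to that phone.
    """
    # Averaging the log posteriors over the segment length removes the bias
    # towards short segments that a plain sum of log posteriors would have.
    return float(np.mean(np.log(np.clip(frame_posteriors, 1e-10, 1.0))))

def word_confidence(phone_segments):
    """Average the phone confidences of a word's aligned phone segments."""
    return float(np.mean([phone_confidence(seg) for seg in phone_segments]))

# Toy example: a two-phone word with its aligned per-frame posteriors.
print(word_confidence([np.array([0.9, 0.8, 0.85]), np.array([0.6, 0.7])]))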

SL980318.PDF (From Author) SL980318.PDF (Rasterized)



Two-Pass Utterance Verification Algorithm for Long Natural Numbers Recognition

Authors:

Javier Caminero, Telefonica I+D (Spain)
Eduardo López, ETSIT-UPM (Spain)
Luis A. Hernández, ETSIT-UPM (Spain)

Page (NA) Paper number 440

Abstract:

There are many applications based on spontaneous dialogue recognition, such as home banking, where the recognition of long numbers is crucial to completing a user request. Rejection and Utterance Verification (UV) are difficult problems in these applications. In this contribution we improve our previously proposed UV procedure in order to increase the correction of recognition errors, to resolve grammatical ambiguities from the user, and to make the rejection of misrecognized or out-of-vocabulary (OOV) utterances more efficient. In addition to its verification performance, the proposed algorithm complies with the real-time constraints that are mandatory in real applications. We evaluate our method and present recognition results from the long natural number recognition task of a real data-driven application over the telephone line in a multilingual environment. Experimental results show that the proposed method obtains a significant reduction in recognition errors and achieves an extraordinarily low false acceptance rate in all cases for different languages.

SL980440.PDF (From Author) SL980440.PDF (Rasterized)



A*-Admissible Key-Phrase Spotting With Sub-Syllable Level Utterance Verification

Authors:

Berlin Chen, Institute of Information Science, Academia Sinica (Taiwan)
Hsin-Min Wang, Institute of Information Science, Academia Sinica (Taiwan)
Lee-Feng Chien, Institute of Information Science, Academia Sinica (Taiwan)
Lin-Shan Lee, The Department of CSIE, National Taiwan University (Taiwan)

Page (NA) Paper number 305

Abstract:

In this paper, we propose an A*-admissible key-phrase spotting framework, which requires little domain knowledge and is capable of extracting salient key-phrase fragments from an input utterance in real time. There are two key features in our approach. First, the acoustic models and the search framework are specially designed so that a very high degree of vocabulary flexibility can be achieved for any desired application task. Second, the search framework uses an efficient two-pass A* search to generate N-best key-phrase candidates, and then several sub-syllable level verification functions are properly weighted and used to further improve the recognition accuracy. Experimental results show that A*-admissible key-phrase spotting with sub-syllable level utterance verification outperforms the baseline methods used in common approaches.

SL980305.PDF (From Author) SL980305.PDF (Rasterized)



Speaker-Independent Upfront Dialect Adaptation in a Large Vocabulary Continuous Speech Recognizer

Authors:

Volker Fischer, IBM Speech Systems, European Speech Research (Germany)
Yuqing Gao, IBM Research, Human Language Technologies (USA)
Eric Janke, IBM United Kingdom Laboratories (U.K.)

Page (NA) Paper number 233

Abstract:

Large vocabulary continuous speech recognition systems show a significant decrease in performance if a user's pronunciation differs greatly from those observed during system training. This can be considered the main reason why most commercially available systems recommend, if not require, that the individual end user read an enrollment script for the speaker-dependent re-estimation of acoustic model parameters. Thus, improving recognition rates for dialect speakers is an important issue, both with respect to broader acceptance and to more convenient or natural use of such systems. This paper compares different techniques that aim at better speaker-independent recognition of dialect speech in a large vocabulary continuous speech recognizer. The methods discussed comprise Bayesian adaptation and speaker clustering techniques and address both the availability and the absence of dialect training material. Results are given for a case study aimed at improving a German speech recognizer for Austrian speakers.
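
Bayesian (MAP) adaptation of Gaussian means, one of the techniques the paper compares, interpolates a speaker-independent mean with statistics gathered from dialect data. The sketch below is a generic single-Gaussian version of that update, with tau as an assumed prior weight; it is not the authors' exact formulation.

import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimation of one Gaussian mean from dialect adaptation data.

    prior_mean: speaker/dialect-independent mean (the prior)
    frames:     (N, D) feature frames
    posteriors: (N,) occupation probabilities of this Gaussian per frame
    tau:        prior weight; larger values keep the estimate near the prior
    """
    occ = posteriors.sum()
    data_mean = (posteriors[:, None] * frames).sum(axis=0) / max(occ, 1e-10)
    # With little dialect data the prior dominates; with a lot of data the
    # estimate moves towards the data mean.
    return (tau * prior_mean + occ * data_mean) / (tau + occ)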

SL980233.PDF (From Author)



Word-Based Acoustic Confidence Measures for Large-Vocabulary Speech Recognition

Authors:

Asela Gunawardana, Microsoft Research (USA)
Hsiao-Wuen Hon, Microsoft Research (USA)
Li Jiang, Microsoft Research (USA)

Page (NA) Paper number 401

Abstract:

Word-level confidence measures are of use in many areas of speech recognition. Comparing the hypothesized word score to the score of a 'filler' model has been the most popular confidence measure because it is highly efficient and does not require a large amount of training data. This paper explores an extension of this technique which also compares the hypothesized word score to the scores of words that are commonly confused with it, while maintaining efficiency and the low demand for training data. The proposed method gives a 39% relative false accept rate reduction over the 'filler'-model baseline, at a false reject rate of 5%.
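
A plausible rendering of this confidence, assuming per-segment acoustic log-scores are available for the hypothesized word, the filler model, and a precomputed list of confusable words (all names below are illustrative):

def word_confidence(word_score, filler_score, confusable_scores):
    """Confidence of a hypothesized word over its speech segment.

    word_score:        acoustic log-score of the hypothesized word
    filler_score:      log-score of the generic 'filler' model on the segment
    confusable_scores: log-scores of words commonly confused with the hypothesis
    """
    # The competitor is the best of the filler model and the confusable words;
    # a large positive margin means the hypothesis clearly beats its rivals.
    competitor = max([filler_score] + list(confusable_scores))
    return word_score - competitor

def accept(word_score, filler_score, confusable_scores, threshold=0.0):
    """Accept the word if its confidence clears an operating-point threshold."""
    return word_confidence(word_score, filler_score, confusable_scores) >= threshold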

SL980401.PDF (From Author) SL980401.PDF (Rasterized)



Improved Utterance Rejection Using Length Dependent Thresholds

Authors:

Sunil K. Gupta, Bell Laboratories - Lucent Technologies (USA)
Frank K. Soong, Bell Laboratories - Lucent Technologies (USA)

Page (NA) Paper number 1040

Abstract:

In this paper, we propose to use an utterance length (duration) dependent threshold for rejecting an unknown input utterance with a general speech (garbage) model. A general speech model, compared with more sophisticated anti-subword models, is a more viable solution to the utterance rejection problem for low-cost applications with stringent storage and computational constraints. However, the rejection performance of such a general model with a fixed, universal rejection threshold is in general worse than that of anti-models with higher discrimination. Without adding complexity to the rejection algorithm, we propose to vary the rejection threshold according to the utterance length. The experimental results show that significant improvement in rejection performance can be obtained by using the proposed length-dependent rejection threshold instead of a fixed threshold. We investigate utterance rejection in a command phrase recognition task. The equal error rate, a good figure of merit for calibrating the performance of utterance verification algorithms, is reduced by almost 23% when the proposed length-dependent threshold is used.
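
A minimal sketch of the idea, assuming the verification score is a duration-normalized log-likelihood ratio against the garbage model: the threshold is interpolated from the utterance length rather than held fixed. The bin boundaries and threshold values below are invented for illustration.

import numpy as np

# Hypothetical thresholds tuned per utterance length (seconds); short
# utterances get a stricter threshold because their scores are less reliable.
LENGTH_BINS = np.array([0.5, 1.0, 2.0, 4.0])
THRESHOLDS  = np.array([0.8, 0.5, 0.3, 0.2])

def length_dependent_threshold(duration_sec):
    """Interpolate a rejection threshold from the utterance duration."""
    return float(np.interp(duration_sec, LENGTH_BINS, THRESHOLDS))

def accept_utterance(verification_score, duration_sec):
    """Accept only if the score clears the duration-dependent threshold."""
    return verification_score >= length_dependent_threshold(duration_sec)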

SL981040.PDF (From Author) SL981040.PDF (Rasterized)



Bayesian Constrained Frequency Warping HMMs for Speaker Normalisation

Authors:

Ching Hsiang Ho, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Aimin Chen, The Queen's University of Belfast (Ireland)

Page (NA) Paper number 370

Abstract:

This paper presents a Bayesian constrained frequency warping technique. The Bayesian approach provides for the inclusion of prior information on the frequency warping parameter and for adjusting the search range in order to obtain the best warping factor for the given HMMs. We introduce novel frequency warping (FWP) HMMs, which are differently warped versions of the HMMs. Instead of frequency warping the input speech, we warp the spectrum of the HMMs. This is equivalent to HMMs which have both time and frequency warping capabilities. Experimentally, FWP HMMs outperform the conventional constrained frequency warping approach. Furthermore, the best warping factor is estimated in two stages, a coarse stage followed by a fine stage. This method efficiently locates the optimal warping factor and normalises the FWP HMMs.
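
The two-stage estimation of the warping factor can be pictured as a coarse grid search followed by a fine search around the best coarse value. The sketch below assumes a callable that returns the adaptation data's log-likelihood under the warped HMMs; the search range and step sizes are illustrative, not the paper's settings.

import numpy as np

def estimate_warp_factor(log_likelihood, low=0.88, high=1.12,
                         coarse_step=0.04, fine_step=0.01):
    """Two-stage (coarse, then fine) search for the best frequency warp factor.

    log_likelihood: callable mapping a warp factor to the log-likelihood of
                    the speaker's data under the correspondingly warped HMMs.
    """
    # Coarse stage: scan the whole allowed range with a large step.
    coarse = np.arange(low, high + 1e-9, coarse_step)
    best = max(coarse, key=log_likelihood)
    # Fine stage: refine around the best coarse value with a small step.
    fine = np.arange(best - coarse_step, best + coarse_step + 1e-9, fine_step)
    return float(max(fine, key=log_likelihood))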

SL980370.PDF (From Author) SL980370.PDF (Rasterized)



An Evaluation of Keyword Spotting Performance Utilizing False Alarm Rejection Based on Prosodic Information

Authors:

Masaki Ida, OMRON Corporation (Japan)
Ryuji Yamasaki, OMRON Corporation (Japan)

Page (NA) Paper number 159

Abstract:

In this paper, we describe our effort in developing a new false alarm rejection method for a keyword-spotting speech recognition system that we developed about a year ago. This false alarm rejection uses prosodic similarities and works on a posterior rescoring basis. In keyword spotting, false alarms are always a problem. Here, we propose a technique to reject those false alarms using prosodic features. In Japanese, prosodic information is expressed in the form of intonation, while many other languages use stress accents. Therefore, it is easy to calculate prosodic information in our language using the fundamental frequency (F0). In our new keyword spotting engine, we obtain the result by combining two scores. One is a phonetic score calculated by the front engine, and the other is a pitch score calculated by the post engine described in this paper. We achieved a 13-point improvement in keyword recognition accuracy using this method.
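
The final decision combines the front engine's phonetic score with the post engine's pitch (F0-contour) score. A minimal sketch of such a combination and rejection step is given below; the weight and threshold are assumed values, not those of the paper.

def combined_keyword_score(phonetic_score, pitch_score, weight=0.3):
    """Mix the front engine's phonetic score with the post engine's
    prosodic (F0 contour similarity) score; the weight is an assumption."""
    return (1.0 - weight) * phonetic_score + weight * pitch_score

def verify_keyword(phonetic_score, pitch_score, threshold=0.5, weight=0.3):
    """Reject keyword candidates whose combined score falls below a threshold."""
    return combined_keyword_score(phonetic_score, pitch_score, weight) >= threshold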

SL980159.PDF (From Author) SL980159.PDF (Rasterized)



Predictive Speaker Adaptation and Its Prior Training

Authors:

Dieu Tran, NEC Corporation (Japan)
Ken-ichi Iso, NEC Corporation (Japan)

Page (NA) Paper number 1089

Abstract:

In this paper, we propose a Predictive Speaker Adaptation (PSA) technique in which a speaker-dependent HMM (SD-HMM) for a new speaker is predicted using adaptation utterances and a speaker-independent HMM (SI-HMM). The method requires prior training in order to estimate the parameters of the prediction function. For this purpose, we first prepare many speakers' fully trained SD-HMMs and their adaptation utterances (the same for all speakers). In addition, many speaker-specific BW-HMMs are built from the SI-HMM by means of Baum-Welch re-estimation on the adaptation utterances. The SD-HMM/BW-HMM model pair for each speaker is used as a training example for the input and output of the prediction function to find the speaker-independent prediction parameters. During adaptation, estimation of the new speaker's SD-HMM is carried out from his BW-HMM with the predetermined parameters. 60,000-word recognition experiments showed a word error-rate reduction of 16% when only 10 adaptation words were used.
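
One plausible form of the prediction function, sketched here, is a linear regression fitted on the training speakers' (BW-HMM, SD-HMM) parameter pairs and then applied to a new speaker's BW-HMM parameters; the paper does not necessarily use this exact form.

import numpy as np

def train_prediction(bw_params, sd_params):
    """Fit a linear map from Baum-Welch-adapted parameters (input) to fully
    trained speaker-dependent parameters (output).

    bw_params, sd_params: (num_speakers, num_params) matrices, one row per
    training speaker, i.e. the model pairs described above.
    """
    # Append a bias column and solve an ordinary least-squares regression.
    X = np.hstack([bw_params, np.ones((bw_params.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, sd_params, rcond=None)
    return W

def predict_sd(bw_params_new, W):
    """Predict a new speaker's SD parameters from his BW-adapted parameters."""
    return np.append(bw_params_new, 1.0) @ W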

SL981089.PDF (From Author) SL981089.PDF (Rasterized)



Powerful Syllabic Fillers for General-Task Keyword-Spotting and Unlimited-Vocabulary Continuous-Speech Recognition

Authors:

Rachida El Méliani, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)

Page (NA) Paper number 837

Abstract:

Unlike other teams, we choose to represent vocabulary words and out-of-vocabulary words with the same set of subword HMMs. Secondly, we replace the classical one-phoneme transcription of fillers in the lexicon with a new, more powerful one-syllable transcription. As for the language model, the problem produced, in the case of unlimited-vocabulary continuous-speech recognition, by the lack of information on new words in the training corpus is solved through the use of the limited information we gathered on new words. The results obtained in general-task keyword spotting as well as unlimited-vocabulary continuous-speech recognition demonstrate the efficiency of choosing a one-syllable transcription rather than a one-phoneme one. As for the results in unlimited-vocabulary continuous-speech recognition, the language model using information from words of frequency one is shown to be a promising new method of determining a language model for new words.

SL980837.PDF (From Author) SL980837.PDF (Rasterized)



Confidence Scoring for Speech Understanding Systems

Authors:

Christine Pao, MIT Lab for Computer Science (USA)
Philipp Schmid, MIT Lab for Computer Science (USA)
James R. Glass, MIT Lab for Computer Science (USA)

Page (NA) Paper number 392

Abstract:

This research investigates the use of utterance-level features for confidence scoring. Confidence scores are used to accept or reject user utterances in our conversational weather information system. We have developed an automatic labeling algorithm based on a semantic frame comparison between recognized and transcribed orthographies. We explore recognition-based features along with semantic, linguistic, and application-specific features for utterance rejection. Discriminant analysis is used in an iterative process to select the best set of classification features for our utterance rejection sub-system. Experiments show that we can correctly reject over 60% of incorrectly understood utterances while accepting 98% of all correctly understood utterances.
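
A small sketch of that selection loop, using scikit-learn's linear discriminant and greedy forward selection over utterance-level features, is shown below as an illustration of the procedure rather than the authors' code; the cross-validation setup is an assumption.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def forward_select(X, y, feature_names, max_features=5):
    """Greedy forward selection of utterance-level features for the
    accept/reject discriminant (y: 1 = correctly understood, 0 = not)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Try adding each remaining feature; keep the one that most improves
        # cross-validated classification accuracy.
        scores = []
        for f in remaining:
            acc = cross_val_score(LinearDiscriminantAnalysis(),
                                  X[:, selected + [f]], y, cv=5).mean()
            scores.append((acc, f))
        _, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return [feature_names[f] for f in selected]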

SL980392.PDF (From Author) SL980392.PDF (Rasterized)



Phonological Rules for Enhancing Acoustic Enrollment of Unknown Words

Authors:

Bhuvana Ramabhadran, IBM T. J. Watson Research Center (USA)
Abraham Ittycheriah, IBM T. J. Watson Research Center (USA)

Page (NA) Paper number 534

Abstract:

Phonetic baseforms are the basic recognition units in most speech recognition systems. These baseforms are usually determined by linguists once a vocabulary is chosen and are not modified thereafter. However, several applications, such as name dialing, require that the user be able to add new words to the vocabulary. These new words are often names, or task-specific jargon, that have user-specific pronunciations. This paper describes a novel method for generating phonetic transcriptions (baseforms) of words based on acoustic evidence alone. It does not require any prior acoustic representation of the new word, is vocabulary independent, and uses phonological rules in a post-processing stage to enhance the quality of the baseforms thus produced. Our experiments demonstrate the high decoding accuracies obtained when baseforms deduced using this approach are incorporated into our speech recognizer. Our experiments also compare the use of acoustic models trained on task-specific data with models trained for general purposes (digit, name, large vocabulary recognition, etc.) for generating phonetic transcriptions.

SL980534.PDF (From Author) SL980534.PDF (Rasterized)



Recognition-Based Word Counting for Reliable Barge-in and Early Endpoint Detection in Continuous Speech Recognition

Authors:

Anand R. Setlur, Lucent Technologies (USA)
Rafid A. Sukkar, Lucent Technologies (USA)

Page (NA) Paper number 168

Abstract:

In this paper, we present a word counting method that enables speech recognition systems to perform reliable barge-in detection and also make a fast and accurate determination of end of speech. This is achieved by examining partial recognition hypotheses and imposing certain "word stability" criteria. Typically, a voice activity detector is used for both barge-in detection and end of speech determination. We propose augmenting the voice activity detector with this more reliable recognition-based method. Experimental results for a connected digit task show that this approach is more robust for supporting barge-in since it is less prone to interrupting the announcement when extraneous speech input is encountered. Also, by using the early endpoint decision criterion, average response times are sped up 75% for this connected digit task.
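
The "word stability" criterion can be illustrated as follows: keep the partial hypothesis produced after each frame and declare barge-in (or, for the full hypothesis, end of speech) once it has stayed unchanged for a run of consecutive partial results. The window length below is an assumed parameter, not the paper's setting.

from collections import deque

class StabilityDetector:
    """Fire once the partial recognition hypothesis stops changing.

    stable_results: number of consecutive partial results that must agree
    before the hypothesis is considered stable (assumed value).
    """

    def __init__(self, stable_results=30):
        self.history = deque(maxlen=stable_results)

    def update(self, partial_words):
        """Feed the current partial word sequence; return True when stable."""
        self.history.append(tuple(partial_words))
        return (len(self.history) == self.history.maxlen
                and len(set(self.history)) == 1)

# Barge-in can be confirmed once at least one word is stable; end of speech
# can be declared early once the whole hypothesis is stable.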

SL980168.PDF (From Author) SL980168.PDF (Rasterized)



Linear Discriminant - A New Criterion For Speaker Normalization

Authors:

Martin Westphal, Interactive Systems Labs (Germany)
Tanja Schultz, Interactive Systems Labs (Germany)
Alex Waibel, Interactive Systems Labs (USA)

Page (NA) Paper number 755

Abstract:

In Vocal Tract Length Normalization (VTLN), a linear or nonlinear frequency transformation compensates for different vocal tract lengths. Finding good estimates for the speaker-specific warp parameters is a critical issue. Despite good results using the Maximum Likelihood criterion to find parameters for a linear warping, there are concerns about this method. We searched for a new criterion that enhances the interclass separability in addition to optimizing the distribution of each phonetic class. Linear Discriminant Analysis uses such a criterion to determine a linear transformation into a lower-dimensional space. For VTLN, we keep the dimension constant and warp the training samples of each speaker such that the Linear Discriminant is optimized. Although that criterion depends on all training samples of all speakers, it can iteratively provide speaker-specific warp factors. We discuss how this approach can be applied in speech recognition and present first results on two different recognition tasks.
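
The criterion in question is the usual linear-discriminant ratio of between-class to within-class scatter over phonetically labelled frames; each speaker's warp factor is chosen to increase it. The sketch below evaluates that ratio (in a simplified, single-speaker trace form) for a set of candidate warp factors and assumes a warping function is supplied by the front end.

import numpy as np

def discriminant_criterion(frames, labels):
    """Between-class over within-class scatter (trace form) for labelled frames."""
    overall_mean = frames.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        cls = frames[labels == c]
        mean_c = cls.mean(axis=0)
        between += len(cls) * np.sum((mean_c - overall_mean) ** 2)
        within += np.sum((cls - mean_c) ** 2)
    return between / within

def best_warp_factor(frames, labels, warp,
                     candidates=np.arange(0.90, 1.11, 0.02)):
    """Pick the warp factor maximizing the criterion for one speaker's frames.

    warp: callable (frames, alpha) -> warped frames, assumed to be provided
    by the front end (e.g. a warped filterbank analysis).
    """
    return float(max(candidates,
                     key=lambda a: discriminant_criterion(warp(frames, a), labels)))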

SL980755.PDF (From Author) SL980755.PDF (Rasterized)



Confidence Measures Derived from an Acceptor HMM

Authors:

Gethin Williams, University of Sheffield (U.K.)
Steve Renals, University of Sheffield (U.K.)

Page (NA) Paper number 644

Abstract:

In this paper we define a number of confidence measures derived from an acceptor HMM and evaluate their performance for the task of utterance verification using the North American Business News (NAB) and Broadcast News (BN) corpora. Results are presented for decodings made at both the word and phone level which show the relative profitability of rejection provided by the diverse set of confidence measures. The results indicate that language model dependent confidence measures have reduced performance on BN data relative to that for the more grammatically constrained NAB data. An explanation linking the observations that rejection is more profitable for noisy acoustics, for a reduced vocabulary and at the phone level is also given.

SL980644.PDF (From Author) SL980644.PDF (Rasterized)



Telephone Speech Multi-Keyword Spotting Using Fuzzy Search Algorithm and Prosodic Verification

Authors:

Chung-Hsien Wu, National Cheng Kung University (China)
Yeou-Jiunn Chen, National Cheng Kung University (China)
Yu-Chun Hung, National Cheng Kung University (China)

Page (NA) Paper number 218

Abstract:

In this paper a fuzzy search algorithm is proposed to deal with recognition errors in telephone speech. Since prosodic information is a very special and important feature of Mandarin speech, we integrate the prosodic information into keyword verification. For multi-keyword detection, we define a keyword relation and a weighting function for reasonable keyword combinations. In the keyword recognizer, 94 INITIAL and 38 FINAL context-dependent Hidden Markov Models (HMMs) are used to construct the phonetic recognizer. For prosodic verification, a total of 175 context-dependent HMMs and five anti-prosodic HMMs are used. In this system, 1275 faculty names and department names are selected as the keywords. Using a test set of 3595 conversational speech utterances from 37 speakers (21 male, 16 female), the proposed fuzzy search algorithm and prosodic verification can reduce the error rate from 17.64% to 11.29% for multiple keywords embedded in non-keyword speech.

SL980218.PDF (From Author) SL980218.PDF (Rasterized)



Topic Recognition for News Speech Based on Keyword Spotting

Authors:

Yoichi Yamashita, Dep. of Computer Science, Ritsumeikan University (Japan)
Toshikatsu Tsunekawa, I.S.I.R., Osaka University (Japan)
Riichiro Mizoguchi, I.S.I.R., Osaka University (Japan)

Page (NA) Paper number 23

Abstract:

This paper describes topic identification for Japanese TV news speech based on the keyword spotting technique. Three thousand nouns are selected as keywords that contribute to topic identification, based on a mutual information criterion and the length of the word. This set of keywords identified the correct topic for 76.3% of articles from newspaper text data. Further, we performed keyword spotting on TV news speech and identified the topics of the spoken message by calculating the possibilities of the topics in terms of the acoustic score of the spotted word and the topic probability of the word. In order to neutralize the effect of false alarms, the bias of the topics in the keyword set is removed. The topic identification rate is 66.5%, assuming that identification is correct if the correct topic is included in the top three topics. The removal of the bias improved the identification rate by 6.1%.
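
A plausible rendering of the topic scoring (not the authors' exact formula): every spotted keyword votes for topics in proportion to its acoustic score and its topic probability estimated from newspaper text, and identification is counted correct when the true topic is among the top three.

def rank_topics(spotted_keywords, topic_prob, topics, top_n=3):
    """Rank candidate topics for a news story from spotted keywords.

    spotted_keywords: list of (word, acoustic_score) pairs from the spotter
    topic_prob:       dict mapping (word, topic) -> P(topic | word),
                      assumed precomputed from newspaper text
    topics:           list of candidate topic labels
    """
    scores = {t: 0.0 for t in topics}
    for word, acoustic_score in spotted_keywords:
        for t in topics:
            scores[t] += acoustic_score * topic_prob.get((word, t), 0.0)
    return sorted(topics, key=lambda t: scores[t], reverse=True)[:top_n]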

SL980023.PDF (From Author) SL980023.PDF (Rasterized)
