Authors:
M. Carmen Benítez, Universidad de Granada (Spain)
Antonio Rubio, Universidad de Granada (Spain)
Pedro García, Universidad de Granada (Spain)
Jesus Diaz-Verdejo, Universidad de Granada (Spain)
Page (NA) Paper number 1082
Abstract:
In this work we propose a novel way of classifying the words recognized
by a speech recognition system as correctly or incorrectly detected.
The procedure consists of extracting a set of characteristics for each
word. Using these characteristics, we have built two classifiers: the
first is a vector quantizer, while the second, though also a vector
quantizer, was trained using an adaptive learning technique (Learning
Vector Quantization, LVQ). The results obtained show an improvement
in the performance of the recognizer, achieved by reducing the number
of insertions with no significant reduction in the number of correctly
detected words.
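As a concrete illustration of the second classifier's training scheme, an LVQ1-style loop can be sketched as follows. This is a generic Learning Vector Quantization update, not the authors' exact configuration; the feature vectors, labels, and learning-rate schedule are invented for the sketch.

```python
def lvq1_train(codebook, labels, data, epochs=30, lr0=0.1):
    """Train a labeled codebook with the LVQ1 rule: move the nearest
    prototype toward the sample if their labels match, away otherwise."""
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)  # linearly decaying learning rate
        for x, y in data:
            # index of the nearest prototype (squared Euclidean distance)
            k = min(range(len(codebook)),
                    key=lambda i: sum((c - v) ** 2
                                      for c, v in zip(codebook[i], x)))
            sign = 1.0 if labels[k] == y else -1.0
            codebook[k] = [c + sign * lr * (v - c)
                           for c, v in zip(codebook[k], x)]
    return codebook

def classify(codebook, labels, x):
    """Label a word's feature vector by its nearest prototype."""
    k = min(range(len(codebook)),
            key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))
    return labels[k]
```

Applied to per-word feature vectors labeled as correct detections or insertions, the trained codebook then acts as the adaptively trained classifier described above.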
Authors:
Giulia Bernardis, Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) (Switzerland)
Hervé Bourlard, Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) - Swiss Federal Institute of Technology (EPFL) (Switzerland)
Page (NA) Paper number 318
Abstract:
In this paper we define and investigate a set of confidence measures
based on hybrid Hidden Markov Model/Artificial Neural Network acoustic
models. These measures use the neural network to estimate the
local phone posterior probabilities, which are then combined and normalized
in different ways. Experimental results will show that the use of an
appropriate duration normalization is very important to obtain good
estimates of the phone and word confidences. The different measures
are evaluated at the phone and word levels on both isolated word (PHONEBOOK)
and continuous speech (BREF) recognition tasks. It will be shown that
one of those confidence measures is well suited for utterance verification,
and that (as one could expect) confidence measures at the word level
perform better than those at the phone level. Finally, using the resulting
approach on PHONEBOOK to rescore the N-best list is shown to yield
a 34% decrease in word error rate.
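A simple form of duration-normalized confidence can be sketched as averaged log posteriors; the per-frame values below are invented stand-ins for neural network outputs, not the paper's actual normalization. The sketch illustrates why normalization matters: without dividing by the number of frames, longer phones and words accumulate more negative log probability and spuriously look less confident.

```python
import math

def phone_confidence(frame_posteriors):
    """Duration-normalized phone confidence: average log posterior of
    the hypothesized phone over the frames aligned to it."""
    return sum(math.log(p) for p in frame_posteriors) / len(frame_posteriors)

def word_confidence(phone_segments):
    """Word confidence: average of per-phone confidences, so a word is
    not penalized simply for containing more phones or frames."""
    return sum(phone_confidence(s) for s in phone_segments) / len(phone_segments)
```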
Authors:
Javier Caminero, Telefonica I+D (Spain)
Eduardo López, ETSIT-UPM (Spain)
Luis A. Hernández, ETSIT-UPM (Spain)
Page (NA) Paper number 440
Abstract:
Many spontaneous-dialogue recognition applications, such as home banking,
require reliable recognition of long numbers to complete a request from
the user. Rejection and Utterance Verification (UV) are difficult problems
in these applications. In this contribution we improve our previously
proposed UV procedure in order to increase the correction of recognition
errors, to resolve grammatical ambiguities in the user's input, and to
make the rejection of misrecognized or out-of-vocabulary (OOV) utterances
more efficient. Besides its verification performance, the proposed
algorithm complies with the real-time constraints that are mandatory
in real applications. We evaluate our method and present recognition
results from the long natural number recognition task of a real data-driven
application over the telephone line in a multilingual environment.
Experimental results show that the proposed method obtains a significant
reduction in recognition errors and achieves an extraordinarily low
false acceptance rate in all cases for the different languages.
Authors:
Berlin Chen, Institute of Information Science, Academia Sinica (Taiwan)
Hsin-Min Wang, Institute of Information Science, Academia Sinica (Taiwan)
Lee-Feng Chien, Institute of Information Science, Academia Sinica (Taiwan)
Lin-Shan Lee, The Department of CSIE, National Taiwan University (Taiwan)
Page (NA) Paper number 305
Abstract:
In this paper, we propose an A*-admissible key-phrase spotting framework,
which needs little domain knowledge and is capable of extracting salient
key-phrase fragments from an input utterance in real-time. There are
two key features in our approach. Firstly, the acoustic models and
the search framework are specially designed such that a very high degree
of vocabulary flexibility can be achieved for any desired application
task. Secondly, the search framework uses an efficient two-pass A*
search to generate N-best key-phrase candidates and then several sub-syllable
level verification functions are properly weighted and used to further
improve the recognition accuracy. Experimental results show that
A*-admissible key-phrase spotting with sub-syllable-level utterance
verification outperforms the baseline methods used in common approaches.
Authors:
Volker Fischer, IBM Speech Systems, European Speech Research (Germany)
Yuqing Gao, IBM Research, Human Language Technologies (USA)
Eric Janke, IBM United Kingdom Laboratories (U.K.)
Page (NA) Paper number 233
Abstract:
Large vocabulary continuous speech recognition systems show a significant
decrease in performance if a user's pronunciation differs greatly from
those observed during system training. This can be considered the
main reason why most commercially available systems recommend - if
not enforce - that the individual end user read an enrollment script
for the speaker-dependent reestimation of acoustic model parameters.
Thus, the improvement of recognition rates for dialect speakers is
an important issue with respect both to broader acceptance and to
more convenient or natural use of such systems. This paper compares
different techniques that aim at better speaker-independent recognition
of dialect speech in a large vocabulary continuous speech recognizer.
The methods discussed comprise Bayesian adaptation and speaker clustering
techniques and deal with both the availability and absence of dialect
training material. Results are given for a case study that aims at
improving a German speech recognizer for Austrian speakers.
Authors:
Asela Gunawardana, Microsoft Research (USA)
Hsiao-Wuen Hon, Microsoft Research (USA)
Li Jiang, Microsoft Research (USA)
Page (NA) Paper number 401
Abstract:
Word level confidence measures are of use in many areas of speech recognition.
Comparing the hypothesized word score to the score of a 'filler' model
has been the most popular confidence measure because it is highly efficient,
and does not require a large amount of training data. This paper explores
an extension of this technique which also compares the hypothesized
word score to the scores of words that are commonly confused for it,
while maintaining efficiency and the low demand for training data.
The proposed method gives a 39% relative reduction in false accept rate
over the 'filler'-model baseline, at a false reject rate of 5%.
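The idea can be sketched as a log-likelihood-ratio test against the best competing explanation of the same acoustic segment. The scores below are made-up log-likelihoods, and the default threshold is illustrative, not taken from the paper.

```python
def confidence(word_score, filler_score, competitor_scores):
    """Score the hypothesized word against the best competing account of
    the segment: the filler model or any word commonly confused with the
    hypothesis.  All inputs are (made-up) log-likelihoods."""
    best_alternative = max([filler_score] + competitor_scores)
    return word_score - best_alternative

def accept(word_score, filler_score, competitor_scores, threshold=0.0):
    """Accept the word when its score beats every alternative by more
    than the threshold."""
    return confidence(word_score, filler_score, competitor_scores) > threshold
```

With no competitors the measure reduces to the plain filler-model ratio; adding the confusable-word scores tightens the test exactly where the filler model is too easy to beat.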
Authors:
Sunil K. Gupta, Bell Laboratories - Lucent Technologies (USA)
Frank K. Soong, Bell Laboratories - Lucent Technologies (USA)
Page (NA) Paper number 1040
Abstract:
In this paper, we propose to use an utterance length (duration) dependent
threshold for rejecting an unknown input utterance with a general speech
(garbage) model. A general speech model, compared with more sophisticated
anti-subword models, is a more viable solution to the utterance rejection
problem for low-cost applications with stringent storage and computational
constraints. However, the rejection performance using such a general
model with a fixed, universal rejection threshold is in general worse
than that of anti-models, which have higher discrimination. Without adding complexity
to the rejection algorithm, we propose to vary the rejection threshold
according to the utterance length. The experimental results show that
significant improvement in rejection performance can be obtained by
using the proposed, length dependent rejection threshold over a fixed
threshold. We investigate utterance rejection in a command phrase recognition
task. The equal error rate, a good figure of merit for calibrating
the performance of utterance verification algorithms, is reduced by
almost 23% when the proposed length dependent threshold is used.
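A minimal sketch of length-dependent thresholding follows. The linear dependence of the threshold on frame count, its direction, and the constants are all assumptions for illustration; the paper's actual functional form may differ.

```python
def length_dependent_threshold(num_frames, a=-0.05, b=0.002):
    """Assumed linear model: the rejection threshold on the per-frame
    log-likelihood ratio varies with utterance length (a, b invented)."""
    return a + b * num_frames

def reject(keyword_score, garbage_score, num_frames):
    """Reject when the score against the general speech (garbage) model,
    normalized per frame, falls below the length-dependent threshold."""
    llr = (keyword_score - garbage_score) / num_frames  # per-frame LLR
    return llr < length_dependent_threshold(num_frames)
```

The point of the scheme is that nothing is added to the rejection algorithm itself; only the comparison constant becomes a function of utterance length.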
Authors:
Ching Hsiang Ho, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Aimin Chen, The Queen's University of Belfast (Ireland)
Page (NA) Paper number 370
Abstract:
This paper presents a Bayesian constrained frequency warping technique.
The Bayesian approach provides for the inclusion of prior information
about the frequency warping parameter and for adjusting the search range
in order to obtain the best HMM-dependent warping factor. We introduce
novel frequency-warped (FWP) HMMs, which are differently warped versions
of the HMMs. Instead of frequency warping the input speech, we warp the
spectra of the HMMs. This is equivalent to HMMs which have both time
and frequency warping capabilities. Experimentally, FWP HMMs outperform
the conventional constrained frequency warping approach. Furthermore,
the best warping factor is estimated in two stages, a coarse stage
followed by a fine stage. This method efficiently finds the optimal
warping factor and normalises the FWP HMMs.
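The two-stage estimation can be sketched as a coarse grid search followed by a fine search around the coarse optimum. The scoring function, warp range, and step sizes here are stand-ins; in the paper the score would be the HMM likelihood of the warped models.

```python
def two_stage_search(score, coarse_step=0.05, fine_step=0.01,
                     lo=0.8, hi=1.2):
    """Coarse-to-fine search for the warp factor maximizing `score`."""
    def grid(a, b, step):
        pts, x = [], a
        while x <= b + 1e-9:
            pts.append(round(x, 4))  # round away float accumulation error
            x += step
        return pts
    # coarse stage over the full (assumed) warp range
    alpha = max(grid(lo, hi, coarse_step), key=score)
    # fine stage in a window around the coarse optimum
    return max(grid(alpha - coarse_step, alpha + coarse_step, fine_step),
               key=score)
```

Compared with a single fine grid over the whole range, the two stages evaluate far fewer warp factors while still landing on the fine-grid optimum when the score is smooth.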
Authors:
Masaki Ida, OMRON Corporation (Japan)
Ryuji Yamasaki, OMRON Corporation (Japan)
Page (NA) Paper number 159
Abstract:
In this paper, we describe our effort in developing a new false alarm
rejection method for the keyword spotting speech recognition system
that we developed about a year ago. This false alarm rejection
uses prosodic similarities and works on a posterior rescoring basis.
Keyword spotting always suffers from false alarms. Here, we
propose a technique to reject those false alarms using prosodic features.
In Japanese, prosodic information is expressed through intonation,
whereas many other languages use stress accents. It is therefore
straightforward to compute prosodic information from the fundamental
frequency, so called F0, in our language. In our new keyword spotting
engine, the result is obtained by combining two scores: a phonetic score
calculated by the front engine, and a pitch score calculated by the post
engine described in this paper. We have achieved a 13-point improvement
in keyword recognition accuracy using this method.
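The combination of the two scores can be sketched as a weighted sum; the interpolation weight, the score scales, and the example hypotheses are assumptions, not the engine's actual values.

```python
def combined_score(phonetic_score, pitch_score, w=0.7):
    """Combine the front engine's phonetic score with the post engine's
    prosodic (F0-based) score; w is an assumed interpolation weight."""
    return w * phonetic_score + (1.0 - w) * pitch_score

def best_keyword(hypotheses, w=0.7):
    """Pick the keyword hypothesis with the best combined score.
    Each hypothesis: (keyword, phonetic_score, pitch_score)."""
    return max(hypotheses, key=lambda h: combined_score(h[1], h[2], w))[0]
```

In this rescoring view, a spotted candidate with a strong phonetic score but a pitch contour unlike the keyword's expected intonation loses to a candidate whose prosody also fits.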
Authors:
Dieu Tran, NEC Corporation (Japan)
Ken-ichi Iso, NEC Corporation (Japan)
Page (NA) Paper number 1089
Abstract:
In this paper, we propose a Predictive Speaker Adaptation (PSA) technique
in which the speaker-dependent HMM (SD-HMM) for a new speaker is predicted
using adaptation utterances and a speaker-independent HMM (SI-HMM).
The method requires prior training in order to estimate the parameters
of the prediction function. For this purpose, we first prepare many
speakers' fully trained SD-HMMs and their adaptation utterances (the same
for all speakers). In addition, many speaker-specific BW-HMMs are built
from the SI-HMM by means of Baum-Welch re-estimation on the adaptation
utterances. The model pair SD-HMM and BW-HMM for each speaker is used
as a training example for the input and output of the prediction function,
to find the speaker-independent prediction parameters. During adaptation,
the new speaker's SD-HMM is estimated from his BW-HMM
with the predetermined parameters. 60,000-word recognition experiments
showed a word error-rate reduction of 16% when only 10 adaptation
words were used.
Authors:
Rachida El Méliani, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)
Page (NA) Paper number 837
Abstract:
First, unlike other teams, we choose to represent in-vocabulary words
and out-of-vocabulary words with the same set of subword HMMs. Second,
we replace the classical one-phoneme transcription of fillers in the
lexicon with a new, more powerful one-syllable transcription. As for the
language model, the problem produced, in the case of unlimited-vocabulary
continuous-speech recognition, by the lack of information on new words
in the training corpus is solved through the use of the limited
information we gathered on new words. The results obtained in general-task
keyword spotting as well as unlimited-vocabulary continuous-speech
recognition demonstrate the effectiveness of the choice of a one-syllable
transcription rather than a one-phoneme one. As for the results in
unlimited-vocabulary continuous-speech recognition, the language model
using information from words of frequency one is shown to be a promising
new method of determining a language model for new words.
Authors:
Christine Pao, MIT Lab for Computer Science (USA)
Philipp Schmid, MIT Lab for Computer Science (USA)
James R. Glass, MIT Lab for Computer Science (USA)
Page (NA) Paper number 392
Abstract:
This research investigates the use of utterance-level features for
confidence scoring. Confidence scores are used to accept or reject
user utterances in our conversational weather information system.
We have developed an automatic labeling algorithm based on a semantic
frame comparison between recognized and transcribed orthographies.
We explore recognition-based features along with semantic, linguistic,
and application-specific features for utterance rejection. Discriminant
analysis is used in an iterative process to select the best set of
classification features for our utterance rejection sub-system. Experiments
show that we can correctly reject over 60% of incorrectly understood
utterances while accepting 98% of all correctly understood utterances.
Authors:
Bhuvana Ramabhadran, IBM T. J. Watson Research Center (USA)
Abraham Ittycheriah, IBM T. J. Watson Research Center (USA)
Page (NA) Paper number 534
Abstract:
Phonetic baseforms are the basic recognition units in most speech recognition
systems. These baseforms are usually determined by linguists once a
vocabulary is chosen and not modified thereafter. However, several
applications, such as name dialing, require that the user be able to add
new words to the vocabulary. These new words are often names, or task-specific
jargon, that have user-specific pronunciations. This paper describes
a novel method for generating phonetic transcriptions (baseforms) of
words based on acoustic evidence alone. It does not require any prior
acoustic representation of the new word, is vocabulary independent,
and uses phonological rules in a post processing stage to enhance the
quality of the baseforms thus produced. Our experiments demonstrate
the high decoding accuracies obtained when baseforms deduced using
this approach are incorporated into our speech recognizer. Our experiments
also compare the use of acoustic models trained on task-specific data
with models trained for general purposes (digit, name, and large-vocabulary
recognition, etc.) for generating phonetic transcriptions.
Authors:
Anand R. Setlur, Lucent Technologies (USA)
Rafid A. Sukkar, Lucent Technologies (USA)
Page (NA) Paper number 168
Abstract:
In this paper, we present a word counting method that enables speech
recognition systems to perform reliable barge-in detection and also
make a fast and accurate determination of end of speech. This is achieved
by examining partial recognition hypotheses and imposing certain "word
stability" criteria. Typically, a voice activity detector is used for
both barge-in detection and end of speech determination. We propose
augmenting the voice activity detector with this more reliable recognition-based
method. Experimental results for a connected digit task show that this
approach is more robust for supporting barge-in since it is less prone
to interrupting the announcement when extraneous speech input is encountered.
Also, by using the early endpoint decision criterion, average response
times are sped up 75% for this connected digit task.
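The "word stability" idea can be sketched as follows: track the partial hypothesis after each recognition update and declare end of speech once the hypothesized word string has stayed unchanged for a run of consecutive partial results. The stability window is an assumed parameter standing in for the paper's actual criteria.

```python
class StabilityEndpointer:
    """Declare end of speech once the partial recognition hypothesis has
    been identical for `window` consecutive partial results (an assumed
    stand-in for the word-stability criteria described above)."""

    def __init__(self, window=5):
        self.window = window
        self.last = None
        self.stable_count = 0

    def update(self, partial_words):
        """Feed one partial hypothesis (sequence of words); return True
        when end of speech should be declared."""
        words = tuple(partial_words)
        if words and words == self.last:
            self.stable_count += 1
        else:
            self.stable_count = 0  # hypothesis still changing: keep listening
        self.last = words
        return self.stable_count >= self.window
```

Because the decision rests on the recognizer's own hypotheses rather than raw energy, extraneous speech that never stabilizes into a valid word string does not trigger an interruption, which is the robustness advantage claimed over a plain voice activity detector.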
Authors:
Martin Westphal, Interactive Systems Labs (Germany)
Tanja Schultz, Interactive Systems Labs (Germany)
Alex Waibel, Interactive Systems Labs (USA)
Page (NA) Paper number 755
Abstract:
In Vocal Tract Length Normalization (VTLN) a linear or nonlinear frequency
transformation compensates for different vocal tract lengths. Finding
good estimates for the speaker specific warp parameters is a critical
issue. Despite good results using the Maximum Likelihood criterion
to find parameters for a linear warping, there are concerns about using
this method. We searched for a new criterion that enhances the interclass
separability in addition to optimizing the distribution of each phonetic
class. Using such a criterion Linear Discriminant Analysis determines
a linear transformation in a lower dimensional space. For VTLN, we
keep the dimension constant and warp the training samples of each speaker
such that the Linear Discriminant is optimized. Although this criterion
depends on all training samples of all speakers, it can iteratively
provide speaker-specific warp factors. We discuss how this approach
can be applied in speech recognition and present first results on two
different recognition tasks.
Authors:
Gethin Williams, University of Sheffield (U.K.)
Steve Renals, University of Sheffield (U.K.)
Page (NA) Paper number 644
Abstract:
In this paper we define a number of confidence measures derived from
an acceptor HMM and evaluate their performance for the task of utterance
verification using the North American Business News (NAB) and Broadcast
News (BN) corpora. Results are presented for decodings made at both
the word and phone level which show the relative profitability of rejection
provided by the diverse set of confidence measures. The results indicate
that language model dependent confidence measures have reduced performance
on BN data relative to that for the more grammatically constrained
NAB data. An explanation linking the observations that rejection is
more profitable for noisy acoustics, for a reduced vocabulary and at
the phone level is also given.
Authors:
Chung-Hsien Wu, National Cheng Kung University (China)
Yeou-Jiunn Chen, National Cheng Kung University (China)
Yu-Chun Hung, National Cheng Kung University (China)
Page (NA) Paper number 218
Abstract:
In this paper a fuzzy search algorithm is proposed to deal with
recognition errors in telephone speech. Since prosodic information
is a very special and important feature of Mandarin speech, we integrate
it into keyword verification. For multi-keyword
detection, we define a keyword relation and a weighting function for
reasonable keyword combinations. In the keyword recognizer, 94 INITIAL
and 38 FINAL context-dependent Hidden Markov Models (HMM's) are used
to construct the phonetic recognizer. For prosodic verification, a
total of 175 context-dependent HMM's and five anti-prosodic HMM's are
used. In this system, 1275 faculty names and department names are selected
as the keywords. Using a test set of 3595 conversational speech utterances
from 37 speakers (21 male, 16 female), the proposed fuzzy search algorithm
and prosodic verification can reduce the error rate from 17.64% to
11.29% for multiple keywords embedded in non-keyword speech.
Authors:
Yoichi Yamashita, Dep. of Computer Science, Ritsumeikan University (Japan)
Toshikatsu Tsunekawa, I.S.I.R., Osaka University (Japan)
Riichiro Mizoguchi, I.S.I.R., Osaka University (Japan)
Page (NA) Paper number 23
Abstract:
This paper describes topic identification for Japanese TV news speech
based on the keyword spotting technique. Three thousand nouns
are selected as keywords that contribute to topic identification,
based on a criterion of mutual information and word length.
This set of keywords identified the correct topic for 76.3% of
articles from newspaper text data. Further, we performed keyword spotting
on TV news speech and identified the topics of the spoken message
by calculating the likelihoods of the topics in terms of the acoustic
score of each spotted word and the topic probability of the word. In
order to neutralize the effect of false alarms, the bias of the topics
in the keyword set is removed. The topic identification rate is 66.5%,
counting identification as correct if the correct topic is included in
the top three topics. The removal of the bias improved the identification
rate by 6.1%.
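The keyword-selection criterion can be sketched as scoring each noun by the mutual information between its occurrence and the topic label over a training corpus, then keeping the highest-scoring nouns. The document representation and counts below are toy assumptions (and the word-length part of the criterion is omitted).

```python
import math

def mutual_information(docs):
    """docs: list of (topic, set_of_nouns) pairs.  Returns a dict mapping
    each noun to the mutual information (in nats) between the noun's
    presence/absence in a document and the document's topic."""
    n = len(docs)
    topic_count, word_count, joint = {}, {}, {}
    for topic, nouns in docs:
        topic_count[topic] = topic_count.get(topic, 0) + 1
        for w in set(nouns):
            word_count[w] = word_count.get(w, 0) + 1
            joint[(w, topic)] = joint.get((w, topic), 0) + 1
    mi = {}
    for w, nw in word_count.items():
        total = 0.0
        for t, nt in topic_count.items():
            with_w = joint.get((w, t), 0)
            for present, count in ((1, with_w), (0, nt - with_w)):
                if count == 0:
                    continue  # zero cells contribute nothing to the sum
                p_joint = count / n
                p_w = (nw if present else n - nw) / n
                p_t = nt / n
                total += p_joint * math.log(p_joint / (p_w * p_t))
        mi[w] = total
    return mi
```

A noun that occurs in documents of only one topic gets maximal mutual information, while a noun spread evenly across topics scores near zero and is dropped from the keyword set.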