Authors:
Akio Ando, NHK Sci. & Tech. Res. Labs. (Japan)
Akio Kobayashi, NHK Sci. & Tech. Res. Labs. (Japan)
Toru Imai, NHK Sci. & Tech. Res. Labs. (Japan)
Page (NA) Paper number 16
Abstract:
This paper describes a thesaurus-based class n-gram model for broadcast
news transcription. The most important issue for class n-gram
models is how to develop the word classification. We construct a word
classification mapping based on a thesaurus so as to maximize the average
mutual information function on a training corpus. To examine the effectiveness
of the new method, we compare it with two of our previous methods, in
which the same thesaurus is used but the word-class mappings are determined
in different ways. The new method achieved substantially lower
perplexity for 83 news transcription sentences broadcast on June 4,
1996.
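The average mutual information criterion mentioned above can be sketched as follows. This is an illustrative toy computation, not the authors' thesaurus-based construction: the corpus and the two class mappings are invented examples.

```python
from collections import Counter
from math import log

def average_mutual_information(corpus, word2class):
    """Average mutual information between adjacent class labels:
    sum over class pairs of P(c1,c2) * log[P(c1,c2) / (P(c1)*P(c2))]."""
    classes = [word2class[w] for w in corpus]
    pairs = list(zip(classes, classes[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    left = Counter(c1 for c1, _ in pairs)   # marginal of the first position
    right = Counter(c2 for _, c2 in pairs)  # marginal of the second position
    ami = 0.0
    for (c1, c2), count in pair_counts.items():
        p12 = count / n
        ami += p12 * log(p12 * n * n / (left[c1] * right[c2]))
    return ami

# A mapping that separates the two alternating word groups scores higher
# than one that collapses every word into a single class.
corpus = ["a", "x", "b", "y"] * 50
split = {"a": 0, "b": 0, "x": 1, "y": 1}
merged = {"a": 0, "b": 0, "x": 0, "y": 0}
```

A word-classification search would then compare candidate mappings by this score and keep the one with the larger value.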
Authors:
Sreeram V. Balakrishnan, Motorola Lexicus Division (USA)
Page (NA) Paper number 295
Abstract:
As speech recognition systems are increasingly applied to real world
problems, it is often desirable to use the same recognition engine
for a variety of tasks of differing complexity. This paper explores
the relationship between the complexity of the recognition task and
the best strategies for pruning the recognition search space. We examine
two types of task: 20000 word WSJ dictation, and phone book access
using a 60 word grammar. For both tasks we compare two strategies
for pruning the search space: absolute pruning, where the number of
hypotheses is controlled by eliminating those whose scores fall more
than a fixed beamwidth below the best-scoring hypothesis, and rank-based
pruning, where hypotheses are ranked by score and all hypotheses
beneath a certain rank are eliminated. We present statistics characterizing
the behaviour of the recognizer under different pruning strategies
and show how the strategies affect error-rates.
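The two pruning strategies compared in this abstract can be sketched directly; the hypothesis labels and log-scores below are made-up examples, not data from the paper.

```python
def absolute_prune(hyps, beamwidth):
    """Absolute pruning: drop hypotheses scoring more than `beamwidth`
    below the best hypothesis (scores are log-probabilities)."""
    best = max(score for _, score in hyps)
    return [(h, s) for h, s in hyps if s >= best - beamwidth]

def rank_prune(hyps, max_rank):
    """Rank-based pruning: keep only the `max_rank` best-scoring hypotheses."""
    return sorted(hyps, key=lambda hs: hs[1], reverse=True)[:max_rank]

# Invented example: four active hypotheses with log-scores.
hyps = [("hyp A", -1.0), ("hyp B", -3.0), ("hyp C", -10.0), ("hyp D", -2.0)]
```

Note the different behavior: absolute pruning keeps a variable number of hypotheses depending on how scores cluster, while rank-based pruning keeps a fixed number regardless of the score spread.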
Authors:
Dhananjay Bansal, SCS, Carnegie Mellon University (USA)
Mosur K. Ravishankar, SCS, Carnegie Mellon University. (USA)
Page (NA) Paper number 829
Abstract:
In this paper we describe two new confidence measures for estimating
the reliability of speech-to-text output: "Likelihood Dependence" and
"Neighborhood Dependence". Each word in the speech-to-text output for
a given utterance is annotated with these two measures. Likelihood
dependence for a given word occurrence indicates how critical that
word is to the overall utterance likelihood, i.e., how much worse the
likelihood of the next best utterance becomes if that word is eliminated
from the recognition. Neighborhood dependence measures how stable a
given word is when neighboring words are changed in the recognition.
We show that correct and incorrect words in the recognition behave
significantly differently with respect to these measures. We also show
that on the broadcast news task they perform better than some of the
existing, commonly used measures.
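One reading of "likelihood dependence" can be sketched over an N-best list: for each word in the top hypothesis, measure the score drop to the best hypothesis that omits it. This is an illustrative simplification, not the authors' exact computation, and the hypotheses and scores are invented.

```python
def likelihood_dependence(nbest):
    """For each word in the top hypothesis, the drop in log-likelihood to the
    best competing hypothesis that does not contain that word.
    `nbest` is a list of (word_list, log_score) pairs sorted best-first."""
    best_words, best_score = nbest[0]
    dep = {}
    for w in best_words:
        rivals = [s for words, s in nbest[1:] if w not in words]
        # No rival omits the word: the word is maximally stable here.
        dep[w] = best_score - max(rivals) if rivals else float("inf")
    return dep
```

Words with a large dependence value are critical to the utterance likelihood and, per the abstract, behave differently when correct versus incorrect.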
Authors:
Jerome R. Bellegarda, Apple Computer, Inc. (USA)
Page (NA) Paper number 134
Abstract:
The goal of multi-span language modeling is to integrate the various
constraints, both local and global, that are present in the language.
In this paper, local constraints are captured via the usual n-gram
approach, while global constraints are taken into account through the
use of latent semantic analysis. An integrative formulation is derived
for the combination of these two paradigms, resulting in an entirely
data-driven, multi-span framework for large vocabulary speech recognition.
Because of the inherent complementarity in the two types of constraints,
the performance of the integrated language model compares favorably
with the corresponding n-gram performance. On a subset of the Wall
Street Journal speaker-independent, 20,000-word vocabulary, continuous
speech task, we observed a reduction in perplexity of about 25%, and
a reduction in average error rate of about 15%.
Authors:
Rathinavelu Chengalvarayan, Lucent Technologies (USA)
Page (NA) Paper number 21
Abstract:
Previous studies have shown that significantly enhanced recognition performance
can be achieved by incorporating information about HMM duration along
with the cepstral parameters. The reestimation formulas for the duration
parameters have previously been derived using fixed segmentation during
K-means training, and the duration statistics were then held fixed throughout
the additional minimum string error (MSE) training process. In this
study, we update the duration parameters along with the other model parameters
during the discriminative training iterations. The convergence property
of the training procedure based on the MSE approach is investigated,
and experimental results on a wireline connected-digit recognition task
demonstrate a 6% word error rate reduction when the newly trained
duration model parameters are used instead of duration parameters held fixed
during MSE training.
Authors:
Noah Coccaro, University of Colorado at Boulder: Department of Computer Science (USA)
Daniel Jurafsky, University of Colorado at Boulder: Departments of Linguistics and Computer Science (USA)
Page (NA) Paper number 852
Abstract:
We introduce a number of techniques designed to help integrate semantic
knowledge with N-gram language models for automatic speech recognition.
Our techniques allow us to integrate Latent Semantic Analysis (LSA),
a word-similarity algorithm based on word co-occurrence information,
with N-gram models. While LSA is good at predicting content words
which are coherent with the rest of a text, it is a bad predictor of
frequent words, has a low dynamic range, and is inaccurate when combined
linearly with N-grams. We show that modifying the dynamic range, applying
a per-word confidence metric, and using geometric rather than linear
combinations with N-grams produces a more robust language model which
has a lower perplexity on a Wall Street Journal test-set than a baseline
N-gram model.
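The linear-versus-geometric combination contrast can be sketched on two toy distributions over a shared vocabulary. This is a minimal illustration of the combination rules only; in a real language model the geometric combination must be renormalised over the full vocabulary, as done here on a two-word vocabulary.

```python
def linear_combine(p_ngram, p_lsa, lam=0.5):
    """Linear interpolation of two distributions (dicts over one vocabulary)."""
    return {w: lam * p_ngram[w] + (1 - lam) * p_lsa[w] for w in p_ngram}

def geometric_combine(p_ngram, p_lsa, lam=0.5):
    """Weighted geometric mean of two distributions, renormalised to sum to one."""
    raw = {w: p_ngram[w] ** lam * p_lsa[w] ** (1 - lam) for w in p_ngram}
    z = sum(raw.values())
    return {w: p / z for w, p in raw.items()}

# Invented example distributions: the n-gram model is confident, LSA is flat.
ngram = {"a": 0.8, "b": 0.2}
lsa = {"a": 0.5, "b": 0.5}
```

The geometric combination behaves like a product of experts: a flat LSA prediction dilutes the n-gram's preference less than linear averaging does, which is one reason the geometric form can be more robust.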
Authors:
Julio Pastor, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
José Colás, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
Rubén San-Segundo, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
José Manuel Pardo, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
Page (NA) Paper number 1108
Abstract:
Tag definition in stochastic language models (n-grams and n-pos) is
based on grouping together words with similar right and left context
behavior. A modification of the n-gram model using multi-tagged words
and unsupervised clustering was already introduced for French with
a corpus of millions of non-tagged words. We present a variation of the
bi-pos language model in which two tag sets are defined and assigned to
each word (multi-tagged model) using grammatical information. Each
tag set is based on a different context behavior. We use linguistic
expert knowledge and a simple automatic clustering procedure to obtain
groups of words with similar left-context behavior (first set of tags)
and with similar right-context behavior (second set of tags). We propose
a grammatically based model that is useful when no large text corpus is
available; a performance increase was observed when multi-tagged words
were used, owing to their better adaptation to the language.
Authors:
Vassilis Digalakis, Technical University of Crete (Greece)
Leonardo Neumeyer, SRI International (USA)
Manolis Perakakis, Technical University of Crete (Greece)
Page (NA) Paper number 940
Abstract:
We follow the paradigm that we previously introduced for encoding
the recognizer parameters in a client-server model used for recognition
over wireless networks and the WWW, aiming to maximize recognition
performance rather than perceptual reproduction quality. We present a new encoding
scheme for the mel frequency-warped cepstral parameters (MFCCs) that
uses product-code vector quantization, and we find that the required
bit rate to achieve the recognition performance of high-quality unquantized
speech is just 2000 bits per second. We also investigate the effect
of additive noise on the recognition performance when quantized features
are used, and we find that a small increase in the bit rate can provide
the necessary robustness.
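Product-code vector quantization of a cepstral vector can be sketched as follows: the vector is split into sub-vectors, each quantized against its own small codebook. The codebooks below are hand-picked toy values; real codebooks would be trained (e.g. by k-means) on MFCC data.

```python
def quantise(vector, codebooks):
    """Product-code VQ: split the vector into equal-length sub-vectors and
    return the nearest-codeword index for each sub-vector."""
    dim = len(vector) // len(codebooks)
    indices = []
    for i, book in enumerate(codebooks):
        sub = vector[i * dim:(i + 1) * dim]
        # Nearest codeword by squared Euclidean distance.
        idx = min(range(len(book)),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, book[k])))
        indices.append(idx)
    return indices

def reconstruct(indices, codebooks):
    """Concatenate the selected codewords back into a full feature vector."""
    out = []
    for idx, book in zip(indices, codebooks):
        out.extend(book[idx])
    return out
```

The bit rate is then the sum over sub-vectors of log2(codebook size) bits per frame, which is how a budget such as 2000 bits per second is allocated across the feature dimensions.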
Authors:
Bernard Doherty, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Paul McCourt, The Queen's University of Belfast (Ireland)
Page (NA) Paper number 323
Abstract:
This paper presents a novel method for modeling phonetic context using
linear context transforms. Initial investigations have shown the feasibility
of synthesising context dependent models from context independent models
through weighted interpolation of the peripheral states of a given
hidden Markov model with its adjacent model. This idea can be further
extended to maximum likelihood estimation of not only single weights
but a matrix of weights, i.e. a transform. This paper outlines the application
of Maximum Likelihood Linear Regression (MLLR) as a means of modeling
context dependency in continuous density Hidden Markov Models (HMM).
Authors:
Michael T. Johnson, Purdue University (USA)
Mary P. Harper, Purdue University (USA)
Leah H. Jamieson, Purdue University (USA)
Page (NA) Paper number 871
Abstract:
The research presented here focuses on implementation and efficiency
issues associated with the use of word graphs for interfacing acoustic
speech recognition systems with natural language processing systems.
The effectiveness of various pruning methods for graph construction
is examined, as well as techniques for word graph compression. In
addition, the word graph representation is compared to another predominant
interface method, the N-best sentence list.
Authors:
Photina Jaeyoun Jang, School of Computer Science, Carnegie Mellon University (USA)
Alexander G. Hauptmann, School of Computer Science, Carnegie Mellon University (USA)
Page (NA) Paper number 934
Abstract:
We propose an unsupervised learning algorithm that learns hierarchical
patterns of word sequences in spoken language utterances. It extracts
cluster rules from training data based on high n-gram language model
probabilities to cluster words or segment a sentence. Cluster trees,
similar to parse trees, are constructed from the learned cluster rules.
This hierarchical clustering adds grammatical structure onto a traditional
trigram language model. The learned cluster rules are used to rescore
and improve the n-best utterance hypothesis list which is output by
a speech recognizer based on acoustic and trigram language model scores.
Our hierarchical cluster language model was trained on TREC broadcast
news data from 1995 and 1996, and reduced word error rate on the HUB-4
1997 broadcast news development set by 0.3% absolute. Prior symbolic
knowledge in the form of rules can also be incorporated by simply applying
the rules to the training data before the applicable learning iteration.
Authors:
Atsuhiko Kai, Toyohashi University of Technology (Japan)
Yoshifumi Hirose, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)
Page (NA) Paper number 785
Abstract:
In this study, we investigate the effectiveness of an unknown word
processing (UWP) algorithm, which is incorporated into an N-gram language
model based speech recognition system to deal with filled pauses
and out-of-vocabulary (OOV) words. We have previously investigated
the effect of the UWP algorithm, which utilizes a simple subword sequence
decoder, in a spoken dialog system using a context-free grammar (CFG)
as the language model. Here, the effect of the UWP algorithm was investigated
using an N-gram based continuous speech recognition system on both a small
dialog task and a large-vocabulary read-speech dictation task. The
experimental results showed that UWP improves the recognition accuracy
and that an N-gram based system with UWP can improve the understanding
performance compared with a CFG-based system.
Authors:
Tetsunori Kobayashi, Waseda University (Japan)
Yosuke Wada, Waseda University (Japan)
Norihiko Kobayashi, Waseda University (Japan)
Page (NA) Paper number 708
Abstract:
Information source extension is utilized to improve the language model
for large vocabulary continuous speech recognition (LVCSR). McMillan's
theory, which states that source extension brings the model entropy closer
to the true source entropy, implies that a better language model can be
obtained by source extension (creating new units through word concatenation
and using these units for language modeling). In this paper, we examine
the effectiveness of this source extension. We tested two methods
of source extension: frequency-based extension and entropy-based extension.
We evaluated the effect in terms of perplexity and recognition accuracy
using Mainichi newspaper articles and the JNAS speech corpus. As a result,
the bigram perplexity was improved from 98.6 to 70.8 and the trigram
perplexity from 41.9 to 26.4. The bigram-based recognition accuracy
was improved from 79.8% to 85.3%.
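Frequency-based source extension can be sketched as merging the most frequent adjacent word pairs into new compound units; the toy corpus and the greedy left-to-right merge below are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def extend_source(corpus, top_k=1):
    """Frequency-based extension: merge the top_k most frequent adjacent
    word pairs into single new units (greedy, left to right)."""
    merges = {pair for pair, _ in
              Counter(zip(corpus, corpus[1:])).most_common(top_k)}
    out, i = [], 0
    while i < len(corpus):
        if i + 1 < len(corpus) and (corpus[i], corpus[i + 1]) in merges:
            out.append(corpus[i] + "_" + corpus[i + 1])  # new extended unit
            i += 2
        else:
            out.append(corpus[i])
            i += 1
    return out
```

An n-gram model trained on the extended token stream effectively conditions on longer word histories wherever a merged unit appears, which is the mechanism behind the perplexity reductions reported above.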
Authors:
Akio Kobayashi, NHK Sci. & Tech. Res. Labs. (Japan)
Kazuo Onoe, NHK Sci. & Tech. Res. Labs. (Japan)
Toru Imai, NHK Sci. & Tech. Res. Labs. (Japan)
Akio Ando, NHK Sci. & Tech. Res. Labs. (Japan)
Page (NA) Paper number 973
Abstract:
This paper presents two linguistic techniques to improve broadcast
news transcription. The first one is an adaptation of a language model
which reflects current news content. It is based on a weighted mixture
of long-term news scripts and latest scripts as training data. The
mixture weights are given by the EM algorithm for linear interpolation
and then normalized by their text sizes. Not only n-grams but also
the vocabulary are updated by the latest news. We call it the Time
Dependent Language Model (TDLM). It achieved a 4.4% reduction in perplexity
and 0.7% improvement in word accuracy over the baseline language model.
The second technique is correction of the decoded transcriptions by
their corresponding electronic draft scripts. The corresponding drafts
are found by using a sentence similarity measure between them. Parts
considered to be recognition errors are replaced with text from the original
drafts. This post-correction led to a 6.7% improvement in word accuracy.
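The EM estimation of linear-interpolation mixture weights mentioned above can be sketched for two component models; the per-token probabilities below are invented, and the normalization by text size described in the abstract is omitted for brevity.

```python
def em_interpolation_weight(probs_a, probs_b, iterations=50, lam=0.5):
    """EM estimation of the weight `lam` in the mixture
    lam * P_a(w) + (1 - lam) * P_b(w), given the probability each
    component model assigns to every token of a held-out text."""
    for _ in range(iterations):
        # E-step: posterior that each token was generated by component A.
        posteriors = [lam * a / (lam * a + (1 - lam) * b)
                      for a, b in zip(probs_a, probs_b)]
        # M-step: the new weight is the mean posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam
```

When one component consistently explains the held-out text better, the weight converges toward that component; when the two are symmetric, it settles at 0.5.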
Authors:
Jacques Koreman, University of the Saarland, Institute of Phonetics (Germany)
William J. Barry, University of the Saarland, Institute of Phonetics (Germany)
Bistra Andreeva, University of the Saarland, Institute of Phonetics (Germany)
Page (NA) Paper number 548
Abstract:
Three cross-language ASR experiments which use hidden Markov modelling
are described. Their goal is to process the signal so that its linguistically
relevant properties are better exploited for consonant identification.
Experiment 1 shows that consonant identification improves when vowel
transitions are used. In particular, the consonants' place of articulation
is identified better, because the vowel transitions contain formant
trajectories which depend on the consonant's place of articulation.
Experiment 2 shows that mapping acoustic parameters onto phonetic
features before applying hidden Markov modelling greatly improves consonant
identification rates. In experiment 3, the acoustic parameters from
the vowel transitions are also mapped onto consonantal (not vocalic!)
features ("relational processing"), as are the acoustic parameters
belonging to the consonants. The additional use of vowel transitions
does not further improve consonant identification, however. This is
probably due to undertraining of the vowel transitions in the Kohonen
network.
Authors:
Raymond Lau, MIT Laboratory for Computer Science (USA)
Stephanie Seneff, MIT Laboratory for Computer Science (USA)
Page (NA) Paper number 53
Abstract:
Previously, we introduced the ANGIE framework for modelling speech
where morphological and phonological substructures of words are jointly
characterized by a context-free grammar and represented in a multi-layered
hierarchical structure. We also demonstrated a phonetic recognizer
and a word-spotter based on ANGIE. In this work, we extend ANGIE to
a competitive continuous speech recognition system. Furthermore, given
that ANGIE is based on a context-free framework, we have decided to
combine ANGIE with TINA, a context-free based framework for natural
language understanding, into an integrated system. The integration
led to a 21.7% reduction in word error rate compared to a baseline
word bigram recognizer on ATIS. We also examined the addition of new
words to the vocabulary, an area we believe will benefit from both
ANGIE and the ANGIE-plus-TINA integration. The combination reduced
error rate by 20.8% over the baseline and outperformed several other
configurations tested not involving an integrated ANGIE-plus-TINA.
Authors:
Lalit R. Bahl, I.B.M. (USA)
S. De Gennaro, I.B.M. (USA)
P. De Souza, I.B.M. (USA)
E. Epstein, I.B.M. (USA)
J.M. Le Roux, I.B.M. (USA)
B. Lewis, I.B.M. (USA)
C. Waast, I.B.M. (USA)
Page (NA) Paper number 114
Abstract:
In French, the pronunciations of many words change dramatically depending
on the word immediately preceding them. In an ASR system that does not
model this phenomenon, known as "liaison", the result is a requirement
for unnatural pronunciations and considerable user dissatisfaction.
In this paper, we present the development of an acoustic model which
takes into account the wide variability of word pronunciations caused
by liaison, the integration of this model into a French continuous
speech recognition system, and decoding results.
Authors:
Fu-Hua Liu, IBM Watson Research Center (USA)
Michael Picheny, IBM Watson Research Center (USA)
Page (NA) Paper number 838
Abstract:
In this paper we describe a novel approach to address the issue of
different sampling frequencies in speech recognition. When a recognition
task needs a different sampling frequency from that of the reference
system, it is customary to re-train the system for the new sampling
rate. To circumvent the tedious training process, we propose a new
approach, termed Sampling Rate Transformation (SRT), that performs the
transformation directly on the speech recognition system. By re-scaling
the mel-filter design and filtering the system in the spectral domain,
SRT converts the existing system to the target spectral range. New systems
are obtained without using any data from the test environment. SRT reduces
the word error rate from 29.89% to 18.17% given 11 kHz test data and a
16 kHz SI system; the matched system for 11 kHz has an error rate of 16.17%.
We also examine MLLR and MAP adaptation. The best result from MLLR is
17.92% with 4.5 hours of speech. Similar improvements are also observed
in the speaker adaptation mode.
Authors:
Kristine Ma, GTE/BBN Technologies (USA)
George Zavaliagkos, GTE/BBN Technologies (USA)
Rukmini Iyer, GTE/BBN Technologies (USA)
Page (NA) Paper number 866
Abstract:
In this paper, we address the issue of deriving and using more realistic
pronunciations to represent words spoken in natural conversational
speech. Previous approaches include using automatic phoneme-based rule-learning
techniques, linguistic transformation rules, and a phonetically hand-labelled
corpus to expand the number of pronunciation variants per word. While
rule-based approaches have the advantage of being easily extensible
to infrequent or unobserved words, they suffer from the problem of
over-generalization. Using hand-transcribed data, one can obtain a
more concise set of new pronunciations, but this cannot be extended to
unobserved or infrequently occurring words. In this paper, we adopt
the hand-labelled corpus scheme to improve pronunciations for frequent
multi-word units and single words occurring in the training data, while
using the rule-based techniques to learn pronunciation variants and their
weights for the infrequent words. Furthermore, we experiment with
a new approach for speaker-dependent pronunciation modeling. The newly
expanded dictionaries are evaluated on the Switchboard and Callhome
corpora, giving a slight reduction in word recognition error rate.
Authors:
Sankar Basu, IBM T.J. Watson Research center (USA)
Abraham Ittycheriah, IBM T.J. Watson Research center (USA)
Stéphane Maes, IBM T.J. Watson Research center (USA)
Page (NA) Paper number 983
Abstract:
When a speech signal is shifted by a few samples, we have observed significant
variations in the feature vectors produced by the acoustic front-end.
Furthermore, the shifted utterances, when decoded with a continuous speech
recognition system, lead to dramatically different word error rates.
This paper analyzes the phenomenon and illustrates the well-known result
that classical acoustic front-end processors, including spectrum- and
cepstrum-based techniques, are sensitive to time shifts. After describing
the effect of sample-sized shifts on the spectral estimates of the signal,
we propose several techniques which take advantage of shift variations
to multiply the amount of training material that speech utterances can provide.
Finally, we illustrate how the acoustic front-end can be slightly modified
to render the recognizer invariant to small shifts.
Authors:
José B. Mariño, Universitat Politècnica de Catalunya (Spain)
Pau Pachès-Leal, Universitat Politècnica de Catalunya (Spain)
Albino Nogueiras, Universitat Politècnica de Catalunya (Spain)
Page (NA) Paper number 250
Abstract:
The performances of the demiphone (a context dependent subword unit
that models independently the left and the right parts of a phoneme)
and the triphone are compared. Continuous density hidden Markov modeling
for both types of units is tested with the HTK software using decision-tree
state clustering. The speech material is taken from the SpeechDat Spanish
database, composed of continuous speech utterances recorded through
the public telephone network. The training corpus is speaker- and
task-independent. Two testing sets are tried: isolated words corresponding
to speaker names, city names and phonetically rich words; and numbers
of Spanish identification cards and dates. The main conclusion is that
the demiphone simplifies the recognition system and yields a better
performance than the triphone. This result may be explained by the
ability of the demiphone to provide an excellent tradeoff between a
detailed coarticulation modeling and a proper parameter estimation.
Authors:
Shinsuke Mori, Tokyo Research Laboratory, IBM Japan (Japan)
Masafumi Nishimura, Tokyo Research Laboratory, IBM Japan (Japan)
Nobuyasu Itoh, Tokyo Research Laboratory, IBM Japan (Japan)
Page (NA) Paper number 989
Abstract:
In this paper we describe a word clustering method for class-based
n-gram models. The clustering criterion is the entropy on a corpus
different from the corpus used for n-gram model estimation, and the
search method is based on a greedy algorithm. We applied this method
to the Japanese EDR corpus and the English Penn Treebank corpus. The
perplexities of the word-based n-gram model on the EDR corpus and the
Penn Treebank are 153.1 and 203.5, respectively, while those of the
class-based n-gram model estimated with our method are 146.4 and 136.0,
respectively. These results indicate that our clustering method is better
than Brown's method and Ney's leaving-one-out method.
Authors:
João P. Neto, INESC/IST (Portugal)
Ciro Martins, INESC/IST (Portugal)
Luís B. Almeida, INESC/IST (Portugal)
Page (NA) Paper number 562
Abstract:
Due to the enormous development of large vocabulary, speaker-independent
continuous speech recognition systems, which has occurred essentially
for US English, there is a large demand for this kind of system for
other languages. In this paper we present the work done in the
development of a large vocabulary, speaker-independent continuous speech
recognition hybrid system for European Portuguese. This is a difficult
task due to the early development stage of this technology for European
Portuguese. The development of a system of
this kind for a new language depends on the availability of the appropriate
source components, mainly a speech corpus and large amounts of texts.
This work became possible due to the development of a new database
(BD-PUBLICO), a large vocabulary speech corpus for the European Portuguese
language developed by us over the last two years.
Authors:
Mukund Padmanabhan, IBM T. J. Watson Research Center (USA)
Bhuvana Ramabhadran, IBM T. J. Watson Research Center (USA)
Sankar Basu, IBM T. J. Watson Research Center (USA)
Page (NA) Paper number 210
Abstract:
In this paper we describe a new testbed for developing speech recognition
algorithms - a VoiceMail transcription task, analogous to other tasks
such as the Switchboard, CallHome, and the Hub 4 tasks, which are currently
used by speech recognition researchers. We describe (i) the use of
compound words to model co-articulation effects in commonly occurring
words, (ii) the use of linguistically derived phonological rules (that
model phenomena such as degemination, palatalization, etc.) for other
words, (iii) a new model-complexity adaptation technique that uses a
discriminant measure to allocate Gaussians to the mixtures modelling
the acoustic units (allophones), (iv) experiments using different feature
extraction methods, and (v) an investigation of the efficacy of some
well-known acoustic adaptation techniques on this task. We then report
experimental results showing that most of the modelling techniques we
investigated were useful in reducing the word error rate - from 87% (when
decoding with Switchboard acoustic and language models) to 38%.
Authors:
Sira Palazuelos, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Santiago Aguilera, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Jose Rodrigo, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Juan Godino, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Page (NA) Paper number 381
Abstract:
In this paper, we describe and evaluate recent work and results achieved
in a word prediction system for Spanish, applying both statistical
and grammatical methods: word pairs and trios, bipos and tripos models,
and a stochastic context-free grammar. The predictor is included in
several software applications to enhance the typing rate for users
with motor impairments who can only use a switch for writing. These
users have difficulties in writing, and usually in communicating with
other people, and the inclusion of word prediction in the system allows
them to increase their typing rate from 2-3 words per minute up to
8-10 words per minute.
Authors:
Kishore Papineni, IBM TJ Watson Research Center (USA)
Satya Dharanipragada, IBM TJ Watson Research Center (USA)
Page (NA) Paper number 559
Abstract:
Consider generating phonetic baseforms from orthographic spellings.
Availability of a segmentation (grouping) of the characters can be
exploited to achieve better phonetic translation. We are interested
in building segmentation models without using explicit segmentation
or alignment information during training. The heart of our segmentation
algorithm is a conditional probabilistic model that predicts whether
there are fewer, the same number of, or more phones than characters in
the word. We use just this contraction-expansion information on whole
words for training the model. The model has three components: a prior
model, a set of features, and weights of the features. The features are
selected and the weights assigned in a maximum entropy framework. Even
though the model is trained on whole words, we effectively localize it
on substrings to induce a segmentation of the word. Segmentation is
also aided by considering substrings in both forward and backward directions.
Authors:
Adam Berger, Carnegie Mellon University (USA)
Harry Printz, IBM (USA)
Page (NA) Paper number 679
Abstract:
We describe a large-scale investigation of dependency grammar language
models. Our work includes several significant departures from earlier
studies, notably a larger training corpus, improved model structure,
different feature types, new feature selection methods, and more coherent
training and test data. We report word error rate (WER) results of
a speech recognition experiment, in which we used these models to rescore
the output of the IBM speech recognition system.
Authors:
Ganesh N. Ramaswamy, I.B.M. Research Center (USA)
Harry Printz, I.B.M. Research Center (USA)
Ponani S. Gopalakrishnan, I. B. M. Research Center (USA)
Page (NA) Paper number 611
Abstract:
In this paper, we propose a new bootstrap technique to build domain-dependent
language models. We assume that a seed corpus consisting of a small
amount of data relevant to the new domain is available, which is used
to build a reference language model. We also assume the availability
of an external corpus, consisting of a large amount of data from various
sources, which need not be directly relevant to the domain of interest.
We use the reference language model and a suitable metric, such as
the perplexity measure, to select sentences from the external corpus
that are relevant to the domain. Once we have a sufficient number of
new sentences, we can rebuild the reference language model. We then
continue to select additional sentences from the external corpus, and
this process continues to iterate until some satisfactory termination
point is achieved. We also describe several methods to further enhance
the bootstrap technique, such as combining it with mixture modeling
and class-based modeling. The performance of the proposed approach
was evaluated through a set of experiments, and the results are discussed.
Analysis of the convergence properties of the approach and the conditions
that need to be satisfied by the external corpus and the seed corpus
are highlighted, but detailed work on these issues is deferred for
the future.
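The core selection step of this bootstrap technique can be sketched as scoring external sentences with a reference model and keeping those below a perplexity threshold. For brevity the sketch uses a unigram reference model with an invented floor probability for unseen words; the paper would use a full reference language model.

```python
from math import exp, log

def sentence_perplexity(sentence, unigram, floor=1e-6):
    """Per-word perplexity of a sentence under a unigram reference model.
    `floor` is an assumed probability for out-of-vocabulary words."""
    logprob = sum(log(unigram.get(w, floor)) for w in sentence)
    return exp(-logprob / len(sentence))

def select_relevant(external, unigram, threshold):
    """Keep the external-corpus sentences the reference model finds
    unsurprising, i.e. whose perplexity is below the threshold."""
    return [s for s in external
            if sentence_perplexity(s, unigram) < threshold]
```

After selection, the reference model would be rebuilt on the seed corpus plus the kept sentences, and the select-rebuild cycle iterated until a termination criterion is met.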
Authors:
Joan-Andreu Sánchez, DSIC-Universidad Politecnica de Valencia (Spain)
José-Miguel Benedi, DSIC-Universidad Politecnica de Valencia (Spain)
Page (NA) Paper number 1032
Abstract:
The use of the Inside-Outside algorithm for the estimation of the probability
distributions of Stochastic Context-Free Grammars in Natural-Language
processing is restricted due to the time complexity per iteration and
the large number of iterations that it needs to converge. Alternatively,
an algorithm based on the Viterbi score (VS) is used. This VS algorithm
converges more rapidly, but obtains less competitive models. We propose
a new algorithm that only considers the k-best derivations in the estimation
process. The time complexity per iteration of the algorithm is practically
the same as that of the VS algorithm. The proposed algorithm has been
tested on a part of the Wall Street Journal task processed in the
Penn Treebank project. The test-set perplexity for the VS algorithm
was 24.22, whereas it was 22.73 for the proposed algorithm with
k=3, an improvement of 6.15%.
Authors:
Ananth Sankar, SRI International (USA)
Page (NA) Paper number 194
Abstract:
We present two different approaches for robust estimation of the parameters
of context-dependent hidden Markov models (HMMs) for speech recognition.
The first approach, the Gaussian Merging-Splitting (GMS) algorithm,
uses Gaussian splitting to uniformly distribute the Gaussians in acoustic
space, and merging so as to compute only those Gaussians that have
enough data for robust estimation. We show that this method is more
robust than our previous training technique. The second approach, called
tied-transform HMMs, uses maximum-likelihood transformation-based acoustic
adaptation algorithms to transform a small HMM to a much larger HMM.
Since the transforms are shared or tied among Gaussians in the larger
HMM, robust estimation is achieved. We show that this approach gives
a significant improvement in recognition accuracy and a dramatic reduction
in memory needed to store the models.
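The merging step can be illustrated with the standard moment-matching formula for combining two weighted Gaussians (a 1-D sketch; the GMS algorithm itself operates on multivariate mixtures and applies its own merge criteria):

```python
def merge_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Moment-matched merge of two weighted 1-D Gaussians: the merged mean
    and variance preserve the first two moments of the original pair."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    var = (w1 * (var1 + mu1 ** 2) + w2 * (var2 + mu2 ** 2)) / w - mu ** 2
    return w, mu, var
```

Merging two well-separated components correctly inflates the variance, which is why merging is reserved for components that lack enough data for robust estimation on their own.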
Authors:
Kristie Seymore, Carnegie Mellon University (USA)
Stanley Chen, Carnegie Mellon University (USA)
Ronald Rosenfeld, Carnegie Mellon University (USA)
Page (NA) Paper number 897
Abstract:
Topic adaptation for language modeling is concerned with adjusting
the probabilities in a language model to better reflect the expected
frequencies of topical words for a new document. We present a novel
technique for adapting a language model to the topic of a document,
using a nonlinear interpolation of n-gram language models. A three-way,
mutually exclusive division of the vocabulary into general, on-topic
and off-topic word classes is used to combine word predictions from
a topic-specific and a general language model. We achieve a slight
decrease in perplexity and speech recognition word error rate on a
Broadcast News test set using these techniques. Our results are compared
to results obtained through linear interpolation of topic models.
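The three-way routing of word predictions can be sketched as follows; the function and class names are illustrative assumptions (not the authors' formulation), and a real model would renormalize the combined distribution:

```python
# Sketch: combine a topic-specific and a general n-gram model using a
# three-way split of the vocabulary into general, on-topic, and off-topic
# word classes.
def combined_prob(word, context, p_topic, p_general, word_class):
    """Route each word's prediction to the model expected to be reliable."""
    cls = word_class.get(word, "general")
    if cls == "on-topic":        # topical words: trust the topic model
        return p_topic(word, context)
    elif cls == "off-topic":     # words absent from the topic: general model
        return p_general(word, context)
    else:                        # shared vocabulary: simple average stand-in
        return 0.5 * (p_topic(word, context) + p_general(word, context))
```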
Authors:
Kazuyuki Takagi, The University of Electro-Communications (Japan)
Rei Oguro, The University of Electro-Communications (Japan)
Kenji Hashimoto, The University of Electro-Communications (Japan)
Kazuhiko Ozeki, The University of Electro-Communications (Japan)
Page (NA) Paper number 26
Abstract:
This paper reports our work to improve a bigram language model for
Japanese TV broadcast news speech recognition. First, frequent word
strings were grouped into phrases so that they could be added to the
lexicon as new units of recognition. The test-set perplexity improved
when frequent function-word strings were used as additional recognition
units. Speech recognition performance improved both by grouping function-word
strings and by grouping compound nouns selected by word-association
ratio. Secondly, to alleviate the OOV problem related to nouns, we
built and tested a language model that switches its noun lexicon according
to the domain of the article to be recognized next.
Authors:
Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)
Page (NA) Paper number 342
Abstract:
This paper addresses the implementation of a robust front-end, based
on a Voiced-Unvoiced (V-U) decision, for a large-vocabulary Continuous
Speech Recognition (CSR) system. Our approach is based on the separation
of the speech signal into voiced and unvoiced components, so that speech
enhancement can be performed on each component separately. The voiced
component is enhanced using adaptive comb filtering, whereas the unvoiced
component is enhanced using a modified spectral subtraction approach.
Experiments show that the proposed CSR system is robust in additive
noisy environments (SNR down to 0 dB).
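The unvoiced-branch enhancement can be illustrated with basic magnitude-domain spectral subtraction (a generic sketch of the technique; `alpha` and `floor` are assumed parameters, not the authors' modified-spectral-subtraction settings):

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, floor=0.01):
    """Subtract a scaled noise-magnitude estimate bin by bin, flooring the
    result at a small fraction of the noisy magnitude so no bin goes
    negative (which would cause 'musical noise' artifacts)."""
    return [max(n - alpha * d, floor * n)
            for n, d in zip(noisy_mag, noise_mag)]
```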
Authors:
Juan Carlos Torrecilla, Telefonica Investigacion Y Desarrollo (Spain)
Ismael Cortázar, Telefonica Investigacion Y Desarrollo (Spain)
Luis A. Hernández, Telefonica Investigacion Y Desarrollo (Spain)
Page (NA) Paper number 321
Abstract:
In this paper we propose an efficient beam search procedure that combines
well-known search techniques, such as lexicon organization using tree-structured
grammars, with a novel approach that uses different types of subword
units depending on the local scores of the active words. An efficient
double-tree structure using phonemes and triphones is presented. Experimental
results on an isolated word recognition system reveal that the proposed
strategy achieves important reductions in computational cost with only
negligible increases in recognition errors. Tests over a vocabulary
of 955 Spanish words show a 0.5% increase in error rate for a 32%
reduction in the number of senones to be evaluated.
Authors:
Paul van Mulbregt, Dragon Systems, Inc. (USA)
Ira Carp, Dragon Systems, Inc. (USA)
Lawrence Gillick, Dragon Systems, Inc. (USA)
Steve Lowe, Dragon Systems, Inc. (USA)
Jon Yamron, Dragon Systems, Inc. (USA)
Page (NA) Paper number 116
Abstract:
Expertise in the automatic transcription of broadcast speech has progressed
to the point of being able to use the resulting transcripts for information
retrieval purposes. In this paper, we describe a corpus of automatically
recognized broadcast news and a method for segmenting the broadcast into
stories, and then apply this method to retrieve stories relating
to a specific topic. The method is based on Hidden Markov Models and
is analogous to the usual application of HMMs in speech recognition.
Authors:
Philip O'Neill, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Bernard Doherty, The Queen's University of Belfast (Ireland)
Wooi Haw Tan, The Queen's University of Belfast (Ireland)
Paul McCourt, The Queen's University of Belfast (Ireland)
Page (NA) Paper number 178
Abstract:
The choice of speech unit affects the accuracy, complexity, expandability
and ease of adaptation of ASRs to speaker and environmental variations.
This paper explores a method of subword modelling based on the concept
of multi-phone strings. The motivation for using the longer-duration
multi-phone strings is to reduce the loss of contextual information,
cross-phone correlation, and transition information. Multi-phone strings
are an alternative to context-dependent phones, and they include many
of the syllables. An advantage of multi-phone units is the existence
of more than one valid multi-phone transcription for each monophone
sequence; this can be used to improve ASR accuracy. A particular case
of multi-phone strings, namely phone pairs, is investigated in detail.
Experimental evaluations on TIMIT and WSJCAM0 are presented.
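A minimal sketch of deriving one phone-pair segmentation from a monophone sequence (illustrative only; the units and separator are assumptions, and the paper exploits the fact that several valid segmentations exist):

```python
def to_phone_pairs(phones):
    """Pair adjacent phones into phone-pair units; an unpaired final
    phone remains a monophone."""
    units = []
    for i in range(0, len(phones) - 1, 2):
        units.append(phones[i] + "+" + phones[i + 1])
    if len(phones) % 2:          # odd-length sequence: keep last phone alone
        units.append(phones[-1])
    return units
```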
Authors:
Nanette M. Veilleux, Boston University, Metropolitan College (USA)
Stefanie Shattuck-Hufnagel, Mass Institute of Technology, Research Lab of Electronics (USA)
Page (NA) Paper number 382
Abstract:
In a pilot study of phonetic modification of function words in two spontaneous
speech dialogues, 99 utterances of the syllable /tu/ corresponding
to the morphemes to, two, too, -to and to- included ten pronunciation
variants. Factors influencing phonetic modification included phonetic
context, prosody, part of speech, adjacent disfluency and individual
speaker. 11% of the acoustic landmarks defining /t/ closure, /t/ release
and vowel jaw opening maximum were not detectable in hand labelling.
In a separate corpus, 59% of recognition errors involved grammatical
or function words such as conjunctions, articles, prepositions, pronouns
and auxiliary verbs, and for 17 tokens of /tu/, half were misrecognized.
Implications of these preliminary results for linguistic theory, cognitive
modelling of speech processing and automatic speech recognition are
discussed.
Authors:
Fuliang Weng, SRI, International (USA)
Andreas Stolcke, SRI, International (USA)
Ananth Sankar, SRI, International (USA)
Page (NA) Paper number 136
Abstract:
We describe two new techniques for reducing word lattice sizes without
eliminating hypotheses. The first technique is an algorithm to reduce
the size of non-deterministic bigram word lattices by merging redundant
nodes. On bigram word lattices generated from Hub4 Broadcast News
speech, this reduces lattice sizes by half on average. The second technique
is an improved algorithm for expanding lattices with trigram language
models. Backed-off trigram probabilities are encoded without node
duplication by factoring the probabilities into bigram probabilities
and backoff weights, and duplicating nodes only for explicit trigrams.
Experiments on Broadcast News show that this method reduces trigram
lattice sizes by a factor of 6, and reduces expansion time by more
than a factor of 10. Compared to conventionally expanded lattices,
recognition with the compactly expanded lattices was also found to
be 40% faster, without affecting recognition accuracy.
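The factoring that enables the compact expansion can be sketched as follows, assuming dictionary-backed probability tables (`trigrams`, `bigrams`, and `backoff` are illustrative names, not the authors' data structures):

```python
def trigram_prob(w1, w2, w3, trigrams, bigrams, backoff):
    """P(w3 | w1, w2): use the explicit trigram if present; otherwise
    back off to bo(w1, w2) * P(w3 | w2), so only explicit trigrams
    require node duplication in the lattice."""
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    return backoff.get((w1, w2), 1.0) * bigrams[(w2, w3)]
```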
Authors:
Mirjam Wester, A2RT, University of Nijmegen (The Netherlands)
Judith M. Kessens, A2RT, University of Nijmegen (The Netherlands)
Helmer Strik, A2RT, University of Nijmegen (The Netherlands)
Page (NA) Paper number 371
Abstract:
This paper describes how the performance of a continuous speech recognizer
for Dutch has been improved by modeling pronunciation variation. We
used three methods to model pronunciation variation. First, within-word
variation was dealt with. Phonological rules were applied to the words
in the lexicon, thus automatically generating pronunciation variants.
Secondly, cross-word pronunciation variation was modeled using two
different approaches. In the first approach, cross-word processes were
modeled by adding the variants as separate words to the lexicon; in
the second, this was done using multi-words. Recognition experiments
were carried out for each of the methods. A significant improvement
was found for modeling within-word variation. Furthermore, modeling
cross-word processes using multi-words led to significantly better
results than modeling them using separate words in the lexicon.
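Rule-based generation of within-word variants can be sketched as follows; the rule notation and the toy schwa-plus-/n/ deletion rule are assumptions for illustration, not the authors' rule set:

```python
def apply_optional_rules(pron, rules):
    """Return all pronunciations reachable by optionally applying each
    (pattern, replacement) phonological rule zero or more times."""
    variants = {pron}
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            for v in list(variants):
                new = v.replace(pattern, replacement)
                if new not in variants:
                    variants.add(new)
                    changed = True
    return sorted(variants)
```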
Authors:
Edward W.D. Whittaker, Cambridge University (U.K.)
Philip C. Woodland, Cambridge University (U.K.)
Page (NA) Paper number 967
Abstract:
In this paper the main differences between language modelling of Russian
and English are examined. A Russian corpus and a comparable English
corpus are described. The effects of high inflectionality in Russian
and the relationship between the out-of-vocabulary rate and vocabulary
size are investigated. Standard word and class N-gram language modelling
techniques are applied to the two corpora and perplexity results are
reported. A novel approach to the modelling of inflected languages
is proposed and its efficacy compared with the other techniques.
Authors:
Petra Witschel, Siemens AG (Germany)
Page (NA) Paper number 471
Abstract:
Stochastic language models based on word n-grams require huge amounts
of training material, especially for large-vocabulary systems. With
class-based n-grams, much less training material is necessary and
higher coverage can be achieved. Building classes on the basis of linguistic
characteristics (POS) has the advantage that new words can be assigned
easily. Until now, class sets for POS-based language models have usually
been defined by linguistic experts. In this paper we present an approach
in which, for a given number of classes, a class set is generated automatically
such that the entropy of the language model is minimized. We perform
experiments on German medical reports comprising about 1.2 million words
of text and a vocabulary of 24000 words. Using our approach, we generate
an exemplary set of 196 optimized POS classes. Comparing the optimized
POS-based language model to a language model based on 196 conventionally
defined classes, we obtain an improvement of up to 10% in test-set perplexity.
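The class-based factorization underlying POS n-gram models can be sketched as follows (the dictionary-backed tables are illustrative assumptions); this factoring is what makes class models trainable from far less data than word n-grams:

```python
def class_bigram_prob(w, w_prev, word2class, p_class_trans, p_word_given_class):
    """P(w | w_prev) factored as P(class | prev_class) * P(w | class)."""
    c, c_prev = word2class[w], word2class[w_prev]
    return p_class_trans[(c_prev, c)] * p_word_given_class[(w, c)]
```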
Authors:
Mark Wright, BT Laboratories (U.K.)
Simon Hovell, BT Laboratories (U.K.)
Simon Ringland, BT Laboratories (U.K.)
Page (NA) Paper number 450
Abstract:
Many of the pruning strategies used to remove less likely hypotheses
from the search space in large vocabulary speech recognition (LVR)
systems have a peak search space many times greater than the average
search space. This paper discusses two such strategies used within
BT's speech recognition architecture, Step pruning and Histogram pruning.
Two-tier pruning is proposed as a simple but powerful extension applicable
to either of the above strategies. This seeks to limit the expansion
of the search space between the prune and acoustic match processes
without affecting accuracy. It is shown that the application of two-tier
pruning to either strategy reduces peak search effort, and results
in an average reduction in run time of 33% and 53% for step pruning
and histogram pruning respectively, with no loss in top-N accuracy.
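Histogram pruning in its basic rank-based form can be sketched as follows (a generic sketch; the two-tier extension described above adds a second, tighter limit between the prune and acoustic-match processes):

```python
import heapq

def histogram_prune(hypotheses, max_active):
    """Keep only the top max_active hypotheses by score, bounding the
    peak search space. hypotheses: list of (score, state), higher is better."""
    if len(hypotheses) <= max_active:
        return hypotheses
    return heapq.nlargest(max_active, hypotheses, key=lambda h: h[0])
```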
Authors:
George Zavaliagkos, BBN Technologies (USA)
Man-Hung Siu, BBN Technologies (USA)
Thomas Colthurst, BBN Technologies (USA)
Jayadev Billa, BBN Technologies (USA)
Page (NA) Paper number 1007
Abstract:
This paper explores techniques for utilizing untranscribed training
data pools to increase the available training data for automatic speech
recognition systems. It has been well established that current speech
recognition technology, especially in Large Vocabulary Conversational
Speech Recognition (LVCSR), is largely language independent, and that
the dominant factor with regard to performance in a given language
is the amount of available training data. The paper addresses this
need for increased training data by presenting ways to use untranscribed
acoustic data to increase the training data size and thus improve speech
recognition.
Authors:
Ea-Ee Jan, IBM Thomas J. Watson Research Center (USA)
Raimo Bakis, IBM Thomas J. Watson Research Center (USA)
Fu-Hua Liu, IBM Thomas J. Watson Research Center (USA)
Michael Picheny, IBM Thomas J. Watson Research Center (USA)
Page (NA) Paper number 862
Abstract:
Large vocabulary automatic speech recognition might assist hearing
impaired telephone users by displaying a transcription of the incoming
side of the conversation, but the system would have to achieve sufficient
accuracy on conversational-style, telephone-bandwidth speech. We describe
our development work toward such a system. This work comprised three
phases: Experiments with clean data filtered to 200-3500Hz, experiments
with real telephone data, and language model development. In the first
phase, the speaker independent error rate was reduced from 25% to 12%
by using MLLT, increasing the number of cepstral components from 9
to 13, and increasing the number of Gaussians from 30,000 to 120,000.
The resulting system, however, performed less well on actual telephony,
producing an error rate of 28.4%. By additional adaptation and the
use of an LDA and CDCN combination, the error rate was reduced to 19.1%.
Speaker adaptation further reduced the error rate to 10.96%. These results
were obtained with read speech. To explore the language-model requirements
in a more realistic situation, we collected some conversational speech
with an arrangement in which one participant could not hear the conversation
but only saw recognizer output on a screen. We found that a mixture
of language models, one derived from the Switchboard corpus and the
other from prepared texts, resulted in approximately 10% fewer errors
than either model alone.
Authors:
Antonio Bonafonte, Universitat Politecnica de Catalunya (Spain)
José B. Mariño, Universitat Politecnica de Catalunya (Spain)
Page (NA) Paper number 1125
Abstract:
X-grams are a generalization of n-grams in which the number of previous
conditioning words differs for each case and is decided from the
training data. X-grams reduce perplexity with respect to trigrams and
require fewer parameters. In this paper, the representation of
the x-grams using finite state automata is considered. This representation
leads to a new model, the non-deterministic x-grams, an approximation
that is much more efficient while suffering only a small degradation
in modeling capability. Empirical experiments on a continuous speech
recognition task show how, for each ending word, the number of transitions
is reduced from 1222 (the size of the lexicon) to around 66.
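A variable-length-context lookup in the spirit of x-grams can be sketched as follows (the dictionary representation is an assumption for illustration; the paper represents x-grams as finite state automata):

```python
def xgram_prob(word, history, model):
    """Use the longest conditioning context present in the model,
    shortening it word by word; () is the unigram fallback.
    model maps (context_tuple, word) -> probability."""
    for i in range(len(history)):
        key = (tuple(history[i:]), word)
        if key in model:
            return model[key]
    return model[((), word)]
```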