Authors:
Daniel Azzopardi, BT Labs (U.K.)
Shahram Semnani, BT Labs (U.K.)
Ben Milner, BT Labs (U.K.)
Richard Wiseman, BT Labs (U.K.)
Page (NA) Paper number 543
Abstract:
A combination of techniques for increasing recognition accuracy has
been developed for an automated corporate directory system with 120,000
entries. Using a traditional recogniser, an accuracy of around 60% had
previously been obtained for both a 156 town name task and a 1108 road
name task. The techniques presented in this paper comprise front-end
modifications, context dependent models, an improved lexicon and noise
modelling. Together they increased recognition accuracy to around 90%.
Authors:
Aruna Bayya, Rockwell Semiconductor Systems, Newport Beach, CA (USA)
Page (NA) Paper number 572
Abstract:
In this paper, we propose a new rejection criterion applicable specifically
to limited-training speech recognition systems such as Speaker-Dependent
(SD) recognition systems. The new criterion uses confidence measures
as well as speaker-specific out-of-vocabulary (OOV) models. The OOV
models are created from the same training data that is available to
create the in-vocabulary (IV) word models. We describe the method for
creating these speaker-specific out-of-vocabulary models from limited
training data. We also define a fairly robust confidence measure to
reject the OOV words. The results presented in this paper demonstrate
the effectiveness of the new criterion in an SD recognition task under
various conditions.
Authors:
Ruxin Chen, SONY Research Labs, San Jose (USA)
Miyuki Tanaka, SONY Research Labs, San Jose (USA)
Duanpei Wu, SONY Research Labs, San Jose (USA)
Lex Olorenshaw, SONY Research Labs, San Jose (USA)
Mariscela Amador, SONY Research Labs, San Jose (USA)
Page (NA) Paper number 583
Abstract:
This paper reports on a large vocabulary speaker independent isolated
word recognizer targeting 50,000 words. The system supports a unique
four-layer sharing structure for either continuous HMM or discrete
HMM. Evaluation is performed using a dictionary of 5000 US city names,
a dictionary of the 5000 most frequent English words, a dictionary
of 50,000 English words, and the 110,000 word CMU English dictionary.
For these dictionaries, recognition accuracy ranges from 90% to 93%
for the top 3 results.
Authors:
Rathinavelu Chengalvarayan, Lucent Technologies (USA)
Page (NA) Paper number 22
Abstract:
This paper presents a new technique for modelling heterogeneous data
sources such as speech signals received via distinctly different channels.
Such sources arise when an automatic speech recognition system is deployed
in wireless telephony, where highly heterogeneous channels coexist and
interoperate. The key problem is that a simple model may become inadequate
to describe accurately the diversity of the signal, resulting in
unsatisfactory recognition performance. To cope with this problem, different
hybrid modelling techniques are proposed and investigated in this paper
by combining models from two different environments, wireline and wireless.
Authors:
Jae-Seung Choi, LG Corporate Institute of Technology (Korea)
Jong-Seok Lee, LG Corporate Institute of Technology (Korea)
Hee-Youn Lee, LG Corporate Institute of Technology (Korea)
Page (NA) Paper number 623
Abstract:
For large vocabulary recognition systems, as well as for flexible vocabulary
applications using hidden Markov models (HMMs), parameter smoothing and
tying have been used to increase the reliability of models. This paper
describes bottom-up and top-down clustering techniques for state level
tying. It also describes a method of applying parameter smoothing
to the clustered states and covariance matrix of the semicontinuous hidden
Markov model (SCHMM). We applied the co-occurrence smoothing method (CSM)
for senone smoothing. We present a new parameter smoothing method
and apply it to the distribution of the discrete hidden Markov model (DHMM)
in the training procedure. A new model composition method for unseen
triphone modeling in bottom-up clustering is also proposed and compared
with the traditional context-independent model backing-off method.
Authors:
Javier Ferreiros, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Javier Macías-Guarasa, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Ascensión Gallardo, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
José Colás, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Ricardo Córdoba, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
José Manuel Pardo, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Luis Villarrubia, Telefonica Investigacion y Desarrollo (Spain)
Page (NA) Paper number 987
Abstract:
At ICSLP'96 we presented a flexible, large vocabulary, speaker independent,
isolated-word preselection system in a telephone environment, using
a two stage, bottom-up strategy. We achieved reasonable performance
in large and very large vocabulary tasks, ranging from 1200 to 10000
words. In this paper, we describe recent studies we have carried
out on the system, aimed in two directions: handling of non-speech
sounds in the speech signal (we consider lip, respiration and click
noises); and making the preselection lists dynamic in length, to reduce
the average computational load. In the first case, we want to model
non-speech sounds, as these effects are crucial in real-life situations,
leading to wrong endpointing and increased error rates. In the second,
we are interested in integrating any available system parameter to
calculate the preselection list length to use, and have applied both
parametric and non-parametric methods.
Authors:
Masakatsu Hoshimi, Matsushita Research Institute Tokyo, Inc., Tohoku University (Japan)
Maki Yamada, Matsushita Research Institute Tokyo, Inc. (Japan)
Katsuyuki Niyada, Central Research Labs., Matsushita Electric Industrial Co., Ltd. (Japan)
Shozo Makino, Tohoku University (Japan)
Page (NA) Paper number 257
Abstract:
As an input method for rapidly spreading small portable information
devices, speaker independent speech recognition technology
that can be embedded on a single DSP is now urgently needed. We
have reported a speech recognition method using the phoneme similarity
vector as a feature vector, which is quite robust to reduced
precision of the feature parameters. We have also developed a recognition
board with a single DSP, which handles a 100-word vocabulary using only
the internal memory of the DSP [1][2]. In this report, we propose
a new technique which makes our recognition method more robust:
a newly introduced noise standard template is used together with the
traditional phoneme standard templates for calculating the phoneme
similarity vector, which realizes precise word-spotting. When the proposed
noise robustness method was tested on 100-word isolated vocabulary speech
from 50 subjects, a recognition accuracy of 94.7% was obtained under
various noisy environments.
Authors:
Fran H.L. Jian, Dept. Linguistics, University of Reading (U.K.)
Page (NA) Paper number 146
Abstract:
This paper addresses the difficulties associated with automatically
distinguishing the seven Taiwanese tones. The tone recogniser is an
essential component of any automatic speech recognition system customised
for tone languages such as Taiwanese. We show that it is difficult
to distinguish between the Taiwanese tones using the fundamental
frequency contours alone, and that the task is simplified by employing energy
contour features alongside the fundamental frequency features. To allow
energy to be accommodated into the classification model, an energy-contour
feature extraction approach is presented. The proposed approach is
inspired by the ADSR model employed in musical instrument synthesis
where the envelopes of complex sounds are modeled employing only a
few parameters. Our experiments demonstrate that the inclusion of
energy into the recognition model allows the seven Taiwanese tones
to be discriminated successfully. The paper also presents acoustical
measurements of the fundamental frequency and energy features described
Authors:
Finn Tore Johansen, Telenor R&D (Norway)
Page (NA) Paper number 889
Abstract:
This paper presents results from a number of flexible vocabulary recognition
experiments on the Norwegian SpeechDat(II) database. A common phoneme-based
recogniser design procedure is tested on five different tasks, and
for five different training sets. Results verify that reasonably accurate
recognisers can be built with the database, using standard HMM techniques.
They also quantify the importance of training set selection for small
and medium vocabulary tasks.
Authors:
Montri Karnjanadecha, Old Dominion University (USA)
Stephen A. Zahorian, Old Dominion University (USA)
Page (NA) Paper number 1110
Abstract:
Spectral/temporal segment features are adapted for isolated word recognition
and tested with the entire English alphabet set using Hidden Markov
Models. The ISOLET database from OGI and the HTK toolkit from Cambridge
University were used to test our feature extraction technique. With
our feature set we were able to achieve 97.3% recognition accuracy
on test data with one pass using a whole word based recognizer. Gaussian
noise was also added to evaluate robustness of the feature set. We
were able to obtain recognition accuracies of 49.6% and 84.3% at SNR
of -10dB and 0dB, respectively. Linear discriminant analysis was also
applied to the initial feature set for a number of feature configurations
and noise levels but, generally, the performance was not improved.
We conclude that the initial feature computations used are both very
efficient (best results obtained with 50 total features) and robust
in the presence of noise.
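The noise robustness test described above can be sketched as follows. This is a minimal illustration, assuming white Gaussian noise is scaled so that the ratio of average signal power to average noise power matches the target SNR. The abstract does not state how the noise was scaled, and the function names `add_gaussian_noise` and `measured_snr_db` are hypothetical.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise at `snr_db` dB.

    Assumption (not from the abstract): SNR is defined as
    10 * log10(mean signal power / mean noise power).
    """
    if not signal:
        raise ValueError("signal must not be empty")
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # Solve SNR = 10 * log10(Ps / Pn) for the noise power Pn.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)
```

At -10 dB the noise power is ten times the signal power, and at 0 dB the two are equal, which gives a sense of how severe the two test conditions quoted in the abstract are.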
Authors:
Hisashi Kawai, KDD R&D Laboratories Inc. (Japan)
Norio Higuchi, KDD R&D Laboratories Inc. (Japan)
Page (NA) Paper number 694
Abstract:
This paper describes experimental results on whole word HMM-based speech
recognition of connected digits in Japanese collected through the telephone
network. The training data comprises 756860 digits uttered by 1963
speakers, while the testing data comprises 304212 digits uttered by
852 speakers. The best performance was a word error rate of 0.42% for
known length strings obtained using context dependent models. The word
error rate was measured as a function of the training data size. The
result showed that at least 3302 samples per speaker and 344 speakers
are necessary and sufficient for context independent training. Error
analysis was conducted on a fraction of the population bearing the
major part of recognition errors. The results suggested that such speakers
arise not simply from speaker characteristics but from a combination
of speaker characteristics and environmental conditions.
Authors:
Takuya Koizumi, Dept. of Information Science, Fukui University (Japan)
Shuji Taniguchi, Dept. of Information Science, Fukui University (Japan)
Kazuhiro Kohtoh, Dept. of Information Science, Fukui University (Japan)
Page (NA) Paper number 51
Abstract:
This paper deals with a subword-unit-based isolated word recognition
system with enhanced speaker independence. The subword is defined as
a part of a word whose central portion has rather stationary or time-invariant
short-time spectra, while the portions near its ends have rapidly varying
short-time spectra. In this system each isolated word is decomposed
into a sequence of subwords, each of which is identified by means of
a particular semi-continuous hidden Markov model that is named a subword
HMM. Each isolated word is recognized by a particular set of concatenated
subword HMMs that is designated as a word HMM. Subword boundaries within
a word are detected by finding peaks of the magnitude of delta cepstral
vectors obtained from the word. The system attains average word recognition
rates over 87% for a number of Japanese words uttered by ten native
male speakers.
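The boundary detection step mentioned in the abstract, finding peaks in the magnitude of the delta cepstral vectors, can be sketched as below. This is only an illustration under stated assumptions: the delta is taken as a two-frame central difference, a fixed `threshold` suppresses small peaks, and the names `delta_magnitudes` and `subword_boundaries` are hypothetical. The abstract does not give the delta window or the peak-picking rule.

```python
import math


def delta_magnitudes(cepstra):
    """Euclidean magnitude of a central-difference delta for each frame.

    `cepstra` is a list of equal-length cepstral vectors, one per frame.
    The first and last frames have no two-sided neighbours and get 0.0.
    """
    n = len(cepstra)
    mags = [0.0] * n
    for t in range(1, n - 1):
        diff = [(b - a) / 2.0 for a, b in zip(cepstra[t - 1], cepstra[t + 1])]
        mags[t] = math.sqrt(sum(d * d for d in diff))
    return mags


def subword_boundaries(cepstra, threshold=0.1):
    """Return frame indices where the delta magnitude has a local peak.

    A flat-topped peak is reported once, at its first frame.
    """
    mags = delta_magnitudes(cepstra)
    peaks = []
    for t in range(1, len(mags) - 1):
        rising = mags[t] > mags[t - 1]
        not_below_next = mags[t] >= mags[t + 1]
        if mags[t] >= threshold and rising and not_below_next:
            peaks.append(t)
    return peaks
```

For a sequence made of three stationary stretches, the function returns one index near each spectral transition, which matches the definition of a subword as a stationary centre bounded by rapidly varying ends.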
Authors:
Tomohiro Konuma, Matsushita Research Institute Tokyo, Inc. (Japan)
Tetsu Suzuki, Matsushita Research Institute Tokyo, Inc. (Japan)
Maki Yamada, Matsushita Research Institute Tokyo, Inc. (Japan)
Yoshio Ohno, Matsushita Research Institute Tokyo, Inc. (Japan)
Masakatsu Hoshimi, Matsushita Research Institute Tokyo, Inc., Tohoku University (Japan)
Katsuyuki Niyada, Matsushita Electric Industrial Co.,Ltd. (Japan)
Page (NA) Paper number 198
Abstract:
We present constrained time alignment acoustic models based on phonetic
knowledge and a speaker independent speech recognition method using
the proposed models. Japanese syllable and isolated word recognition
experiments show that the models are robust to intra- and inter-speaker
variation such as acoustic diversity. Furthermore, word recognition
tests under conditions such as noisy environments and endpoint-free
matching demonstrate the feasibility of the proposed models.
Authors:
Ki Yong Lee, Soongsil University (Korea)
Joohun Lee, Dong-Ah Broadcasting College (Korea)
Page (NA) Paper number 408
Abstract:
In this paper, a time domain approach for speech recognition is developed.
A nonstationary autoregressive (AR) hidden Markov model (HMM) with
gain contour is proposed for modeling the statistical characteristics
of the speech signal. The parameters of the nonstationary AR model are
modeled by a polynomial function expressed as a linear combination of M
known basis functions. In the proposed model, the speech signal is blocked
into fixed-length frames of samples and modeled by a nonstationary AR
model controlled by Markov switching sequences at each frame. Given the
HMM parameter set of the speech, a gain-adapted recognition algorithm is
developed for speech recognition.
Authors:
Ren-yuan Lyu, Chang Gung University (Taiwan)
Yuang-jin Chiang, National Tsing Hua University (Taiwan)
Wen-ping Hsieh, National Tsing Hua University (Taiwan)
Page (NA) Paper number 80
Abstract:
In this paper, we apply context dependent phonetic modeling to the
task of large vocabulary (20 thousand words) Taiwanese multi-syllabic
word recognition. Considering the phonetic characteristics of Taiwanese,
right context dependent (RCD) phones are used instead of general triphones.
The RCDs are further clustered at the sub-phone or state
level using a decision tree with a set of context-split questions specially
designed for Taiwanese speech according to acoustic/phonetic knowledge.
For the speaker dependent case, a 7.18% word error rate is achieved.
A real-time prototype system implemented on a Pentium-II personal computer
running MS-Windows 95/NT is also shown to validate the approaches proposed
here.
Authors:
Kazuyo Tanaka, Electrotechnical Laboratory (Japan)
Hiroaki Kojima, Electrotechnical Laboratory (Japan)
Page (NA) Paper number 966
Abstract:
This paper proposes a speech recognition method as an alternative to the
conventional sample-based statistical methods, which are characterized
by the need for large amounts of training speech data. To avoid
this type of heavy processing, the proposed method employs an intermediate
phonetic code system and the calculation of distance between phonetic
code sequences in the symbolic domain. It achieves high efficiency
compared with direct processing of acoustic correlates, although some
deterioration in recognition scores is to be expected. We first describe
the distance calculation method and present specific procedures for
obtaining the intermediate code sequence from input utterances and
for spotting words using the calculation of distance in the symbolic
domain. Preliminary experiments were conducted on isolated word recognition
and phrase spotting in continuous speech. Word recognition results
indicate that the recognition scores obtained by the proposed method
are comparable to those of ordinary phone-HMM-based speech recognition.
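To make the idea of a distance between phonetic code sequences in the symbolic domain concrete, the sketch below uses a weighted edit distance as a stand-in. The abstract does not specify the authors' distance measure, so the algorithm choice, the cost values, and the names `code_sequence_distance` and `nearest_word` are all assumptions made for illustration.

```python
def code_sequence_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Weighted edit distance between two sequences of phonetic codes.

    `sub_cost(x, y)` gives the cost of replacing code x with code y.
    By default a mismatch costs 1.0 and a match costs 0.0.
    """
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,          # delete x
                curr[j - 1] + indel_cost,      # insert y
                prev[j - 1] + sub_cost(x, y),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def nearest_word(codes, lexicon, sub_cost=None):
    """Pick the lexicon entry whose code sequence is closest to `codes`.

    `lexicon` maps each word to its reference code sequence.
    """
    return min(
        lexicon,
        key=lambda w: code_sequence_distance(codes, lexicon[w], sub_cost),
    )
```

Because the comparison operates on short symbol strings instead of frame-level acoustic scores, each lexicon entry costs only a small dynamic-programming table, which is one way to read the efficiency claim in the abstract.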
Authors:
York Chung-Ho Yang, Matsushita Electric Institute of Technology, Taipei (Taiwan)
June-Jei Kuo, Matsushita Electric Institute of Technology, Taipei (Taiwan)
Page (NA) Paper number 505
Abstract:
Phoneme-oriented input with synchronised auto-revision to the pictographic
nature of written Chinese and Chinese speech recognition are two subjects
not often brought together in the same article, nor even in the same proposition.
This paper explores the growing relations between these two subjects
and, in particular, investigates the integration of Chinese
phonetic input and its auto-revision methods (Kuo; 1986, 1987, 1995,
1996) with Chinese isolated word and continuous speech recognition for
portable devices such as mobile telephones. The integration of Chinese
phonetic input using a synchronised auto-revision approach with a small,
high recognition rate Chinese speech recognition kernel for single-chip
DSP applications is introduced in this paper. A Chinese phrase taxonomy
has been defined, and the definition is ready to be obtained from the
system's dictionary.