Authors:
Visarut Ahkuputra, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Somchai Jitapunkul, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Nutthacha Jittiwarangkul, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Ekkarit Maneenoi, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Sawit Kasuriya, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Paper number 283
Abstract:
Recognition of the ten Thai isolated numerals (zero to nine) and of
60 Thai polysyllabic words is compared across different recognition
techniques, namely the Neural Network, the Modified Backpropagation
Neural Network, the Fuzzy-Neural Network, and the Hidden Markov Model.
A 15-state left-to-right discrete hidden Markov model combined with
the vector quantization technique has been studied and compared with
a multilayer perceptron neural network trained with error backpropagation,
with the modified backpropagation, and with a fuzzy-neural network
of the same configuration. The recognition error rates on Thai isolated
numerals using the conventional neural network, the modified neural
network, the fuzzy-neural network, and the hidden Markov model are
26.97 percent, 22.00 percent, 8.50 percent, and 15.75 percent, respectively.
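As a rough illustration of the HMM side of this comparison, the sketch below sets up a 15-state left-to-right transition structure and a nearest-centroid vector quantizer. The state count follows the abstract, but the codebook size, feature dimension, and toy data are assumptions, not the authors' implementation.

```python
# Sketch of a 15-state left-to-right discrete HMM front end with vector
# quantization. Only the topology follows the abstract; the codebook size
# and the random "features" below are illustrative assumptions.
import numpy as np

N_STATES = 15          # left-to-right states, as in the abstract
CODEBOOK_SIZE = 256    # assumed VQ codebook size

# Left-to-right transition matrix: each state may stay or move one step right.
A = np.zeros((N_STATES, N_STATES))
for i in range(N_STATES - 1):
    A[i, i] = 0.5
    A[i, i + 1] = 0.5
A[-1, -1] = 1.0

# Discrete emission matrix B[state, symbol], initialised uniformly.
B = np.full((N_STATES, CODEBOOK_SIZE), 1.0 / CODEBOOK_SIZE)

def vector_quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook centroid."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Toy usage: random "cepstral" frames and a random codebook stand in for data.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, 12))
frames = rng.normal(size=(40, 12))
print(vector_quantize(frames, codebook)[:10])
```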
Authors:
Felix Freitag, UPC (Spain)
Enric Monte, UPC (Spain)
Paper number 455
Abstract:
This paper presents a speech recognition system which incorporates
predictive neural networks. The neural networks are used to predict
observation vectors of speech. The prediction error vectors are modeled
on the state level by Gaussian densities, which provide the local similarity
measure for the Viterbi algorithm during recognition. The system is
evaluated on a continuous speech phoneme recognition task. Compared
with an HMM reference system, the proposed system obtained better results
in the speech recognition experiments.
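The sketch below illustrates the kind of local score such a system computes: a stand-in linear predictor estimates the next observation, and the prediction error is evaluated under a per-state diagonal Gaussian, which could then serve as the local similarity measure in a Viterbi search. The predictor form, dimensions, and parameter values are assumptions, not the paper's configuration.

```python
# Sketch of a predictive-network local score: an (untrained, linear) predictor
# estimates frame t from frame t-1, and the prediction error is scored by a
# per-state diagonal Gaussian. All values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM = 13  # assumed feature dimension

# Stand-in predictive network for one state.
W = rng.normal(scale=0.1, size=(DIM, DIM))

# Diagonal Gaussian over that state's prediction error.
mean = np.zeros(DIM)
var = np.ones(DIM)

def local_log_score(prev_frame, frame):
    """Log-likelihood of the prediction error, usable as the local
    similarity measure inside a Viterbi search."""
    error = frame - W @ prev_frame
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (error - mean) ** 2 / var)

frames = rng.normal(size=(5, DIM))
print([round(local_log_score(frames[t - 1], frames[t]), 2) for t in range(1, 5)])
```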
Authors:
Toshiaki Fukada, ATR-ITL (Japan)
Takayoshi Yoshimura, Nagoya Institute of Technology (Japan)
Yoshinori Sagisaka, ATR-ITL (Japan)
Paper number 658
Abstract:
We propose a method for automatically generating a pronunciation dictionary
based on a pronunciation neural network that can predict plausible
pronunciations (realized pronunciations) from canonical pronunciations.
This method can generate multiple forms of realized pronunciations
using the pronunciation network. Experimental results on spontaneous
speech show that the automatically-derived pronunciation dictionary
gives consistently higher recognition rates than a conventional dictionary.
Authors:
Stephen J. Haskey, Loughborough University (U.K.)
Sekharajit Datta, Loughborough University (U.K.)
Paper number 568
Abstract:
In this paper a comparative study between One-Class-One-Network (OCON)
and Multi-Layered Perceptron (MLP) neural networks for vowel phoneme
recognition is presented. The OCON architecture, first proposed by
I.C. Jou et al., is similar in design to a conventional feed-forward
MLP, except that each class has its own dedicated sub-network containing
a single output node. Conventional MLPs usually consist of fully connected
nodes, which not only results in a large number of weighted connections
but also creates the problem of cross-class interference. Using vowel
phoneme data from the DARPA TIMIT corpus of read speech, MLP and OCON
architectures were trained, and the relative effects on recognition
and convergence rates during both intra- and inter-class adaptation
were tested. The OCON showed a 273% increase in convergence rate and
an improvement of over 12% in adapted recognition rates over the MLP.
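A minimal sketch of the architectural contrast, assuming toy sizes and untrained random weights: a monolithic fully connected MLP versus an OCON layout in which each class owns a small sub-network ending in a single output node, so classes do not share hidden units.

```python
# Sketch contrasting a monolithic MLP with an OCON layout. Sizes and random
# weights are illustrative assumptions, not the paper's configuration.
import numpy as np

rng = np.random.default_rng(0)
N_FEATS, N_HIDDEN, N_CLASSES = 26, 16, 10

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Conventional fully connected MLP: one shared hidden layer, N_CLASSES outputs.
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_FEATS))
W2 = rng.normal(scale=0.1, size=(N_CLASSES, N_HIDDEN))

def mlp_forward(x):
    return sigmoid(W2 @ sigmoid(W1 @ x))

# OCON: one small sub-network per class, each with a single output node.
subnets = [
    (rng.normal(scale=0.1, size=(4, N_FEATS)), rng.normal(scale=0.1, size=(1, 4)))
    for _ in range(N_CLASSES)
]

def ocon_forward(x):
    return np.array([sigmoid(w2 @ sigmoid(w1 @ x))[0] for w1, w2 in subnets])

x = rng.normal(size=N_FEATS)
print(mlp_forward(x).argmax(), ocon_forward(x).argmax())
```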
Authors:
John-Paul Hosom, Oregon Graduate Institute of Science and Technology (OGI) (USA)
Ronald A. Cole, Oregon Graduate Institute of Science and Technology (OGI) (USA)
Piero Cosi, Institute of Phonetics -- C. N. R. (Italy)
Paper number 613
Abstract:
This paper describes a set of experiments on neural-network training
and search techniques that, when combined, have resulted in a 54%
reduction in error on the continuous digits recognition task. The
best system had word-level accuracy of 97.52% on a test set of the
OGI 30K Numbers corpus, which contains naturally-produced continuous
digit strings recorded over telephone channels. Experiments investigated
the effects of the feature set, the amount of data used for training, the
type of context-dependent categories to be recognized, the values for
duration limits, and the type of grammar. The experiments indicate
that the grammar and duration limits had a greater effect on recognition
accuracy than the output categories, cepstral features, or a 50% increase
in the amount of training data.
Authors:
Ying Jia, Lab of Interactive Information Systems, Institute of Acoustics, Chinese Academy of Science (China)
Limin Du, Lab of Interactive Information Systems, Institute of Acoustics, Chinese Academy of Science (China)
Ziqiang Hou, Institute of Acoustics, Chinese Academy of Science (China)
Paper number 415
Abstract:
To integrate the hierarchical structure of discrimination between all
HMM states for Chinese initials and finals, we construct in this paper
Hierarchical Neural Networks (HNN), which differ from Jordan's HME
in extensions such as a more complex parameterization of the gate
and/or expert networks and dimension-reduced expert networks. With
these extensions, pre-trained simple node networks can be reused in
the hierarchical structure (HNN) and fine-tuned jointly with the
Generalized Expectation Maximization (GEM) algorithm. The proposed
HNNs were used within hybrid HMM-ANN models to estimate the posterior
probabilities of HMM states. Compared with a large monolithic neural
network, the HNN system can be trained in less time than an MLP
estimator and yields a speed-up in decoding time over conventional
systems. We have applied the proposed hybrid HMM-HNN method to a
Chinese continuous speech recognition task, achieving a promising
word error rate of 26.4%.
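A minimal sketch of the HME-style combination that the HNN builds on, assuming random linear-softmax gate and expert networks: a gate weights per-expert state posteriors. The paper's richer gate/expert parameterisation, dimension reduction, and GEM training are not shown here.

```python
# Minimal HME-style hierarchy: a softmax gate weights the posterior estimates
# of stand-in expert networks. Sizes and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_STATES, N_EXPERTS = 39, 8, 3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

gate_W = rng.normal(scale=0.1, size=(N_EXPERTS, DIM))
expert_Ws = [rng.normal(scale=0.1, size=(N_STATES, DIM)) for _ in range(N_EXPERTS)]

def hnn_posteriors(x):
    """Gate-weighted mixture of expert posteriors over HMM states."""
    g = softmax(gate_W @ x)                        # gating probabilities
    experts = [softmax(W @ x) for W in expert_Ws]  # per-expert state posteriors
    return sum(gi * pi for gi, pi in zip(g, experts))

x = rng.normal(size=DIM)
print(hnn_posteriors(x).round(3), hnn_posteriors(x).sum())
```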
Authors:
Eric Keller, University of Lausanne (Switzerland)
Paper number 937
Abstract:
Feature representations mediating between acoustic input and symbolic
representation promise to reduce learning time needed for automatic
speech signal segmentation. Experiments are reported that circumscribe
simple acoustic inputs and appropriate feature sets for neural network
training. Stable and compatible solutions for English and French were
identified.
Authors:
Nikki Mirghafori, ICSI & UC Berkeley (USA)
Nelson Morgan, ICSI & UC Berkeley (USA)
Paper number 1150
Abstract:
Multi-band automatic speech recognition is a new and exploratory area
of speech recognition which has been getting much attention in the
research community. It has been shown that multi-band ASR reduces
word error rates in noisy conditions, particularly in the case of
narrow-band noise. In this work we show that multi-band ASR can also
be used to improve the speech recognition accuracy of natural numbers
for clean speech when the multi-band (MB) information stream is used
in addition to the full-band (FB) one. We also observe that a similar combination
method significantly reduces the error rate on reverberant speech.
Finally, we analyze the error patterns of the full-band and multi-band
paradigms to understand why the combination of the two streams is effective.
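A sketch of one common way such streams can be merged, under the assumption of simple per-stream posterior estimators and equal stream weights: per-band and full-band state posteriors are averaged in the log domain before the search. The paper's actual band layout and combination rule may differ.

```python
# Illustrative multi-band / full-band stream combination. Band boundaries,
# stream weights, and the combination rule are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
N_STATES = 30

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-ins for the per-stream acoustic models.
def stream_posteriors(feats, W):
    return softmax(W @ feats)

full_W = rng.normal(scale=0.1, size=(N_STATES, 39))
band_Ws = [rng.normal(scale=0.1, size=(N_STATES, 10)) for _ in range(4)]

full_feats = rng.normal(size=39)
band_feats = [rng.normal(size=10) for _ in range(4)]

streams = [stream_posteriors(full_feats, full_W)] + [
    stream_posteriors(f, W) for f, W in zip(band_feats, band_Ws)
]
# Average the log posteriors of all streams (equal weights assumed).
combined_log_post = np.mean([np.log(p + 1e-12) for p in streams], axis=0)
print(combined_log_post.argmax())
```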
Authors:
Ednaldo B. Pizzolato, University of Essex (U.K.)
T. Jeff Reynolds, University of Essex (U.K.)
Paper number 821
Abstract:
Multinet is a connectionist architecture designed for certain difficult
multi-class pattern classification tasks. These are characterised by
very large input feature spaces, rendering a monolithic classifier
impractical. The architecture consists of a layer with at least one
primary 'detector' for each class, followed by a combining net which
estimates the posterior probabilities for all classes. Typically, each
primary detector takes only a subset of the input features as input.
Thus the architecture decomposes classification in two ways: by class
and by factoring of the input space dimensions. Multinet incorporates
the ideas of Modular Neural Networks and Ensembles. In this paper, we
investigate the use of Multinet alongside standard HMM and hybrid HMM-NN
systems that we run on the same tasks. The value and potential of the
Multinet approach are shown by detailing successive improvements to the
Multinet system, which are easily obtained because of the modularity
of the architecture.
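A minimal sketch of the Multinet layout under assumed sizes and random weights: per-class detectors that each see only a subset of the features, feeding a combining net that outputs class posteriors.

```python
# Sketch of the Multinet layout: class 'detectors' on feature subsets plus a
# combining net producing class posteriors. Subset choices, sizes and weights
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_FEATS, N_CLASSES = 60, 5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One primary detector per class, each wired to a random subset of features.
subsets = [rng.choice(N_FEATS, size=20, replace=False) for _ in range(N_CLASSES)]
detector_Ws = [rng.normal(scale=0.1, size=(1, 20)) for _ in range(N_CLASSES)]

# Combining net: maps the vector of detector scores to class posteriors.
combine_W = rng.normal(scale=0.5, size=(N_CLASSES, N_CLASSES))

def multinet_posteriors(x):
    scores = np.array([sigmoid(W @ x[idx])[0]
                       for W, idx in zip(detector_Ws, subsets)])
    return softmax(combine_W @ scores)

x = rng.normal(size=N_FEATS)
print(multinet_posteriors(x).round(3))
```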
Authors:
Tomio Takara, University of the Ryukyus (Japan)
Yasushi Iha, University of the Ryukyus (Japan)
Itaru Nagayama, University of the Ryukyus (Japan)
Paper number 1066
Abstract:
Hidden Markov models (HMMs) are widely used for automatic speech
recognition because a powerful algorithm exists for estimating the
model's parameters and because they achieve high performance. Once
the structure of the model is given, the model's parameters are obtained
automatically by feeding in training data. However, there is still an
unresolved problem with the HMM, namely how to design an optimal HMM
structure. In answer to this problem, we proposed the application of
a genetic algorithm (GA) to search for such an optimal structure, and
we showed this method to be effective for isolated word recognition.
However, the test of this method was restricted to discrete HMMs. In
this paper, we propose a new application of the GA to the continuous
HMM (CHMM), which is thought to be more effective than the discrete
HMM. We report experimental results showing the effectiveness of the
genetic algorithm in automatic speech recognition.
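The skeleton below sketches how such a GA search could be organised, with an assumed chromosome (state count and maximum skip) and a placeholder fitness where the paper would train and score a CHMM; it illustrates the search loop, not the authors' encoding.

```python
# GA skeleton for searching over HMM structures. The chromosome encoding, GA
# settings, and the fitness stand-in below are illustrative assumptions.
import random

random.seed(0)

def random_structure():
    return {"states": random.randint(3, 20), "max_skip": random.randint(1, 3)}

def fitness(structure):
    # Placeholder: in the paper this would be the recognition performance of
    # a continuous HMM trained with this topology. Here we simply prefer
    # mid-sized models so the loop has something to optimise.
    return -abs(structure["states"] - 10) - structure["max_skip"] * 0.1

def crossover(a, b):
    return {"states": a["states"], "max_skip": b["max_skip"]}

def mutate(s):
    if random.random() < 0.3:
        s["states"] = max(3, min(20, s["states"] + random.choice([-1, 1])))
    return s

population = [random_structure() for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]  # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print(max(population, key=fitness))
```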
Authors:
Dat Tran, University of Canberra (Australia)
Michael Wagner, University of Canberra (Australia)
Tu Van Le, University of Canberra (Australia)
Paper number 797
Abstract:
In vector quantisation (VQ) based speaker recognition, the minimum
overall average distortion rule is used as a criterion to assign a
given sequence of acoustic vectors to a speaker model known as a codebook.
An alternative decision rule based on fuzzy c-means clustering is proposed
in this paper. A set of membership functions associating vectors
with codebooks is defined as discriminant functions, and the maximum
overall average membership function rule is stated. The theoretical
analysis and the experimental results show that this rule can be used
in both speaker identification and speaker verification. It is more
effective than the minimum overall average distortion rule.
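A minimal sketch of the two decision rules on toy data, assuming Euclidean distortion, an FCM-style membership with fuzziness exponent m = 2, and random codebooks.

```python
# Minimum average distortion versus maximum average fuzzy membership, with
# memberships computed FCM-style from each vector's distance to every
# speaker codebook. Codebooks, m, and the toy data are assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_SPEAKERS, CODEWORDS, M = 12, 3, 8, 2.0

codebooks = [rng.normal(loc=s, size=(CODEWORDS, DIM)) for s in range(N_SPEAKERS)]
test_vectors = rng.normal(loc=1, size=(50, DIM))  # toy vectors near speaker 1

def distortions(vec):
    """Distance from vec to the nearest codeword of each speaker codebook."""
    return np.array([np.linalg.norm(cb - vec, axis=1).min() for cb in codebooks])

def memberships(vec):
    d = distortions(vec) + 1e-12
    ratios = (d[:, None] / d[None, :]) ** (2.0 / (M - 1.0))
    return 1.0 / ratios.sum(axis=1)   # memberships over speakers, sum to 1

avg_distortion = np.mean([distortions(v) for v in test_vectors], axis=0)
avg_membership = np.mean([memberships(v) for v in test_vectors], axis=0)

print("min-distortion decision:", avg_distortion.argmin())
print("max-membership decision:", avg_membership.argmax())
```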
Authors:
Dat Tran, University of Canberra (Australia)
Tu Van Le, University of Canberra (Australia)
Michael Wagner, University of Canberra (Australia)
Paper number 798
Abstract:
A fuzzy clustering based modification of Gaussian mixture models (GMMs)
for speaker recognition is proposed. In this modification, fuzzy mixture
weights are introduced by redefining the distances used in the fuzzy
c-means (FCM) functionals. Their reestimation formulas are derived by
minimising the FCM functionals. The experimental results show that
fuzzy GMMs can be used in speaker recognition and are more effective
than conventional GMMs in tests on the TI46 database.
Authors:
Chai Wutiwiwatchai, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Somchai Jitapunkul, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Visarut Ahkuputra, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Ekkarit Maneenoi, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Sudaporn Luksaneeyanawin, Department of Linguistics, Chulalongkorn University (Thailand)
Paper number 349
Abstract:
In this research, a new Fuzzy-Neural Network strategy was proposed
for Thai numeral speech recognition. Instead of using fuzzy membership
inputs with class-membership desired outputs during training, as proposed
in several previous studies, we used fuzzy membership inputs with plain
binary desired outputs. This can reduce confusion during training,
decrease the training time and also improve the recognition ability.
The system was tested on Thai ten-numeral (0-9) speech recognition.
The error rate on the speaker-independent test was 9.2%, compared with
a 14% error rate for the conventional neural network system, while the
error rate of the system using class-membership desired outputs was
considerably higher because of such training confusion.
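A minimal sketch of the training setup described here, on toy data: inputs are fuzzified with an assumed triangular membership over a few centres, targets are plain binary one-hot vectors, and a small MLP is trained by ordinary backpropagation. All sizes, centres, and data are assumptions.

```python
# Fuzzy membership inputs with binary one-hot desired outputs, trained with
# plain backpropagation (MSE loss). Toy data and sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_FEATS, N_CLASSES, N_SAMPLES = 8, 10, 200
centres = np.array([-1.0, 0.0, 1.0])          # assumed membership centres

def fuzzify(x, width=1.0):
    """Triangular membership of each feature value to each centre."""
    mu = np.maximum(0.0, 1.0 - np.abs(x[:, None] - centres) / width)
    return mu.reshape(-1)                      # N_FEATS * len(centres) inputs

X = rng.normal(size=(N_SAMPLES, N_FEATS))
y = rng.integers(0, N_CLASSES, size=N_SAMPLES)
inputs = np.array([fuzzify(x) for x in X])
targets = np.eye(N_CLASSES)[y]                 # binary desired outputs

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(scale=0.1, size=(inputs.shape[1], 20))
W2 = rng.normal(scale=0.1, size=(20, N_CLASSES))
lr = 0.1
for epoch in range(50):
    h = sigmoid(inputs @ W1)
    out = sigmoid(h @ W2)
    err = out - targets
    dW2 = h.T @ (err * out * (1 - out)) / N_SAMPLES
    dh = (err * out * (1 - out)) @ W2.T
    dW1 = inputs.T @ (dh * h * (1 - h)) / N_SAMPLES
    W1 -= lr * dW1
    W2 -= lr * dW2

print("training accuracy:", (out.argmax(axis=1) == y).mean())
```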
Authors:
Chai Wutiwiwatchai, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Somchai Jitapunkul, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Visarut Ahkuputra, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Ekkarit Maneenoi, Department of Electrical Engineering, Chulalongkorn University (Thailand)
Sudaporn Luksaneeyanawin, Department of Linguistics, Chulalongkorn University (Thailand)
Paper number 350
Abstract:
In this research, a Fuzzy-Neural Network (fuzzy-NN) model was proposed
for speaker-independent Thai polysyllabic word recognition. Various
fuzzy membership functions based on linguistic properties were used
to convert the crisp features extracted from the input speech into
fuzzy membership values. These membership values were arranged as a
new input vector for a Multilayer Perceptron (MLP) neural network,
and binary desired outputs were used during training. Seventy Thai
words were used for system evaluation: the ten numerals plus
single-syllable, double-syllable and triple-syllable words, with 20
words in each of the latter three groups. In order to improve recognition
accuracy, the detected number of syllables and tonal level were used
to preclassify the speech. The Pi fuzzy membership function provided
the best recognition accuracy among the functions tested, which also
included the trapezoidal and triangular functions. Under optimal
conditions, the recognition error rates were 5.6% on the speaker-dependent
test and 6.7% on the speaker-independent test, which were respectively
3.3% and 3.4% lower than those of the conventional Neural Network system.
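For reference, the sketch below implements the three membership-function shapes compared here; the Pi function follows Zadeh's standard S-function construction, and the centre/width parameters are illustrative since the paper's values are not given in the abstract.

```python
# Pi, trapezoidal, and triangular membership functions. The parameters below
# are illustrative assumptions, not the paper's settings.
import numpy as np

def s_function(x, a, c):
    b = (a + c) / 2.0
    return np.where(x <= a, 0.0,
           np.where(x <= b, 2.0 * ((x - a) / (c - a)) ** 2,
           np.where(x <= c, 1.0 - 2.0 * ((x - c) / (c - a)) ** 2, 1.0)))

def pi_membership(x, centre, width):
    left = s_function(x, centre - width, centre)
    right = 1.0 - s_function(x, centre, centre + width)
    return np.where(x <= centre, left, right)

def trapezoidal_membership(x, a, b, c, d):
    return np.clip(np.minimum((x - a) / (b - a), (d - x) / (d - c)), 0.0, 1.0)

def triangular_membership(x, a, b, c):
    return np.clip(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0, 1.0)

x = np.linspace(-2, 2, 9)
print(pi_membership(x, 0.0, 1.5).round(2))
print(trapezoidal_membership(x, -1.5, -0.5, 0.5, 1.5).round(2))
print(triangular_membership(x, -1.5, 0.0, 1.5).round(2))
```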