Tomio Takara, University of the Ryukyus (Japan)
Kazuya Higa, University of the Ryukyus (Japan)
Itaru Nagayama, University of the Ryukyus (Japan)
Hidden Markov models (HMMs) are widely used for automatic speech recognition because a powerful algorithm exists for estimating their parameters and because they achieve high performance. Once the structure of a model is given, its parameters are obtained automatically from training data. There is, however, no effective design method that leads to an optimal HMM structure. In this paper, we propose a new application of a genetic algorithm to search for such an optimal structure. In this method, left-right structures are adopted for the HMMs and the likelihood is used as the fitness function of the genetic algorithm. We report experimental results showing the effectiveness of the genetic algorithm in automatic speech recognition.
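As an illustration of the proposed search, the following is a minimal Python sketch of a genetic algorithm over left-right HMM structures. The bit-string encoding, population parameters, and the toy fitness function are all assumptions for illustration; in the paper's method the fitness would be the likelihood of the training data under the re-estimated model.

```python
# Hypothetical GA sketch: each chromosome is a bit string encoding which
# extra (skip) transitions a left-right HMM allows. In the real method the
# fitness would be the training likelihood after Baum-Welch re-estimation;
# here a toy stand-in keeps the example self-contained.
import random

N_STATES = 8          # states in the left-right HMM (assumed)
POP_SIZE = 20
GENERATIONS = 30

def fitness(chromosome):
    # Stand-in for: build the HMM with these transitions, run Baum-Welch,
    # return the log-likelihood of the training data.
    return sum(chromosome) - 0.5 * abs(sum(chromosome) - N_STATES)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(c, rate=0.05):
    return [bit ^ (random.random() < rate) for bit in c]

population = [[random.randint(0, 1) for _ in range(2 * N_STATES)]
              for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:POP_SIZE // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

print("best structure bits:", max(population, key=fitness))
```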
Satoshi Takahashi, NTT HI Labs (Japan)
Kiyoaki Aikawa, NTT HI Labs (Japan)
Shigeki Sagayama, NTT HI Labs (Japan)
This paper proposes a new type of acoustic model called the discrete mixture HMM (DMHMM). As large-scale speech databases have been constructed for speaker-independent HMMs, continuous mixture HMMs (CMHMMs) need ever more mixture components to represent complex distributions, which leads to a high computational cost for calculating output probabilities. The DMHMM represents the feature parameter space using mixtures of multivariate distributions, in the same way as the diagonal-covariance CMHMM. However, instead of using Gaussian mixtures to represent the feature distribution in each dimension, the DMHMM uses mixtures of discrete distributions based on scalar quantization (SQ). Since a discrete distribution has more degrees of freedom of representation, the DMHMM can represent feature distributions efficiently with fewer mixture components. In isolated-word recognition experiments on telephone speech, we found that the DMHMM outperforms the conventional CMHMM.
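To make the emission model concrete, here is a hedged Python sketch of a DMHMM-style state output probability, assuming one scalar quantizer per dimension and per-dimension mixtures of discrete distributions multiplied across dimensions; the sizes and the uniform quantizer are illustrative choices, not the paper's settings.

```python
# Sketch of a DMHMM state emission (assumed form): each dimension d has M
# mixtures of discrete distributions over Q scalar-quantization levels; the
# state output probability is the product over dimensions.
import numpy as np

D, M, Q = 13, 4, 16                               # dims, mixtures, SQ levels
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(M), size=D)       # (D, M) mixture weights
tables = rng.dirichlet(np.ones(Q), size=(D, M))   # (D, M, Q) discrete pmfs
edges = np.linspace(-3.0, 3.0, Q - 1)             # uniform scalar quantizer

def log_output_prob(x):
    codes = np.digitize(x, edges)                 # quantize each dimension
    picked = tables[np.arange(D), :, codes]       # (D, M) table entries
    return np.log((weights * picked).sum(axis=1)).sum()

print(log_output_prob(rng.standard_normal(D)))
```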
Luciano Fissore, CSELT (Italy)
Franco Ravera, CSELT (Italy)
Pietro Laface, DAI - Politecnico di Torino (Italy)
Isolated-word speech recognizers with fixed vocabularies are often used to provide vocal services over the telephone line. This paper illustrates a simple post-processing approach that allows the hypotheses produced by an HMM recognizer to be rescored, taking into account the global temporal structure of the pronounced words. Our approach does not rely directly on state or word duration modeling. Instead, it models the global time variations of the spectral features of each word and their correlation in time: two important perceptual cues that are only partially exploited by standard HMMs. Results are presented for speaker-independent isolated-word systems with vocabularies of different size and complexity. We show that the recognition rate improves not only for small-vocabulary recognition systems such as isolated digits, but also for a 475 city-name vocabulary used in a vocal service that provides information about the main railway connections.
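A minimal sketch of the rescoring step, under the assumption that the post-processor produces a per-hypothesis temporal-structure score that is log-linearly interpolated with the HMM score; the weight and the toy scorer are illustrative, not the paper's model.

```python
# Hedged sketch: re-rank N-best hypotheses by combining the HMM log score
# with a word-level temporal-structure score (a placeholder here).
def rescore(nbest, temporal_score, alpha=0.7):
    """nbest: list of (word, hmm_log_score) pairs; returns a re-ranked list."""
    return sorted(((w, alpha * s + (1.0 - alpha) * temporal_score(w))
                   for w, s in nbest),
                  key=lambda pair: pair[1], reverse=True)

# Toy usage: fixed scores stand in for the real temporal-structure model.
toy = {"torino": -1.0, "taranto": -4.0}
print(rescore([("taranto", -10.0), ("torino", -10.5)], lambda w: toy[w]))
```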
Zhihong Hu, CSLU, OGI (U.S.A.)
Etienne Barnard, CSLU, OGI (U.S.A.)
Dynamic modeling of speech is potentially a major improvement on hidden Markov models (HMMs). In one approach, trajectory models are used to model the dynamics of the spectrum and serve as the basis for classification. Although some improvement has been achieved in this way, one would hope for more substantial gains, given that the independence assumption is removed. One reason this has not been achieved may be that the trajectory models are based on cepstral coefficients; we show that these tracks contain spurious oscillations, which suggests that such trajectory features may have high within-class variance. We introduce a measure for evaluating the smoothness of trajectory-based features, which provides a method for selecting the best of a set of similar features. By this measure, formant trajectories prove to be significantly smoother than trajectories of mel-scale cepstral coefficients (MFCCs), but this does not translate directly into improved performance.
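The paper's exact smoothness measure is not reproduced here; the sketch below uses one plausible stand-in, the mean squared second difference of a track normalized by its variance, so that lower values indicate smoother trajectories.

```python
# Assumed smoothness measure (not necessarily the authors' definition):
# variance-normalized mean squared second difference of a feature track.
import numpy as np

def roughness(track):
    second_diff = np.diff(track, n=2)             # discrete curvature
    return np.mean(second_diff ** 2) / np.var(track)

t = np.linspace(0.0, 1.0, 100)
smooth = np.sin(2 * np.pi * t)                    # formant-like track
wiggly = smooth + 0.2 * np.random.default_rng(1).standard_normal(t.size)
print(roughness(smooth), roughness(wiggly))       # the wiggly track scores higher
```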
Srinivasan Umesh, City University Hunter College, New York (U.S.A.)
L. Cohen, City University Hunter College, New York (U.S.A.)
D. Nelson, City University Hunter College, New York (U.S.A.)
Recently, we proposed the use of scale-cepstral coefficients as features for speech recognition. We have developed a corresponding frequency-warping function such that, in the warped domain, the formant envelopes of different speakers are approximately translated versions of one another for any given vowel. These methods were motivated by the goal of speaker normalization. In this paper, we point out very interesting parallels between the various steps in computing the scale-cepstrum and those used in computing features based on physiological models of the auditory system or on psychoacoustic experiments. A better understanding of the need for these signal-processing steps may therefore aid the development of more robust recognizers.
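The sketch below illustrates the normalization idea rather than the paper's actual warping function: with a logarithmic warp, a uniform frequency scaling between speakers becomes a translation, and the magnitude of a Fourier transform along the warped axis is invariant to that translation. The log warp, axis handling, and coefficient count are all assumptions.

```python
# Hedged sketch: resample a log-magnitude spectrum onto a warped frequency
# axis where inter-speaker scaling becomes translation, then take |DFT| so
# the resulting features are (approximately) translation-invariant.
import numpy as np

def warped_invariant_features(log_spectrum, freqs, n_coeffs=12):
    # a log warp turns a frequency scaling f -> a*f into a shift by log(a)
    warped_axis = np.linspace(np.log(freqs[1]), np.log(freqs[-1]), len(freqs))
    resampled = np.interp(warped_axis, np.log(freqs[1:]), log_spectrum[1:])
    return np.abs(np.fft.rfft(resampled))[:n_coeffs]

freqs = np.linspace(0.0, 4000.0, 257)
envelope = np.exp(-((freqs - 700.0) / 300.0) ** 2) + 0.01  # toy formant peak
print(warped_invariant_features(np.log(envelope), freqs))
```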
Su-Lin Wu, ICSI / UC Berkeley (U.S.A.)
Michael L. Shire, ICSI / UC Berkeley (U.S.A.)
Steven Greenberg, ICSI / UC Berkeley (U.S.A.)
Nelson Morgan, ICSI / UC Berkeley (U.S.A.)
In this paper we examine the proposition that knowledge of the timing of syllabic onsets may be useful in improving the performance of speech recognition systems. A method of estimating the location of syllable onsets from the analysis of energy trajectories in critical-band channels has been developed, and a syllable-based decoder has been designed and implemented that incorporates this onset information into the speech recognition process. For a small continuous speech recognition task, the addition of artificial syllabic onset information (derived from advance knowledge of the word transcriptions) lowers the word error rate by 38%. Incorporating acoustically derived syllabic onset information reduces the word error rate by 10% on the same task. The latter experiment has highlighted representational issues in coordinating acoustic and lexical syllabifications, a topic we are beginning to explore.
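A hedged sketch of the onset estimation idea: smooth the per-channel energy, keep the positive temporal derivative, and mark frames where the summed rise across channels is large. The smoothing window, threshold, and peak-picking rule are illustrative assumptions, not the implemented system.

```python
# Sketch: estimate syllable-onset frames from rises in critical-band
# energy trajectories (details assumed for illustration).
import numpy as np

def syllable_onsets(band_energy, threshold=0.5, win=5):
    """band_energy: (n_channels, n_frames) array of channel energies."""
    kernel = np.ones(win) / win
    smoothed = np.apply_along_axis(
        lambda e: np.convolve(e, kernel, mode='same'), 1, band_energy)
    rise = np.clip(np.diff(smoothed, axis=1), 0.0, None).sum(axis=0)
    return np.where(rise > threshold * rise.max())[0]

rng = np.random.default_rng(3)
energy = rng.random((20, 200))
energy[:, 60:70] += 2.0                  # a sharp energy rise, as at an onset
print(syllable_onsets(energy))
```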
Philipp Schmid, Oregon Graduate Inst. (U.S.A.)
Etienne Barnard, Oregon Graduate Inst. (U.S.A.)
We demonstrate the use of explicit formant features for vowel and semivowel classification. The formant trajectories are approximated by either three line segments or Legendre polynomials. Together with formant amplitude, formant bandwidth, pitch, and segment duration, these formant features form a compact representation that performs as well (71.8%) as a cepstral-based feature representation (71.6%). Combining the formant and cepstral features improves the accuracy further, to 73.4%. Additionally, we outline future experiments using our robust N-best formant tracker.
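As a concrete instance of the second approximation named above, the sketch fits a low-order Legendre polynomial to a toy formant track and keeps the coefficients as features; the polynomial order and the synthetic track are assumptions.

```python
# Sketch: approximate a formant trajectory with Legendre polynomial
# coefficients over a segment normalized to [-1, 1].
import numpy as np
from numpy.polynomial import legendre

frames = np.linspace(-1.0, 1.0, 40)                  # normalized segment time
f2 = 1500 + 300 * frames + 100 * frames ** 2         # toy F2 track in Hz
f2 += 30 * np.random.default_rng(4).standard_normal(frames.size)

coeffs = legendre.legfit(frames, f2, deg=3)          # 4 coefficients as features
approx = legendre.legval(frames, coeffs)
print(coeffs, np.sqrt(np.mean((approx - f2) ** 2)))  # features and fit RMSE
```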
Jayadev Billa, EE Dept. University of Pittsburgh (U.S.A.)
In this paper we propose a new approach to the modeling of speech based on cues from the peripheral auditory system. Our approach attempts to incorporate the dynamic adaptation of biological auditory systems to varying sounds by formulating a simple dual-processing strategy that applies different processing to unvoiced and voiced speech. Preliminary studies show that this approach possesses significant noise robustness.
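A minimal sketch of the dual-processing idea, with a crude energy/zero-crossing voicing test standing in for whatever decision the paper uses; both processing branches are placeholders.

```python
# Hedged sketch: route each frame to a voiced or unvoiced branch.
import numpy as np

def route_frame(frame):
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # crossings per sample
    energy = np.mean(frame ** 2)
    # Placeholder decision; the paper motivates the split by the dynamic
    # adaptation of the peripheral auditory system.
    return "voiced" if (energy > 1e-3 and zcr < 0.25) else "unvoiced"

sr = 8000
tone = np.sin(2 * np.pi * 120 * np.arange(sr // 50) / sr)  # 20 ms voiced-like
print(route_frame(tone))
```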
Régine André-Obrecht, IRIT (France)
Bruno Jacob, IRIT (France)
Our work deals with the classical problem of merging heterogeneous and asynchronous parameters. It is well known that lip reading improves speech recognition scores, especially in noisy conditions, so we study more precisely the modeling of acoustic and labial parameters and propose two automatic speech recognition systems: in the first, direct identification is performed using a classical HMM approach, assuming no correlation between visual and acoustic parameters; in the second, two correlated models, a master HMM and a slave HMM, process the labial and acoustic observations respectively. To assess each approach, we use a segmental pre-processing. Our task is the recognition of spelled French letters in clear and noisy (cocktail party) environments. Whatever the approach and condition, the introduction of labial features improves performance, but the difference between the two models is not large enough to establish a preference.
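For the first (direct identification) system, one plausible reading is that each frame's acoustic and labial parameter vectors are simply stacked into a single observation for a classical HMM; the sketch below shows only that stacking, with shapes assumed, and says nothing about the master/slave variant.

```python
# Sketch of a direct-identification front end: concatenate time-aligned
# acoustic and labial parameter vectors into one observation stream.
import numpy as np

def fuse_direct(acoustic, labial):
    """acoustic: (T, Da); labial: (T, Dl) -> (T, Da + Dl) observations."""
    assert acoustic.shape[0] == labial.shape[0], "streams must be time-aligned"
    return np.hstack([acoustic, labial])

obs = fuse_direct(np.zeros((100, 12)), np.zeros((100, 5)))
print(obs.shape)
```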
Thierry Soulas, France Telecom, CNET (France)
Chafic Mokbel, France Telecom, CNET (France)
Denis Jouvet, France Telecom, CNET (France)
Jean Monné, France Telecom, CNET (France)
In this work, environment adaptation is studied in order to transform PSN speaker-independent isolated-word HMMs to the GSM environment. LMR transformations associated with groups of HMM densities are used to adapt the densities; both the mean vectors and the covariance matrices are adapted. We show that a small amount of GSM data is sufficient to transform the PSN HMMs to match the GSM environment and to achieve performance equivalent to that of HMMs trained on a large amount of GSM data. The number of groups of Gaussian densities seems to have little influence on the results; however, the minimum number of groups depends on the vocabulary size. Finally, this technique is compared with Bayesian adaptation, and the results show that similar performance can be obtained with both methods.
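The adaptation step can be sketched as an affine transformation shared by a group of Gaussian densities, applied to both means and covariances; estimating the transformation from GSM adaptation data is not shown, and the shapes and values below are illustrative.

```python
# Sketch of group-wise linear-regression adaptation: each group g has an
# assumed affine map (A_g, b_g); means become A mu + b, covariances A S A^T.
import numpy as np

def adapt_group(means, covs, A, b):
    """means: (N, D); covs: (N, D, D); A: (D, D); b: (D,)."""
    new_means = means @ A.T + b
    new_covs = np.einsum('ij,njk,lk->nil', A, covs, A)
    return new_means, new_covs

D = 3
A, b = 1.1 * np.eye(D), np.full(D, 0.2)
m, C = adapt_group(np.zeros((5, D)), np.stack([np.eye(D)] * 5), A, b)
print(m[0], C[0])
```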
Li Deng, University of Waterloo (Canada)
An outline and general design of an integrated-multilingual speech recognizer is presented, focusing on its key novelty of cross-language portability. This recognizer extends the one described in Deng and Sun (1994) in that the overlapping features designed originally for American English are improved, generalized, and need only a slight expansion to cover Mandarin/Cantonese Chinese and Canadian French. It also enhances the recognizer of Deng and Sameti (1996) in that the object of dynamic modeling is moved from the observable acoustic domain to the hidden production-affiliated variables defined in the task-dynamic model of speech production (Saltzman and Munhall, 1989). Major components of the recognizer and the related training and recognition algorithms are described.
Stephen A. Zahorian, Old Dominion University (U.S.A.)
Peter L. Silsbee, Old Dominion University (U.S.A.)
Xihong Wang, Old Dominion University (U.S.A.)
This paper presents methods and experimental results for phonetic classification using 39 phone classes and the NIST-recommended training and test sets for NTIMIT and TIMIT. Spectral/temporal features that represent the smoothed trajectory of FFT-derived speech spectra over 300-ms intervals are used for the analysis. Classification tests are made with both a binary-pair partitioned (BPP) neural network system (one neural network for each of the 741 pairs of phones) and a single large neural network. Classification accuracy is very similar for the two types of networks, but the BPP method has the advantage of much less training time. The best results obtained (77% for TIMIT and 67.4% for NTIMIT) compare favorably with the best results reported in the literature for this task.
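A sketch of the binary-pair partitioned scheme: one small classifier per pair of phone classes, with each pairwise decision casting a vote (39 classes give the 741 pairs mentioned above). Logistic stubs stand in for the neural networks, and the voting rule is an assumed combination scheme.

```python
# Hedged BPP sketch: pairwise classifiers vote; the most-voted class wins.
import itertools
import numpy as np

def bpp_classify(x, pair_models, n_classes):
    votes = np.zeros(n_classes, dtype=int)
    for (a, b), model in pair_models.items():
        votes[a if model(x) >= 0.5 else b] += 1
    return int(np.argmax(votes))

# Toy setup: 4 classes -> 6 pairwise "networks" (39 classes -> 741 pairs).
rng = np.random.default_rng(5)
pair_models = {p: (lambda x, w=rng.standard_normal(8): 1 / (1 + np.exp(-x @ w)))
               for p in itertools.combinations(range(4), 2)}
print(bpp_classify(rng.standard_normal(8), pair_models, 4))
```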