Kazuyo Tanaka, Electrotech. Lab. (Japan)
Hiroaki Kojima, Electrotech. Lab. (Japan)
Feature extraction plays a substantial role in automatic speech recognition systems. In this paper, a method is proposed to extract time-varying acoustic features that are effective for speech recognition. The issue is addressed from two aspects: speech power spectrum enhancement, and discriminative time-varying feature extraction that employs subphonetic units, called demiphonemes, to distinguish non-steady labels from steady ones. We confirm the method's potential by applying it to spoken word recognition. The results indicate that recognition scores are improved by the proposed features compared with ordinary features, such as the delta-mel-cepstra provided by a well-known software tool.
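As a point of reference for the baseline mentioned above, the following is a minimal sketch of the standard regression-based delta computation used for delta-mel-cepstra (the paper's proposed demiphoneme-based features are not specified here); the window half-width K and the edge padding are assumptions, not details from the paper.

```python
import numpy as np

def delta_features(cep, K=2):
    """Standard regression-based delta coefficients over +/-K frames,
    as used for the delta-mel-cepstra baseline in common toolkits.
    cep: (T, D) array of mel-cepstral vectors."""
    T, _ = cep.shape
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    delta = np.zeros_like(cep)
    for k in range(1, K + 1):
        delta += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return delta / denom
```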
Irina Illina, CRIN/CNRS, INRIA-Lor. (France)
Yifan Gong, Texas Instruments (U.S.A.)
In this paper, we study the topology of the Hidden Markov Model (HMM) used in speech recognition. Our main contribution is the introduction of the notion of the trajectory folding phenomenon of HMMs. In complex phonetic contexts and under speaker variability, this phenomenon degrades the discriminability of HMMs. The goal of this paper is to give some explanation and experimental evidence supporting the existence of this phenomenon. Systems that eliminate trajectory folding, partially or entirely, are HMMs with a special topology, called Trajectory Mixture HMMs (TMHMM), and the recently proposed Mixture Stochastic Trajectory Model (MSTM). HMM, TMHMM and MSTM have been tested on a 1011-word vocabulary, speaker-dependent and multi-speaker continuous French speech recognition task. With a similar number of model parameters, TMHMM and MSTM cut down the error rate produced by the HMM, which confirms our hypothesis.
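To make the folding phenomenon concrete, here is a toy numerical illustration (ours, not the authors' formulation): a standard HMM with per-state mixture densities can score highly a path that switches between two training trajectories mid-unit, whereas a mixture over whole trajectories, in the spirit of TMHMM/MSTM, cannot. All means and variances below are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# Two training "trajectories" (state-mean sequences) for the same unit.
mean_a = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
mean_b = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
sigma = 0.3

# A "folded" observation: it follows trajectory A, then jumps to B half-way.
x = np.array([0.0, 0.5, 1.0, 0.5, 0.0])

# Standard HMM: each frame is scored by a per-state mixture, so the model
# can hop between components from frame to frame (trajectory folding).
pooled = np.sum(np.log(0.5 * norm.pdf(x, mean_a, sigma)
                       + 0.5 * norm.pdf(x, mean_b, sigma)))

# Trajectory mixture (TMHMM/MSTM spirit): the mixture is over whole paths,
# so an observation must stay on one trajectory to score well.
traj = np.logaddexp(np.log(0.5) + np.sum(norm.logpdf(x, mean_a, sigma)),
                    np.log(0.5) + np.sum(norm.logpdf(x, mean_b, sigma)))

print(pooled, traj)  # pooled >> traj: the folded path fools the pooled model
```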
Wendy J. Holmes, DRA Malvern (U.K.)
Martin J. Russell, DRA Malvern (U.K.)
This paper describes investigations into the use of linear dynamic segmental hidden Markov models (SHMMs) for modelling speech feature-vector trajectories and their associated variability. These models use linear trajectories to describe how features change over time, and distinguish between extra-segmental variability of different trajectories and intra-segmental variability of individual observations around any one trajectory. Analyses of mel cepstrum features have indicated that a linear trajectory is a reasonable approximation when using models with three states per phone. Good recognition performance has been demonstrated with linear SHMMs. This performance is, however, dependent on the model initialisation and training strategy, and on representing the distributions accurately according to the model assumptions.
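A minimal sketch of the core fitting step, under our own simplifying assumptions (one linear trajectory per state, diagonal covariance): fit a line to each segment by least squares and keep the residual variance as the intra-segmental term; the extra-segmental term is then the spread of the fitted slopes and midpoints across segments.

```python
import numpy as np

def fit_linear_trajectory(seg):
    """Fit a linear trajectory mu_t = m*t + c to one segment of feature
    vectors seg (T, D); return the slope, the midpoint value and the
    intra-segmental residual variance around the fitted trajectory."""
    T, _ = seg.shape
    t = np.linspace(-0.5, 0.5, T)          # centred, duration-normalised time
    A = np.column_stack([t, np.ones(T)])   # design matrix [t, 1]
    coef, *_ = np.linalg.lstsq(A, seg, rcond=None)
    resid = seg - A @ coef
    return coef[0], coef[1], resid.var(axis=0)

# Extra-segmental variability: the spread of (slope, midpoint) pairs across
# many segments of the same state; intra-segmental: the residual variance.
```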
Toshiaki Fukada, ATR (Japan)
Yoshinori Sagisaka, ATR (Japan)
Kuldip K. Paliwal, ATR (Japan)
In this paper, we propose parameter estimation techniques for mixture density polynomial segment models (henceforth MDPSM), whose trajectories are specified with an arbitrary regression order. MDPSM parameters can be trained in one of three different ways: (1) segment clustering, (2) expectation maximization (EM) training of mean trajectories, or (3) EM training of mean and variance trajectories. These parameter estimation methods were evaluated in TIMIT vowel classification experiments. The experimental results showed that modeling both the mean and variance trajectories is consistently superior to modeling only the mean trajectory. We also found that modeling both trajectories results in significant improvements over the conventional HMM.
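The following sketch shows the least-squares estimation of a single polynomial mean trajectory with arbitrary regression order, the building block behind the three training schemes (the clustering and EM loops, and the variance-trajectory estimation, are omitted); the normalisation of time to [0, 1] is our assumption.

```python
import numpy as np

def fit_polynomial_segment(seg, order=2):
    """Least-squares estimate of a polynomial mean trajectory for one
    segment. seg: (T, D) observations; returns the (order+1, D) coefficient
    matrix B such that the mean at normalised time t is [1, t, ..., t^R] @ B."""
    T, _ = seg.shape
    t = np.linspace(0.0, 1.0, T)
    Z = np.vander(t, order + 1, increasing=True)   # rows [1, t, t^2, ...]
    B, *_ = np.linalg.lstsq(Z, seg, rcond=None)
    return B
```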
Jan Verhasselt, RUG (Belgium)
Jean-Pierre Martens, RUG (Belgium)
Irina Illina, CRIN/CNRS, INRIA-Lorraine, Nancy (France)
Jean-Paul Haton, CRIN/CNRS, INRIA-Lorraine, Nancy (France)
Yifan Gong, PSL, TI (U.S.A.)
In segment-based recognizers, variable-length speech segments are mapped to basic speech units (phones, diphones, ...). In this paper, we address the acoustic modeling of these basic units in the framework of segmental posterior distribution models (SPDM). The joint posterior probability of a unit sequence u and a segmentation s, Pr(u, s | x), can be written as the product of the segmentation probability Pr(s | x) and the unit classification probability Pr(u | s, x), where x is the sequence of acoustic observation parameter vectors. In particular, we point out the role of the segmentation probability and demonstrate that it does improve recognition accuracy. We present evidence for this in two different tasks (speaker-dependent continuous word recognition in French and speaker-independent phone recognition in American English) in combination with two different unit classification models.
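To illustrate how the two probabilities combine, here is a brute-force sketch (ours, exponential in the number of frames; a real decoder would use dynamic programming or beam search). The stubs seg_logprob and unit_logprob stand in for the segmentation model and the unit classifier; both names are hypothetical.

```python
import itertools, math

def best_hypothesis(n_frames, seg_logprob, unit_logprob, units):
    """Search over segmentations s and unit labels u for the maximum of
    log Pr(u, s | x) = log Pr(s | x) + sum_i log Pr(u_i | s, x)."""
    best_score, best_seg, best_units = -math.inf, None, None
    for k in range(1, n_frames + 1):                   # number of segments
        for cuts in itertools.combinations(range(1, n_frames), k - 1):
            bounds = (0,) + cuts + (n_frames,)
            score, labels = seg_logprob(bounds), []
            for a, b in zip(bounds, bounds[1:]):       # best unit per segment
                u = max(units, key=lambda v: unit_logprob(v, a, b))
                score += unit_logprob(u, a, b)
                labels.append(u)
            if score > best_score:
                best_score, best_seg, best_units = score, bounds, tuple(labels)
    return best_score, best_seg, best_units
```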
Ashvin Kannan, Boston University (U.S.A.)
Mari Ostendorf, Boston University (U.S.A.)
Segment models are a generalization of HMMs that can represent feature dynamics and/or correlation in time. In this work we develop the theory of Bayesian and maximum-likelihood adaptation for a segment model characterized by a polynomial mean trajectory. We show how adaptation parameters can be shared and adaptation detail can be controlled at run-time based on the amount of adaptation data available. Results on the Switchboard corpus show error reductions for unsupervised transcription mode adaptation and supervised batch mode adaptation.
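A minimal sketch of one way to control adaptation detail by the amount of data, assuming the segment mean is a polynomial trajectory matrix B (the relevance factor tau and the simple prior/ML interpolation are our assumptions, not the paper's exact Bayesian formulation):

```python
import numpy as np

def map_adapt_trajectory(B_prior, segments, order=2, tau=10.0):
    """Interpolate a prior polynomial trajectory matrix with the ML estimate
    from adaptation segments, weighted by the adaptation frame count."""
    Z_all, X_all = [], []
    for seg in segments:                   # each seg: (T, D) observations
        t = np.linspace(0.0, 1.0, len(seg))
        Z_all.append(np.vander(t, order + 1, increasing=True))
        X_all.append(seg)
    Z, X = np.vstack(Z_all), np.vstack(X_all)
    B_ml, *_ = np.linalg.lstsq(Z, X, rcond=None)
    w = len(X) / (len(X) + tau)            # more data -> trust the ML estimate
    return (1.0 - w) * B_prior + w * B_ml
```

With little adaptation data the estimate stays close to the prior; with much data it approaches the ML solution, which is the qualitative behaviour Bayesian adaptation provides.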
Chengalv Rathinavelu, University of Waterloo (Canada)
Li Deng, University of Waterloo (Canada)
In this paper, we report our recent work on applications of the MAP approach to estimating the time-varying polynomial Gaussian mean functions in the nonstationary-state or trended HMM. Assuming uncorrelatedness among the polynomial coefficients in the trended HMM, we have obtained analytical results for the MAP estimates of the time-varying mean and precision parameters. We have implemented a speech recognizer based on these results in speaker adaptation experiments using TI46 corpora. Experimental results show that the trended HMM always outperforms the standard, stationary-state HMM and that adaptation of polynomial coefficients only is better than adapting both polynomial coefficients and precision matrices when fewer than four adaptation tokens are used.
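For concreteness, a sketch of the emission density that distinguishes the trended HMM from the stationary-state HMM: the state's Gaussian mean is a polynomial in the state-sojourn time t rather than a constant (diagonal precision assumed; variable names are ours).

```python
import numpy as np

def trended_state_loglik(x, b, prec, t):
    """Log-likelihood of frame x under a trended-HMM state at sojourn time t.
    b: list of (D,) polynomial coefficient vectors; prec: (D,) diagonal
    precision. The time-varying mean is mu_t = sum_i b[i] * t**i."""
    mu = sum(bi * t**i for i, bi in enumerate(b))
    return 0.5 * np.sum(np.log(prec / (2 * np.pi)) - prec * (x - mu) ** 2)
```

Adapting only the coefficients b while keeping prec fixed corresponds to the better-performing option reported above when adaptation tokens are scarce.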
Kyuwoong Hwang, ETRI, Taejon (Korea)
In this paper, we suggest a method to optimize the vocabulary for a given task using a perplexity criterion. The optimization allows us to reduce the size of the vocabulary at the same perplexity as the original word-based vocabulary, or to reduce the perplexity at the same vocabulary size. This new approach is an alternative to a phoneme n-gram language model in the speech recognition search stage. We show the convergence of our approach on a Korean training corpus. This method may provide an optimized speech recognizer for a given task. We used phonemes, syllables and morphemes as the basic units for the optimization, and in the morpheme case reduced the vocabulary to half of the original word vocabulary size.
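A minimal sketch of the two ingredients, under our own simplifying assumptions (a unigram model and a greedy pairwise merge; the paper's actual optimization procedure is not reproduced). Note that perplexities of vocabularies with different granularities should be normalised to a common basis, e.g. per original word, before comparison.

```python
import math
from collections import Counter

def unigram_perplexity(tokens):
    """Per-token unigram perplexity of a corpus already segmented into
    vocabulary units (tokens: list of unit strings)."""
    counts, n = Counter(tokens), len(tokens)
    H = -sum(c / n * math.log2(c / n) for c in counts.values())
    return 2.0 ** H

def merge_best_pair(tokens):
    """One greedy vocabulary-optimization step: fuse the most frequent
    adjacent unit pair (e.g. two phonemes or syllables) into a longer unit."""
    (a, b), _ = Counter(zip(tokens, tokens[1:])).most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
            merged.append(a + b); i += 2
        else:
            merged.append(tokens[i]); i += 1
    return merged
```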
Philippe Gelin, Institut Eurecom (France)
Christian J. Wellekens, Institut Eurecom (France)
Indexing of video soundtracks is an important issue for navigation in multimedia databases. Based on wordspotting techniques, it must meet very constraining specifications: fast response to queries, concise processed speech information to limit storage memory, speaker-independent operation, and easy characterization of any word by its phonemic spelling. A solution based on phonemic lattices and on a division of the indexing process into an off-line and an on-line part is proposed in this paper. Previous work [1][2] based on frame labelling and a Maximum Likelihood criterion is now modified to take into account this new approach, based on a Maximum a Posteriori (MAP) criterion. The REMAP algorithm [3] implements this MAP criterion for training. It has several advantages, such as maximizing a global discriminant criterion, avoiding the difficult problem of detecting phoneme transitions during training, and being well suited to a hybrid Hidden Markov Model (HMM) and Neural Network (NN) approach.
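As an illustration of the on-line part, here is a sketch (ours, not the paper's algorithm) of querying an off-line phonemic lattice with a word's phonemic spelling: a dynamic program finds the best-scoring lattice path that spells the word. It assumes arcs are given as tuples and that node identifiers are topologically ordered.

```python
import math
from collections import defaultdict

def lattice_word_score(arcs, spelling):
    """Best log-score of a lattice path spelling the queried word.
    arcs: (start_node, end_node, phoneme, logprob) tuples built off-line;
    spelling: list of phoneme symbols for the queried word."""
    out = defaultdict(list)
    for s, e, ph, lp in arcs:
        out[s].append((e, ph, lp))
    nodes = sorted({s for s, _, _, _ in arcs} | {e for _, e, _, _ in arcs})
    best = {(n, 0): 0.0 for n in nodes}      # the word may start at any node
    for s in nodes:                          # relax arcs in topological order
        for e, ph, lp in out[s]:
            for pos, target in enumerate(spelling):
                if (s, pos) in best and target == ph:
                    cand = best[(s, pos)] + lp
                    if cand > best.get((e, pos + 1), -math.inf):
                        best[(e, pos + 1)] = cand
    return max((v for (n, p), v in best.items() if p == len(spelling)),
               default=-math.inf)
```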
Jinhai Cai, University of Melbourne (Australia)
Zhi-Qiang Liu, University of Melbourne (Australia)
Most of the well-known and widely used pitch determination algorithms are frame-based: they only consider the local stationarity of speech within the analysis frame. In contrast, our novel pitch determination algorithms employ steerable filters to obtain the direction of pitch change. The proposed algorithms therefore not only make full use of the information within an analysis frame, but also exploit the information from neighboring frames by taking advantage of the pitch direction. This allows us to use more than one frame to enhance pitch peaks for non-stationary, noisy speech signals. As a result, the proposed algorithms are superior to conventional methods in terms of accuracy and reliability, and are robust to noise. Moreover, the direction of pitch change can be estimated in different domains, so our algorithms can be applied in the time domain, the frequency domain, or both.
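A sketch of the multi-frame enhancement idea under our own assumptions: once a direction of pitch change d (in lag samples per frame) is available, the autocorrelation functions of neighbouring frames can be shifted along that direction before averaging, so the pitch peaks reinforce each other. Estimating d itself (with steerable filters, as in the paper) is not shown.

```python
import numpy as np

def enhanced_autocorr(frames, d):
    """Average the autocorrelations of several consecutive frames after
    shifting each one along the pitch-change direction d, so that the pitch
    peaks of neighbouring frames line up with the centre frame's peak.
    frames: list of equal-length float arrays."""
    acs = [np.correlate(f, f, mode="full")[len(f) - 1:] for f in frames]
    centre = len(frames) // 2
    out = np.zeros_like(acs[centre])
    for j, ac in enumerate(acs):
        shift = int(round((j - centre) * d))
        out += np.roll(ac, -shift)   # roll wraps around; edge lags approximate
    return out / len(acs)
```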
Tsuyoshi Moriyama, Keio University (Japan)
Hideo Saito, Keio University (Japan)
Shinji Ozawa, Keio University (Japan)
In this paper, we propose a linear model of the relationship between physical changes in speech and perceived emotional concepts. We use orthogonal bases instead of the emotional words and physical parameters themselves, in order to avoid dependence on how the words and parameters are selected. Furthermore, we take the emotions that listeners perceive from speech as the standard of emotional concepts, because the emotions that speakers intend depend on personality and temporary psychological state. An evaluation of the relative information indicates that the proposed linear model adequately represents the relationship between physical quantities and psychological quantities in speech.
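A minimal sketch of the construction as we read it: project both the physical parameters and the emotion ratings onto orthogonal (principal-component) bases, then fit a linear map between the two coefficient spaces. The use of SVD-based PCA and plain least squares is our assumption.

```python
import numpy as np

def fit_linear_emotion_model(P, E):
    """P: (n, p) physical speech parameters; E: (n, q) perceived-emotion
    ratings for the same n utterances. Returns the linear map A between the
    two orthogonal coefficient spaces plus the two bases."""
    def pca(X):
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt.T, Vt               # scores, orthogonal basis
    Sp, Vp = pca(P)
    Se, Ve = pca(E)
    A, *_ = np.linalg.lstsq(Sp, Se, rcond=None)  # Se ~ Sp @ A
    return A, Vp, Ve
```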
Wolfgang Wokurek, IMS Uni-Stuttgart (Germany)
Simultaneous recordings of the laryngograph signal and the speech signal, made in a non-reverberating environment, are investigated for acoustic evidence of the glottal opening within the microphone signal. It is demonstrated that high-resolution time-frequency analysis of the microphone signal by the smoothed pseudo Wigner distribution (SPWD) shows responses of the vocal tract to both the glottal closure and the glottal opening. On this basis, a convolution-based model of the relation between the laryngograph signal and the microphone signal is evaluated. It turns out that the microphone signal may be viewed as a filtered version of a power function of the laryngograph signal. Hence, such a nonlinearly processed laryngograph signal may be an appropriate model for the acoustic excitation of the vocal tract.
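For reference, a compact discrete sketch of the analysis tool named above, the smoothed pseudo Wigner distribution; the window lengths and Hann windows are arbitrary choices, and the frequency axis of the discrete kernel x[n+tau]x*[n-tau] is scaled by a factor of two.

```python
import numpy as np

def spwd(x, freq_bins=256, h_len=65, g_len=9):
    """Smoothed pseudo Wigner distribution: the Wigner-Ville kernel
    x[n+tau]*conj(x[n-tau]), smoothed over time by g and over lag by h,
    then Fourier-transformed over the lag variable tau."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    h, g = np.hanning(h_len), np.hanning(g_len)
    g /= g.sum()
    Lh, Lg = h_len // 2, g_len // 2
    m = np.arange(-Lg, Lg + 1)
    tfr = np.zeros((N, freq_bins), dtype=complex)
    for n in range(N):
        taumax = min(n - Lg, N - 1 - n - Lg, Lh, freq_bins // 2 - 1)
        for tau in range(-taumax, taumax + 1):   # empty when taumax < 0
            k = g @ (x[n + m + tau] * np.conj(x[n + m - tau]))  # time smoothing
            tfr[n, tau % freq_bins] = h[Lh + tau] * k           # lag window
    return np.real(np.fft.fft(tfr, axis=1))
```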
Werner Kozek, University of Vienna (Austria)
Hans Georg Feichtinger, University of Vienna (Austria)
We present a new approach to the linear representation of speech signals that combines desirable structure, computational efficiency and near-complete decorrelation. The basic principle is a statistically adapted, group-theoretical modification of the classical Gabor expansion. In contrast to traditional linear time-frequency (TF) representations, which always correspond to a separable tiling of the TF plane, we suggest the use of a hexagonal (thus nonseparable) tiling whose parameters are matched to the TF correlation of the speech signal. We estimate the TF correlation via a pitch-adapted Zak transform, motivated by modeling the vocal tract as an underspread system. The TF correlation determines both the optimum tiling and the optimum window.
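For concreteness, a sketch of the discrete Zak transform that underlies the estimation step (the pitch adaptation would amount to choosing the time step M to match the local pitch period; that choice is not shown here).

```python
import numpy as np

def zak_transform(x, M):
    """Discrete Zak transform of a length-L signal, L = M*K:
    Z(n, k) = sum_m x[n + m*M] * exp(-2j*pi*m*k/K), n = 0..M-1, k = 0..K-1."""
    x = np.asarray(x)
    assert len(x) % M == 0, "signal length must be a multiple of M"
    K = len(x) // M
    # row m of the reshape holds x[m*M : m*M + M]; the FFT over m gives Z(k, n)
    return np.fft.fft(x.reshape(K, M), axis=0).T   # (M, K), indexed (n, k)
```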
Filipp Korkmazskiy, Bell Labs (U.S.A.)
Biing-Hwang Juang, Bell Labs (U.S.A.)
Frank Soong, Bell Labs (U.S.A.)
This paper presents a new technique for modeling heterogeneous data sources, such as speech signals received via distinctly different channels. Such a scenario arises when an automatic speech recognition system is deployed in wireless telephony, where highly heterogeneous channels coexist and interoperate. The problem is that a simple model may become inadequate to describe the diversity of the signal accurately, resulting in unsatisfactory recognition performance. To deal with this problem, we propose a Generalized Mixture Model (GMM) approach. For speech signals in particular, we use mixtures of hidden Markov models (GMHMM, Generalized Mixture of HMMs). By applying discriminative training to the GMHMM, we obtained a 1.0% word error rate for the recognition of digit strings from the wireless database, compared with 1.4% for the conventional HMM-based discriminative technique.
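A sketch of the scoring side of a mixture of HMMs (the discriminative training procedure is not reproduced; the left-to-right start state, scalar Gaussian emissions and the component list format are our assumptions):

```python
import numpy as np
from scipy.stats import norm

def hmm_loglik(x, trans, means, sigma):
    """Log-domain forward algorithm for an HMM with Gaussian emissions,
    assumed to start in state 0. trans: (S, S) transition matrix."""
    log_a = np.log(trans + 1e-300)
    alpha = np.full(len(means), -np.inf)
    alpha[0] = norm.logpdf(x[0], means[0], sigma)
    for t in range(1, len(x)):
        alpha = norm.logpdf(x[t], means, sigma) + np.array(
            [np.logaddexp.reduce(alpha + log_a[:, j]) for j in range(len(means))])
    return np.logaddexp.reduce(alpha)

def gmhmm_loglik(x, components):
    """Generalized mixture of HMMs: log p(x) = log sum_c w_c p(x | HMM_c),
    e.g. one component HMM per channel condition.
    components: list of (weight, trans, means, sigma) tuples."""
    return np.logaddexp.reduce(np.array(
        [np.log(w) + hmm_loglik(x, A, mu, s) for w, A, mu, s in components]))
```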