ABSTRACT
In this paper, we propose and investigate a new approach to using multiple time-scale information in automatic speech recognition (ASR) systems. In this framework, we use a particular HMM formalism able to process different input streams and to recombine them at temporal anchor points. While the phonological level of recombination has to be defined a priori, the optimal temporal anchor points are obtained automatically during recognition. In the current approach, these parallel cooperative HMMs focus on different dynamic properties of the speech signal, defined on different time scales. The speech signal is thus described in terms of several information streams, each stream resulting from a particular way of analyzing the signal. More specifically, in the current work, models aimed at capturing the syllable-level temporal structure are used in parallel with classical phoneme-based models. Tests on different continuous speech databases show significant performance improvements, motivating further research into efficiently incorporating large-time-span information, of the order of 200 ms, into our standard 10 ms, phone-based ASR systems.
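A minimal sketch of the anchor-point recombination idea described above, assuming two precomputed per-frame, per-unit log-likelihood streams and hypothetical syllable-boundary anchor frames; the function, weights, and data are illustrative, not the authors' implementation:

```python
import numpy as np

def recombine_at_anchors(stream_scores, anchors, weights):
    """stream_scores: list over streams of [n_frames, n_units] log-lik arrays.

    Within each inter-anchor segment, every stream scores each candidate
    unit independently; at the anchor the weighted stream scores are summed
    per unit and the best unit is chosen, so the streams are forced to
    agree on unit identity only at the anchor points.
    """
    n_frames = stream_scores[0].shape[0]
    bounds = [0] + list(anchors) + [n_frames]
    total = 0.0
    for start, end in zip(bounds[:-1], bounds[1:]):
        per_unit = sum(w * s[start:end].sum(axis=0)
                       for w, s in zip(weights, stream_scores))
        total += per_unit.max()
    return total

# Toy usage: a 10 ms phone-scale stream and a syllable-scale stream over
# 100 frames and 5 candidate units, recombined at three invented anchors.
rng = np.random.default_rng(0)
phone_ll = rng.normal(-2.0, 0.5, size=(100, 5))
syllable_ll = rng.normal(-2.0, 0.5, size=(100, 5))
print(recombine_at_anchors([phone_ll, syllable_ll],
                           anchors=[20, 55, 80], weights=(0.5, 0.5)))
```

In a full recognizer the per-segment score would come from each stream's own Viterbi pass rather than a plain frame sum, but the recombine-only-at-anchors structure is the same.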
ABSTRACT
In our previous work, we proposed a re-entry modeling of missing phonemes that are lost during the search process. In re-entry modeling, the recognition results are post-processed and the originally recognized phoneme sequences are converted to new phoneme sequences using HMM-state confusion characteristics spanning several phonemes. We confirmed that HMM-state confusions are effective for re-entry modeling. In this paper, we propose re-entry modeling during recognition using a multiple-pronunciation dictionary in which pronunciations are added using HMM-state confusion characteristics. The pronunciations are added considering the part-of-speech (POS) dependency of the confusion characteristics; phoneme-level confusions alone cannot cope with confusions that depend on the preceding and following context of the misrecognized sequences. As a result of continuous recognition experiments, we confirmed that the following two points are effective in improving word recognition rates: (1) confusions are expressed by HMM-state sequences, and (2) pronunciations are added considering the part-of-speech dependency of the confusion characteristics.
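A hedged sketch of the POS-dependent dictionary expansion step; the lexicon layout, rule format, and all entries below are hypothetical, and in the paper the substitution rules come from HMM-state confusion statistics rather than being hand-written:

```python
def add_confusion_variants(lexicon, rules):
    """Augment a pronunciation lexicon with confusion-derived variants.

    lexicon: {word: (pos, [pronunciations as space-separated phone strings])}
    rules:   {pos: [(recognized_seq, canonical_seq), ...]}
    For each word, whenever a canonical sub-sequence appears in a base
    pronunciation, the frequently recognized variant is added as an
    alternative pronunciation for that word's part of speech.
    """
    expanded = {}
    for word, (pos, prons) in lexicon.items():
        variants = list(prons)
        for recognized, canonical in rules.get(pos, []):
            for pron in prons:
                variant = pron.replace(canonical, recognized)
                if variant != pron and variant not in variants:
                    variants.append(variant)
        expanded[word] = (pos, variants)
    return expanded

# Toy usage with invented entries and one invented noun-specific confusion.
lexicon = {"city": ("noun", ["s ih t iy"])}
rules = {"noun": [("d iy", "t iy")]}   # "t iy" often recognized as "d iy"
print(add_confusion_variants(lexicon, rules))
# {'city': ('noun', ['s ih t iy', 's ih d iy'])}
```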
ABSTRACT
In this paper we describe our experience with bottom-up and top-down state clustering techniques for the definition and training of robust acoustic-phonetic units. Using as a test-bed a speaker-independent, telephone-speech, isolated-word recognition task with a vocabulary of 475 city names, we show that similar performance is obtained by tying HMM states with either an agglomerative or a decision-tree clustering approach. Moreover, better results are obtained by selecting a priori the set of states that can be clustered, rather than relying solely on their acoustic similarity. For the bottom-up approach, a stopping criterion for the furthest-neighbor clustering procedure is proposed that does not require a threshold. For the top-down approach, we show that a carefully selected impurity function allows look-ahead search to outperform the classical decision-tree growing algorithm.
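The following sketch shows the furthest-neighbor (complete-linkage) agglomerative scheme on toy scalar state statistics. The paper's threshold-free stopping rule is not reproduced here; as a stand-in, this version cuts the merge sequence at the largest jump in merge cost, which is an assumption of ours:

```python
import numpy as np

def furthest_neighbor_clusters(points):
    """Complete-linkage agglomerative clustering with a jump-based cut.

    Repeatedly merges the two closest clusters, where cluster distance is
    the maximum pairwise point distance (furthest neighbor), recording each
    merge cost; the returned partition is the one just before the largest
    increase in merge cost.
    """
    clusters = [[i] for i in range(len(points))]
    history = []  # (merge_cost, partition snapshot before the merge)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = max(abs(points[i] - points[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        history.append((cost, [c[:] for c in clusters]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    costs = [h[0] for h in history]
    cut = int(np.argmax(np.diff(costs))) + 1 if len(costs) > 1 else 0
    return history[cut][1]

# Toy usage: two well-separated groups of "state statistics".
print(furthest_neighbor_clusters([0.0, 0.1, 0.2, 5.0, 5.1]))
# [[0, 1, 2], [3, 4]]
```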
ABSTRACT
In this work we compare two parameter optimization techniques for discriminative training using the MMI criterion: the extended Baum-Welch (EBW) algorithm and the generalized probabilistic descent (GPD) method. Using Gaussian emission densities, we found special expressions for the step sizes in GPD, leading to re-estimation formulae very similar to those derived for the EBW algorithm. Results were produced on both the TI digitstring corpus of continuously spoken American English digit strings and the SieTill corpus of continuously spoken German digit strings. The results for the two techniques do not show significant differences. These experimental results support the strong link between EBW and GPD expected from the analytic comparison.
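For a single scalar Gaussian, the EBW re-estimation the abstract refers to has a compact standard form. The sketch below uses those standard update equations with invented accumulator values in place of real lattice statistics:

```python
def ebw_gaussian_update(mu, var, num, den, D):
    """Extended Baum-Welch update for one scalar Gaussian.

    num, den: (occupancy, sum_x, sum_x2) accumulated over the numerator
    (reference transcription) and denominator (competing hypotheses).
    D: smoothing constant, chosen large enough to keep the variance positive.
    """
    denom = num[0] - den[0] + D
    mu_new = (num[1] - den[1] + D * mu) / denom
    var_new = (num[2] - den[2] + D * (var + mu ** 2)) / denom - mu_new ** 2
    return mu_new, var_new

# Toy usage with invented accumulator values.
print(ebw_gaussian_update(mu=0.0, var=1.0,
                          num=(10.0, 12.0, 30.0),
                          den=(9.0, 8.0, 20.0),
                          D=20.0))
```

The paper's observation is that a GPD gradient step on the MMI criterion, with suitably chosen step sizes, reduces to updates of essentially this same form.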
ABSTRACT
Clustering of HMM states using decision trees is generalized to take into account high-level knowledge sources, in order to better model co-articulation effects in large-vocabulary continuous speech recognition. VQ models are used to reduce the computational cost of constructing the decision trees. The search algorithm is designed so that it can supply this more general type of information to the decision trees, allowing much more complex acoustic-phonetic models to be used without compromising speed. Experiments with a 30k-word dictionary on the Wall Street Journal (WSJ) task show that the word error rate can be reduced by considering the additional knowledge sources.
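Decision-tree state clustering of this kind is driven by the likelihood gain of candidate questions over pooled Gaussian statistics. A minimal sketch of that gain computation, with a 1-D toy example (the triples and numbers are illustrative, and real systems pool full-dimensional sufficient statistics):

```python
import numpy as np

def cluster_loglik(n, sum_x, sum_x2):
    """Log-likelihood of n pooled frames under their single ML Gaussian."""
    var = max(sum_x2 / n - (sum_x / n) ** 2, 1e-6)
    return -0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)

def question_gain(parent, yes, no):
    """Likelihood gain from splitting a leaf with one question.

    Each argument is a (count, sum_x, sum_x2) triple; a question (phonetic
    or, as generalized in the paper, a higher-level knowledge source) is
    selected when its gain is the largest among all candidate questions.
    """
    return cluster_loglik(*yes) + cluster_loglik(*no) - cluster_loglik(*parent)

# Toy usage: a split separating two modes yields a large positive gain.
yes, no = (50, 50.0, 60.0), (50, -50.0, 60.0)
parent = (100, 0.0, 120.0)
print(question_gain(parent, yes, no))
```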
ABSTRACT
In this study, we developed a modified maximum likelihood (ML) algorithm for efficient computation in implementing minimum classification error (MCE)-style training to optimally estimate the state-dependent polynomial coefficients of the trended HMM. We devised a new discriminative training method that controls the influence of outliers in the training data on the constructed models. The resulting models appear to provide correct recognition of confusable patterns, and for alphabet recognition tasks, outlier emphasis yielded improved performance: an error-rate reduction of 14% is achieved for the linear-trend models and 7.5% for the constant-trend models over traditional ML-trained models.
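The two ingredients named above can be sketched briefly: the trended HMM's polynomial state mean, and the standard MCE sigmoid loss, whose slope is the usual lever for controlling how strongly outliers influence training. The study's specific outlier-emphasis scheme is not reproduced here, and the parameter names are illustrative:

```python
import numpy as np

def trended_mean(coeffs, t, onset=0):
    """State-dependent polynomial trend: mu_t = sum_i b_i * (t - onset)**i."""
    return sum(b * (t - onset) ** i for i, b in enumerate(coeffs))

def mce_loss(g_correct, g_competitors, gamma=1.0, eta=2.0):
    """Smoothed misclassification measure passed through a sigmoid.

    gamma sets the sigmoid slope; tokens far on the wrong side of the
    decision boundary (outliers) saturate the sigmoid, so the slope
    determines how much they contribute to the training gradient.
    """
    g = np.asarray(g_competitors, dtype=float)
    anti = np.log(np.mean(np.exp(eta * g))) / eta   # soft-max of competitors
    d = -g_correct + anti                            # misclassification measure
    return 1.0 / (1.0 + np.exp(-gamma * d))

# Toy usage: a linear-trend state mean and a near-boundary token.
print(trended_mean([0.5, 0.1], t=7, onset=4))   # 0.5 + 0.1 * 3 = 0.8
print(mce_loss(g_correct=-10.0, g_competitors=[-10.5, -12.0]))
```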