ABSTRACT
A major challenge in speech recognition based on acoustic subword units is creating a lexicon which is robust to inter- and intra-speaker variations. In this paper we present two different approaches for incorporating simple word-level linguistic knowledge into the labelling step of the training procedure. The proposed systems also utilise a scheme for combined optimisation of baseforms and subword models. For the TI46 database, these methods are shown to greatly improve the performance compared to an acoustic subword based speech recogniser employing unsupervised labelling, and they are found to perform as well as systems utilising whole-word models and context independent phoneme models.
ABSTRACT
The performance of the Philips system for large vocabulary continuous speech recognition has been improved significantly by crossword N-phone modelling, enhanced clustering of HMM-states during training, consistent handling of untrained HMM-states during decoding and a new efficient crossword N-phone M-gram decoding strategy. We report word error rate reductions of up to 18% on various ARPA test sets as compared to our best within-word triphone system, based on Laplacian densities, Viterbi decoding and filterbank-LDA features. The following two issues are addressed: a) Transformation of a tree-organized bigram beam-search decoder into an efficient tree-organized decoder capable of handling long-span acoustic contexts as well as long-span language model contexts. b) State-clustering and generalizing of unseen contexts for the case of Laplacian emission probability density functions.
ABSTRACT
This paper explores the modelling of inter-frame dependence as a means of improving the performance of HMMs. More specifically, a model based on the IFD-HMM (Ming & Smith, 1996) that assumes a dependency upon both succeeding and preceding frames is proposed. The means by which a dependency upon succeeding frames might be integrated into an HMM framework are explored, and a mathematical outline of the proposed extension is given. The results of various tests aimed at exploring the consequences of introducing succeeding frame dependencies are included. It was found that a dependency upon succeeding frames enabled dynamic spectral information, not found in the preceding frames, to be usefully employed, resulting in a significant increase in recognition accuracy. Additionally, it was shown that modelling of the dynamic spectral information (using time-lag sequences) was at least as important as improved modelling of the instantaneous spectra (using multiple mixtures).
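To make the extension concrete, one way such a bidirectional dependency could enter the state emission term is as a mixture over lag-conditional densities (notation mine, not necessarily that of the IFD-HMM papers):

```latex
b_j(o_t) \;=\; \sum_{l \in \mathcal{L}} w_{jl}\, p\big(o_t \mid o_{t+l},\, s_j\big),
\qquad \mathcal{L} \subset \{\dots,-2,-1\} \cup \{1,2,\dots\}
```

Here negative lags condition on preceding frames and positive lags on succeeding frames; the original IFD-HMM uses preceding lags only, and the proposed extension adds the succeeding ones.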
ABSTRACT
The vast majority of work in continuous speech recognition uses phoneme-like units as the basic recognition component. The work presented here investigates the practicability of syllable-like units as the building blocks for recognition. A phonetically annotated telephony database is analysed at the syllable level, and a set of syllable-based HMMs are built. Refinements including the introduction of syllable-level bigram probabilities, word- and syllable-level insertion penalties, and the investigation of different model topologies are found to improve recogniser performance. It is found that the syllable-based recogniser gives recognition accuracies of over 60%, which compares with 35% as the baseline accuracy for monophone recognition. It is envisaged that practical applications of syllable recognition could be in a hybrid system, where the most common syllable HMMs would be used in conjunction with whole-word and phoneme models.
ABSTRACT
In this paper we present a new approach for a generalized tying of mixture components in continuous mixture-density HMM-based speech recognition systems. With an iterative pruning and splitting procedure for the mixture components, this approach offers a very accurate and detailed representation of the acoustic space while keeping the number of parameters reasonably small in favor of robust parameter estimation and fast decoding. Contrary to other approaches, it does not require a strict clustering of the pdfs into subsets that share their mixture components, so it is capable of providing more general and flexible types of mixture tying. We applied the new approach to a semi-continuous HMM (SCHMM) system for the Resource Management task, improved its recognition performance by 12%, and substantially accelerated the decoding owing to a much faster likelihood computation.
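As an illustration of the iterative pruning-and-splitting idea, a single pass over a mixture might look as follows. This is a generic heuristic sketch, not the paper's exact procedure; thresholds and the perturbation rule are assumptions.

```python
import numpy as np

def prune_and_split(weights, means, prune_thresh=1e-3, eps=0.2):
    # Remove components whose weight has collapsed (pruning).
    keep = weights > prune_thresh
    weights, means = weights[keep].copy(), means[keep].copy()
    # Split the dominant component into two perturbed copies (splitting).
    k = np.argmax(weights)
    offset = eps * np.sqrt(means.var(axis=0) + 1e-8)
    split_mean = means[k] + offset
    means[k] = means[k] - offset
    means = np.vstack([means, split_mean])
    weights[k] /= 2.0
    weights = np.append(weights, weights[k])
    return weights / weights.sum(), means
```

Iterating such passes, with re-estimation in between, lets the component count adapt to the acoustic space instead of being fixed in advance.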
ABSTRACT
In this paper several modifications of two methods for parameter reduction of Hidden Markov Models by state tying are described. The two methods are a data-driven, bottom-up clustering of triphone states [3, 9] and a top-down method that grows decision trees for triphone states [2, 10]. We investigate several aspects of state tying, such as the possible reduction of the word error rate by state tying, the consequences of different distance measures for the data-driven approach, and modifications of the original decision tree approach such as node merging. The tests were performed on the test corpora for the 5 000 word vocabulary of the WSJ November 92 task and on the evaluation corpora for the 3 000 word VERBMOBIL '95 task. State tying reduced the word error rate by 14% for the WSJ task and by 5% for the VERBMOBIL task.
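For the bottom-up variant, a typical data-driven distance is the log-likelihood loss incurred by pooling two single-Gaussian states, weighted by their occupancy counts. A sketch of one such measure follows (one of several plausible choices, not necessarily among those compared in the paper):

```python
import numpy as np

def merge_cost(n1, mu1, var1, n2, mu2, var2):
    # Pooled ML estimates (diagonal Gaussians) for the merged state.
    n = n1 + n2
    mu = (n1 * mu1 + n2 * mu2) / n
    var = (n1 * (var1 + (mu1 - mu) ** 2) + n2 * (var2 + (mu2 - mu) ** 2)) / n
    # Drop in total log-likelihood caused by the merge; greedy bottom-up
    # tying repeatedly merges the pair with the smallest cost.
    return 0.5 * (n * np.log(var).sum()
                  - n1 * np.log(var1).sum()
                  - n2 * np.log(var2).sum())
```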
ABSTRACT
In [1], we described how to improve Semi-Continuous Density Hidden Markov Models (SC-HMMs) to be as fast as Continuous Density HMMs (CD-HMMs), whilst outperforming them on large vocabulary recognition tasks with context independent models. In this paper, we extend our work with SC-HMMs to context dependent modelling. We propose a novel node splitting criterion in an approach with phonetic decision trees. It is based on a distance measure between the Gaussian mixture probability density functions (pdfs) as used in the final tied-state SC-HMMs, in contrast with other criteria, which are based on simplified pdfs to keep the algorithm complexity manageable. Results on the ARPA Resource Management task show that the proposed criterion outperforms two of these criteria with simplified pdfs.
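Since no closed form exists for distances between full Gaussian mixtures, a criterion of this kind is often approximated, for instance by a Monte-Carlo KL divergence between the two mixture pdfs. The sketch below shows that standard approximation only; it does not reproduce the paper's exact measure.

```python
import numpy as np

def gmm_logpdf(x, w, mu, var):
    # Log-density of a diagonal-covariance GMM at points x (n x d).
    log_comp = (-0.5 * (((x[:, None, :] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1) + np.log(w))
    return np.logaddexp.reduce(log_comp, axis=1)

def mc_kl(wp, mup, varp, wq, muq, varq, n=1000, seed=0):
    # Monte-Carlo estimate of KL(p || q): sample from p, compare log-densities.
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(wp), size=n, p=wp)
    x = mup[comp] + np.sqrt(varp[comp]) * rng.standard_normal((n, mup.shape[1]))
    return float(np.mean(gmm_logpdf(x, wp, mup, varp)
                         - gmm_logpdf(x, wq, muq, varq)))
```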
ABSTRACT
A technique for predicting triphones by concatenation of diphone or monophone models is studied. The models are connected using linear interpolation between endpoints of piece-wise linear parameter trajectories. Three types of spectral representation are compared: formants, filter amplitudes and cepstrum coefficients. The proposed technique lowers the spectral distortion of the phones for all three representations when different speakers are used for training and evaluation. The average error of the created triphones is lower in the filter and cepstrum domains than for formants; this is attributed to limitations of the Analysis-by-Synthesis formant tracking algorithm. A small improvement with the proposed technique is achieved for all representations in the task of reordering N-best sentence recognition candidate lists.
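A minimal sketch of the concatenation step, assuming each model is stored as a frames-by-dimensions parameter trajectory (the blend length is my choice, not taken from the paper):

```python
import numpy as np

def concat_with_interpolation(traj_a, traj_b, n_blend=5):
    # Join two piece-wise linear parameter trajectories by linearly
    # interpolating between the last frame of traj_a and the first of traj_b.
    alphas = np.linspace(0.0, 1.0, n_blend + 2)[1:-1, None]
    blend = (1 - alphas) * traj_a[-1] + alphas * traj_b[0]
    return np.vstack([traj_a, blend, traj_b])
```

The same routine applies whether the endpoint vectors hold formants, filter amplitudes or cepstrum coefficients; only the parameter space changes.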
ABSTRACT
This paper deals with the choice of suitable subword units (SWU) for an HMM-based speech recognition system. Using demisyllables (including phonemes) as base units, an inventory of domain-specific, larger-sized subword units, so-called macro-demisyllables (MDS), is created. A quality measure for the automatic decomposition of all single words into subword units is presented which takes into account the trainability of the chosen units. To create the whole inventory, an iterative procedure is applied with respect to the predefined quality measure. Each MDS is represented by a dedicated HMM. By tying the densities of specific phonemes, only the number of mixture coefficients and transitions increases in comparison to the original phoneme models. Recognition experiments within the German Verbmobil evaluation 1996 show that the new simple MDS models are as powerful as standard triphone models, although our MDS models are so far context-independent.
ABSTRACT
The aim of the research described in this paper is to overcome the modeling limitation of conventional hidden Markov models. We present a segmental model that consists of two elements. The first is a nonparametric representation of both the mean and variance trajectories, which describes the local dynamics. The second element is some parameterized transformation (e.g., random shift) of the trajectory that is global to the segment and models long-term variations such as speaker identity.
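In a form consistent with this description (notation mine), an observation at relative position t/T within a segment of length T could be written as:

```latex
y_t \;=\; \mu\!\left(\tfrac{t}{T}\right) \;+\; b \;+\; \epsilon_t,
\qquad b \sim \mathcal{N}\!\big(0, \sigma_b^2\big),
\quad \epsilon_t \sim \mathcal{N}\!\Big(0, \Sigma\!\left(\tfrac{t}{T}\right)\Big)
```

where the nonparametric trajectories mu(.) and Sigma(.) capture the local dynamics and the segment-global random shift b models long-term variation such as speaker identity.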
ABSTRACT
Recently, we have developed a probabilistic framework for segment-based speech recognition that represents the speech signal as a network of segments and associated feature vectors [2]. Although in general, each path through the network does not traverse all segments, we argued that each path must account for all feature vectors in the network. We then demonstrated an efficient search algorithm that uses a single additional model to account for segments that are not traversed. In this paper, we present two new extensions to our framework. First, we replace our acoustic segmentation algorithm with "segmentation by recognition," a probabilistic algorithm that can combine multiple contextual constraints towards hypothesizing only the most likely segments. Second, we generalize our framework to "near-miss modeling" and describe a search algorithm that can efficiently use multiple models to enforce contextual constraints across all segments in a network. We report experiments in phonetic recognition on the TIMIT corpus in which we achieve a diphone context-dependent error rate of 26.6% on the NIST core test set over 39 classes. This is a 12.8% reduction in error rate from our best previously reported result.
ABSTRACT
An approach to speech recognition using syllables as basic modelling units is compared to a state-of-the-art system employing phonemes. The technological framework is a hybrid HMM-ANN recognition system applied on small to medium vocabulary recognition tasks. Although the number of units to be classified nearly doubles, it is shown that the syllable can outperform the phoneme slightly but significantly in terms of unit classification capability, measured as frame error rate. Comparing the overall system performance (measured in word error rate), the phoneme-based system still performs clearly better for continuous speech tasks, while the syllable-based system is superior for isolated word recognition tasks on cross-database tests. This suggests the need for further work on the understanding of the interaction of knowledge sources at the frame, word and sentence levels in current recognition systems.
ABSTRACT
Segment-based speech recognition systems have been proposed in recent years to overcome some of the deficiencies of current state-of-the-art HMM-based systems. In this paper, we present a segmental speech recogniser where the speech trajectory segments are modelled using their mean, variance and shape. The shape is chosen from a codebook of global vector-quantised trajectories, obtained from uniformly segmented training utterances. Experiments were done for a speaker-dependent isolated word recognition application under different noise environments. The results show that this segment-based approach outperforms HMM-based speech recognition systems under similar test conditions. In adverse noise conditions, up to 34% error rate reduction was achieved.
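A sketch of the shape-selection step, assuming segments are resampled to the codebook's fixed length and normalised for mean and scale (the distance and normalisation here are illustrative choices of mine):

```python
import numpy as np

def best_shape(segment, codebook):
    # segment: frames x dims; codebook: K x n x dims of VQ trajectory shapes.
    n = codebook.shape[1]
    idx = np.linspace(0, len(segment) - 1, n).round().astype(int)
    shape = segment[idx] - segment.mean(axis=0)   # remove the segment mean
    shape = shape / max(shape.std(), 1e-8)        # remove the overall scale
    dists = ((codebook - shape) ** 2).sum(axis=(1, 2))
    return int(np.argmin(dists))                  # index of the closest shape
```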
ABSTRACT
Continuous Speech Recognition (CSR) systems usually include large sets of context dependent units to model contextual variations in the pronunciation of phones. The goal of this work was to obtain adequate sets of sub-lexical models by using acoustic information alone, excluding any prior phonological knowledge. At each iteration of a classical Viterbi training scheme, each acoustic model was split into a set of more accurate models. This approach was evaluated on a Spanish acoustic-phonetic decoding task. The experimental results showed that it produces recognition rates similar to those of classical triphones.
ABSTRACT
In this paper we introduce the demiphone as a contextual phonetic unit for continuous speech recognition. A phone is divided into two parts: a left demiphone that accounts for the left-side coarticulation and a right demiphone that copes with the right-side context. This new unit discards the dependence between the effects of the two side contexts, but provides better training of the transition between phones. The demiphone can be seen as a heuristic clustering of states that allows a more smoothed training of hidden Markov models and additionally supplies a simple way to create unseen triphones. We report experimental evidence that demiphones outperform the usual combination of triphones, right-side and left-side biphones, and monophones.
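The unit inventory follows mechanically from the phone labels; a sketch in HTK-style triphone notation (the notation is assumed, not taken from the paper):

```python
def triphone_to_demiphones(left, phone, right):
    # Split the triphone left-phone+right into a left demiphone (covering the
    # left-context half of the phone) and a right demiphone (right-context half).
    return f"{left}-{phone}", f"{phone}+{right}"

# An unseen triphone such as "s-ih+t" can be assembled from "s-ih" and "ih+t",
# each of which may be well trained from other triphones sharing that half.
```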
ABSTRACT
Aiming at robust speech recognition, we have proposed a framework for "phonological concept formation," which is the task of acquiring an efficient representation of phonemes from spoken word samples without using any transcriptions except for the lexical classification of the words. In order to implement this task, we propose the "piecewise linear segment lattice (PLSL)" model for phoneme representation. The structure of this model is a lattice of segments, each of which is represented as regression coefficients of feature vectors within the segment. In order to organize phone models, operations including division, concatenation, blocking and clustering are applied to the models. Feasibility of the method is discussed with experimental results for isolated word recognition. The recognition rate is improved by applying these operations.
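A minimal reading of "regression coefficients of feature vectors within the segment" is a per-dimension straight-line fit over the segment's frames; the sketch below is my simplification of that idea, not the PLSL model itself.

```python
import numpy as np

def segment_regression(frames):
    # frames: frames x dims. Fit x_d(t) = a_d + b_d * t per dimension and
    # return the concatenated intercepts and slopes as the segment descriptor.
    t = np.arange(len(frames), dtype=float)
    t = t - t.mean()                                   # centre the time axis
    denom = max((t ** 2).sum(), 1e-8)
    slope = (t[:, None] * (frames - frames.mean(axis=0))).sum(axis=0) / denom
    return np.concatenate([frames.mean(axis=0), slope])
```

Division, concatenation, blocking and clustering would then operate on lattices of such segment descriptors.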
ABSTRACT
Most state-of-the-art speech recognizers benefit from some kind of context information in their acoustic modeling [1][2][3]. The most common approach to context clustering is a divisive method that iteratively builds decision trees [4][5]. The problem of when to stop growing the tree is usually solved by choosing the maximum number of resulting models that can be supported by the available training data and/or computer memory and CPU power. In this paper we propose a new algorithm that not only offers an optimized stopping criterion, but also uses a likelihood-based distance measure that optimizes the likelihood of unseen training data at every split of a decision tree node. We evaluate our algorithm on the Wall Street Journal task, and show that it outperforms an algorithm using an entropy-based distance measure.
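The core quantity in a likelihood-based splitting criterion is the log-likelihood gain of a candidate question; a sketch assuming a single diagonal Gaussian per node (which may differ from the paper's exact estimator):

```python
import numpy as np

def gaussian_loglik(x):
    # Total log-likelihood of data x (frames x dims) under one ML-estimated
    # diagonal Gaussian, using the standard identity for the ML fit.
    var = x.var(axis=0) + 1e-8
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1).sum()

def split_gain(x_yes, x_no):
    # Gain of splitting a node's frames by a question; tree growing greedily
    # picks the question maximising this at each node.
    x_all = np.vstack([x_yes, x_no])
    return gaussian_loglik(x_yes) + gaussian_loglik(x_no) - gaussian_loglik(x_all)
```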
ABSTRACT
We present a novel method for recovering articulator movements from speech acoustics based on a constrained form [9] of a hidden Markov model. The model attempts to explain sequences of high dimensional data using smooth and slow trajectories in a latent variable space. The key insight is that this continuity constraint, when applied to speech, helps to solve the "ill-posed" problem of acoustic to articulatory mapping. By working with sequences of spectra rather than looking only at individual spectra, it is possible to choose between competing articulatory configurations for any given spectrum by selecting the configuration "closest" to those at nearby times. We present results of applying this algorithm to recover articulator movements from acoustics using data from the Wisconsin X-ray microbeam project [3]. We find that the recovered traces are highly correlated with the measured articulator movements under a single linear transform. Such recovered traces have the potential to be used for speech recognition, an application we are currently investigating.
ABSTRACT
In this work we describe several approaches to determining an effective set of subword units for modeling spoken Greek. We tried to form a concrete set of basic units capable of giving a unique phonetic transcription for every input utterance. The results of an extensive set of experiments showed that the use of units longer than phonemes can lead to a significant improvement in a system's performance. Three sets of subword units were finally formed, according to the way we combined the 42 phonemes of the Greek language. All three approaches showed better results than the baseline phoneme-based system, and the most effective proved to be the second approach, in which we used two-phoneme combinations of the types non-vowel/vowel and non-vowel/non-vowel. The phoneme recognition rate of the system increased by almost 9% (reaching 78.65%) in the best case compared to the baseline system.
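A sketch of the second approach's unit formation, with a toy vowel set and a greedy left-to-right pairing rule (both are assumptions of mine, for illustration only):

```python
VOWELS = {"a", "e", "i", "o", "u"}   # illustrative vowel set

def to_units(phonemes):
    # Merge non-vowel/vowel and non-vowel/non-vowel pairs into two-phoneme
    # units; lone vowels (and a trailing phoneme) stay as single units.
    units, i = [], 0
    while i < len(phonemes):
        if i + 1 < len(phonemes) and phonemes[i] not in VOWELS:
            units.append(phonemes[i] + phonemes[i + 1])
            i += 2
        else:
            units.append(phonemes[i])
            i += 1
    return units

# to_units(list("krima")) -> ['kr', 'i', 'ma']   (illustrative only)
```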
ABSTRACT
The problem addressed in this paper is enhancing a continuous speech recognizer's robustness to noise. For this purpose, the acoustic signal is filtered into several spectral bands, and independent recognition is performed in each band. The system then recombines the results given by each recognizer and delivers a unique solution. The main advantage of this method is that it considers the signal only in the bands which are relevant, ignoring spectral bands which are corrupted by noise. We are developing a speaker-independent continuous speech recognizer based on this principle.
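A sketch of the recombination step for one hypothesis, using a weighted sum of per-band log-likelihoods (one common rule in multi-band recognition; the system's actual rule may differ):

```python
import numpy as np

def recombine(band_loglikes, weights=None):
    # band_loglikes: one log-likelihood per spectral-band recognizer.
    # Down-weighting a band is how a noise-corrupted band can be ignored.
    band_loglikes = np.asarray(band_loglikes, dtype=float)
    if weights is None:
        weights = np.ones_like(band_loglikes) / len(band_loglikes)
    return float(np.dot(weights, band_loglikes))
```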
ABSTRACT
This paper presents a two-stage procedure, based on the Fisher criterion and automatic classification trees, for designing acoustic parameters (APs) that target phonetic features in the speech signal. This procedure and a subset of the TIMIT training set were used to develop acoustic parameters for the phonetic features: sonorant, syllabic, strident, palatal, alveolar, labial and velar. Results on a subset of the TIMIT test set show that the developed parameters achieve correct phonetic-feature classification rates in the 90% range, with the exception of stop-consonant place of articulation (labial, alveolar and velar), where correct classification is about 73%. Furthermore, it is shown that by basing the acoustic parameters on relative measures (e.g. an acoustic parameter that measures energy in a frequency band relative to energy in the same band at another time instant), the effect of interspeaker variability (e.g. gender) on the parameters is reduced.
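The first stage can be illustrated by the Fisher criterion for a scalar candidate AP over the two phonetic-feature classes (a standard form of the criterion; the paper's exact usage may differ):

```python
import numpy as np

def fisher_ratio(x_pos, x_neg):
    # Between-class scatter over within-class scatter for one candidate AP,
    # given its values on positive and negative phonetic-feature examples;
    # higher ratios indicate APs that separate the classes better.
    m1, m2 = x_pos.mean(), x_neg.mean()
    v1, v2 = x_pos.var(), x_neg.var()
    return (m1 - m2) ** 2 / (v1 + v2 + 1e-8)
```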