ABSTRACT
A system for discriminative feature and model design is presented for automatic speech recognition. Training based on minimum classification error (MCE) with a single objective function is applied to design a set of parallel networks performing feature transformation and a set of hidden Markov models performing speech recognition. This paper compares linear and non-linear functional transformations applied to conventional recognition features, such as spectrum or cepstrum. It also provides a framework for integrated feature and model training when using class-specific transformations. Experimental results on telephone-based connected digit recognition are presented.
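The MCE criterion is commonly built from a misclassification measure passed through a smooth sigmoid loss. A minimal sketch of that common form, assuming a soft-max over rival classes (the abstract does not give the paper's exact objective; `eta` and `alpha` are illustrative smoothing parameters):

```python
import math

def mce_loss(scores, correct, eta=2.0, alpha=1.0):
    """Smooth MCE loss for one token (a common textbook form; the
    paper's exact formulation may differ).
    scores: per-class discriminant scores g_k; correct: true class index."""
    g_true = scores[correct]
    # Smoothed score of the competing classes (soft-max over rivals).
    rivals = [g for k, g in enumerate(scores) if k != correct]
    g_rival = (1.0 / eta) * math.log(
        sum(math.exp(eta * g) for g in rivals) / len(rivals))
    d = g_rival - g_true                        # misclassification measure
    return 1.0 / (1.0 + math.exp(-alpha * d))   # smooth 0/1 loss
```

Because the loss is differentiable, its gradient can be propagated through both the feature-transformation networks and the HMM scores, which is what permits training both under a single objective.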
ABSTRACT
In this paper we present a context-dependent hybrid MMI-connectionist / Hidden Markov Model (HMM) speech recognition system for the Wall Street Journal (WSJ) database. The hybrid system is built from a neural network, which is used as a vector quantizer (VQ), and an HMM with discrete probability density functions, which has the advantage of faster decoding. The neural network is trained with an algorithm that tries to maximize the mutual information between the classes of the input features (e.g. phones, triphones, etc.) and the neural firing sequence of the network. The system has been trained on the 1992 WSJ corpus (si-84). Tests were performed on the five- and twenty-thousand word, speaker-independent (si_et) tasks. The error rates of the new context-dependent neural network are 29% lower (relative) than those of a standard (k-means) discrete system, and the rates are very close to those of the best continuous/semi-continuous HMM speech recognizers.
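The mutual information the VQ network is trained to maximize can be illustrated with an empirical estimator over (class, codeword) pairs. A minimal sketch; `mutual_information` is a hypothetical helper showing the criterion, not the paper's training algorithm:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information I(C; Y), in bits, between class labels C
    (e.g. phone identities) and VQ codeword indices Y, from (c, y) pairs.
    The MMI-VQ idea is to choose the quantizer maximizing this quantity."""
    n = len(pairs)
    joint = Counter(pairs)                    # counts of (c, y)
    pc = Counter(c for c, _ in pairs)         # marginal counts of c
    py = Counter(y for _, y in pairs)         # marginal counts of y
    return sum((nxy / n) * math.log2((nxy * n) / (pc[c] * py[y]))
               for (c, y), nxy in joint.items())
```

A quantizer whose codewords are perfectly predictable from the class reaches the class entropy; one that fires independently of the class yields zero, so the criterion directly rewards class-discriminative codebooks.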
ABSTRACT
We present a neural fuzzy network architecture devoted to the recognition of specific segmental phonetic features. A neural fuzzy network allows us to select the best acoustic parameters associated with each feature and to compute a phonetic segmental plausibility score. Segments result from the alignments provided by an allophone-based Markov model. These segmental scores are then processed by a statistical post-processing system for reordering the N-best HMM hypotheses. This post-processing is based on the computation of segmental scores for each solution under the hypotheses of a correct solution and of an incorrect solution. Moreover, we present comparison results between this neural fuzzy network architecture and a classical one on 3 speaker-independent telephone databases.
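Scoring each N-best solution under a correct-solution hypothesis and an incorrect-solution hypothesis amounts to a log-likelihood-ratio rescoring. A minimal sketch, assuming hypothetical per-segment score densities `p_correct` and `p_incorrect` (the abstract does not specify the paper's statistical models):

```python
import math

def rescore_nbest(hypotheses, p_correct, p_incorrect):
    """Reorder N-best hypotheses by the log-likelihood ratio of their
    segmental scores under a 'correct solution' density vs an
    'incorrect solution' density (both hypothetical here).
    hypotheses: list of (text, [segment scores])."""
    def llr(scores):
        return sum(math.log(p_correct(s)) - math.log(p_incorrect(s))
                   for s in scores)
    return sorted(hypotheses, key=lambda h: llr(h[1]), reverse=True)
```

With densities estimated on held-out correct and incorrect alignments, the hypothesis whose segment scores look most like those of correct solutions is promoted to the top of the list.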
ABSTRACT
This work presents experiments on four segmental training algorithms for mixture density HMMs. The segmental versions of SOM and LVQ3 suggested by the author are compared against the conventional segmental K-means and the segmental GPD. The recognition task used as a test bench is speaker-dependent but vocabulary-independent automatic speech recognition. The output density function of each state in each model is a mixture of multivariate Gaussian densities. The neural network methods SOM and LVQ are applied to learn the parameters of the density models from the mel-cepstrum features of the training samples. Segmental training improves the segmentation and the model parameters in turn to obtain the best possible result, because the segmentation and the segment classification depend on each other. It suffices to start the training process by dividing the training samples approximately into phoneme samples.
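The alternation between re-segmentation and parameter re-estimation can be sketched as a toy segmental K-means with one scalar mean per phoneme model (illustrative only; the paper's state densities are mixtures of multivariate Gaussians, and the names and 1-D boundary search below are assumptions):

```python
def segmental_kmeans(frames, n_models, init_bounds, n_iter=10):
    """Toy segmental K-means: alternate parameter re-estimation with
    re-segmentation, starting from approximate phoneme boundaries.
    frames: list of scalar features; init_bounds: initial boundaries
    (len n_models + 1); returns (model means, final boundaries)."""
    bounds = list(init_bounds)
    for _ in range(n_iter):
        # Parameter step: each model's mean from its current segment.
        means = [sum(frames[bounds[i]:bounds[i + 1]]) /
                 (bounds[i + 1] - bounds[i]) for i in range(n_models)]
        # Segmentation step: move each internal boundary to minimize
        # the squared error of the two adjacent segments.
        for i in range(1, n_models):
            lo, hi = bounds[i - 1] + 1, bounds[i + 1] - 1
            def cost(b):
                left = sum((x - means[i - 1]) ** 2
                           for x in frames[bounds[i - 1]:b])
                right = sum((x - means[i]) ** 2
                            for x in frames[b:bounds[i + 1]])
                return left + right
            bounds[i] = min(range(lo, hi + 1), key=cost)
    return means, bounds
```

Each step can only lower the total distortion, which is why a rough initial division into phoneme samples is enough to start the process.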
ABSTRACT
Connectionist models can be considered an encouraging approach to Example-Based Machine Translation. However, the neural translators developed in the literature are quite complex and require great human effort to classify and prepare training data. This paper presents an effective and simpler text-to-text connectionist translator with which translation from the source to the target language can be approached directly, automatically and successfully. The neural system, which is based on an Elman Simple Recurrent Network, was trained to tackle a simple pseudo-natural Machine Translation task.
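The Elman Simple Recurrent Network underlying such a translator can be sketched as follows; the class name, layer sizes, initialization, and forward-only scope are assumptions for illustration, not the paper's configuration:

```python
import math
import random

class ElmanSRN:
    """Minimal Elman simple recurrent network (forward pass only): the
    hidden activations are copied to context units and fed back at the
    next time step, giving the network a memory of the sequence so far."""
    def __init__(self, n_in, n_hid, n_out, seed=0):
        rnd = random.Random(seed)
        w = lambda r, c: [[rnd.uniform(-0.5, 0.5) for _ in range(c)]
                          for _ in range(r)]
        self.w_ih = w(n_hid, n_in)    # input   -> hidden
        self.w_ch = w(n_hid, n_hid)   # context -> hidden (the recurrence)
        self.w_ho = w(n_out, n_hid)   # hidden  -> output
        self.context = [0.0] * n_hid

    def step(self, x):
        sig = lambda a: 1.0 / (1.0 + math.exp(-a))
        h = [sig(sum(wi * xi for wi, xi in zip(row_i, x)) +
                 sum(wc * ci for wc, ci in zip(row_c, self.context)))
             for row_i, row_c in zip(self.w_ih, self.w_ch)]
        self.context = h              # copy hidden state to context units
        return [sig(sum(wo * hi for wo, hi in zip(row, h)))
                for row in self.w_ho]
```

Feeding source-language symbols in one per step and reading target-language symbols from the output layer is the text-to-text arrangement the abstract describes; the context loop is what lets identical inputs produce position-dependent outputs.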
ABSTRACT
The phonetic context has a large effect on phonemes in a continuous speech signal [1]. Therefore, recognition systems that model allophones using context-dependent Hidden Markov Models have been implemented [4]. Second-order HMMs (HMM2s) have a great ability for segmentation in the temporal domain [6][7] but have some difficulties in recognition because MLE (Maximum Likelihood Estimation) training is not discriminative, whereas discrimination is one of the strengths of Artificial Neural Network (ANN) models. In the last three years we have developed a new ANN model named OWE (Orthogonal Weight Estimator) [10][11]. The principle of the OWE is an ANN that classifies an input pattern according to its contextual environment. This new ANN architecture tackles the problem of training context-dependent behaviour. Roughly, the principle is based on a main MLP (Multilayered Perceptron) in which each synaptic weight connection value is estimated by another MLP (an OWE) with respect to a context representation. In this paper, we present two hybrid systems for phoneme recognition. In both systems, 48 context-independent HMM2s segment the input signal. In the first system, the OWE performs the labelling of segments and, in the second system, the OWE outputs are the input frames of the HMM2s. Experiments on TIMIT range from 56% to 67% accuracy on the 48-phoneme set.
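The OWE idea, in which each weight of the main network is computed from a context representation rather than stored directly, can be sketched as follows. For brevity this reduces the main network to a single layer and each weight estimator to a linear map (the paper uses full MLP estimators, so this is a hypothetical simplification):

```python
import math

def owe_forward(x, context, owe_weights):
    """Orthogonal Weight Estimator sketch: every weight w_ji of the main
    classifier is produced on the fly from the context vector by its own
    small estimator, so the same input can be classified differently in
    different contexts.
    owe_weights[j][i]: estimator coefficients for main weight w_ji."""
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    out = []
    for row in owe_weights:                   # one output unit per row
        a = 0.0
        for xi, est in zip(x, row):
            w = sum(e * c for e, c in zip(est, context))  # weight from context
            a += w * xi
        out.append(sig(a))
    return out
```

The same mechanism covers both hybrid systems in the abstract: the context vector can encode the neighbouring segments when the OWE labels segments, or the OWE outputs can replace the frames fed to the HMM2s.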