ICASSP '98 Abstracts - SP1


 
SP1.1

   
On the Robust Incorporation of Formant Features into Hidden Markov Models for Automatic Speech Recognition
P. Garner, W. Holmes  (DERA, UK)
A formant analyser is interpreted probabilistically via a noisy channel model. This leads to a robust method of incorporating formant features into hidden Markov models for automatic speech recognition. Recognition equations follow trivially, and Baum-Welch-style re-estimation equations are derived. Experimental results provide empirical evidence of convergence and demonstrate the effectiveness of the technique: including formant features yields recognition performance advantages over using cepstral features alone.
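As an illustration of the noisy-channel idea (our own sketch, not code from the paper): if the channel corrupting the true formant is modeled as additive Gaussian noise and each HMM state holds a Gaussian over true formant values, the likelihood of a measured formant follows by integrating out the true formant, which simply adds the two variances. A minimal numpy sketch, with the function names and the diagonal-covariance assumption ours:

    import numpy as np

    def formant_log_likelihood(f_meas, state_mean, state_var, chan_var):
        # Integrating N(f_meas; f, chan_var) * N(f; state_mean, state_var) over
        # the true formant f gives N(f_meas; state_mean, state_var + chan_var).
        var = state_var + chan_var
        return -0.5 * np.sum(np.log(2 * np.pi * var)
                             + (f_meas - state_mean) ** 2 / var)

    def state_log_likelihood(cep_ll, f_meas, state_mean, state_var, chan_var):
        # Combine the usual cepstral-stream score with the formant-stream score.
        return cep_ll + formant_log_likelihood(f_meas, state_mean,
                                               state_var, chan_var)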
 
SP1.2

   
Exploiting Acoustic Feature Correlations by Joint Neural Vector Quantizer Design in a Discrete HMM System
C. Neukirchen, D. Willett, S. Eickeler, S. Müller  (Gerhard-Mercator-University Duisburg, Germany)
In previous work on hybrid speech recognizers with discrete HMMs, we showed that VQs trained according to an MMI criterion are well suited to ML-estimated Bayes classifiers. This holds only for single-VQ systems. In this paper we extend the theory to speech recognizers with multiple VQs, leading to a joint training criterion for arbitrary multiple neural VQs that takes the inter-VQ correlation into account during parameter estimation. A gradient-based joint training method is derived. Experimental results indicate that inter-VQ correlations can degrade recognition performance; the joint multiple-VQ training decorrelates the quantizer labels and improves system performance. In addition, the new training criterion allows the feature vector to be split into multiple streams less carefully, since the streams no longer have to be statistically independent. In particular, using highly correlated features in conjunction with the novel training criterion leads to substantial gains in recognition performance on the speaker-independent Resource Management database, giving the lowest error rate, 5.0%, we have ever obtained in this framework.
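To make the MMI criterion concrete (a sketch under our own assumptions, not the authors' code): the criterion measures the mutual information between quantizer labels and phone classes, and for multiple VQs the joint criterion scores the tuple of stream labels, which is how inter-VQ correlation enters the objective. An empirical estimate from label counts:

    import numpy as np

    def empirical_mi(y, c, ny, nc):
        # Empirical mutual information I(Y; C) between VQ labels y and classes c.
        joint = np.zeros((ny, nc))
        np.add.at(joint, (y, c), 1.0)
        p = joint / joint.sum()
        marg = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
        nz = p > 0
        return float(np.sum(p[nz] * np.log(p[nz] / marg[nz])))

    def joint_label(ys, ny):
        # Mixed-radix code for the tuple of K stream labels (ny codewords each).
        # Scoring empirical_mi(joint_label(ys, ny), c, ny**K, nc) rewards streams
        # that carry complementary rather than redundant class information.
        code = np.zeros_like(ys[0])
        for y in ys:
            code = code * ny + y
        return code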
 
SP1.3

   
A NN/HMM Hybrid for Continuous Speech Recognition with a Discriminant Nonlinear Feature Extraction
G. Rigoll, D. Willett  (Duisburg University, Germany)
This paper deals with a hybrid NN/HMM architecture for continuous speech recognition. We present a novel approach to setting up a neural linear or nonlinear feature transformation that is used as a preprocessor ahead of the HMM system's RBF network to produce discriminative feature vectors well suited to modeling by mixtures of Gaussian distributions. To avoid the computational cost of discriminative training of a context-dependent system, we propose training a discriminant neural feature transformation on a system of low complexity and reusing this transformation in the context-dependent system to output improved feature vectors. The resulting hybrid system is an extension of a state-of-the-art continuous HMM system and is, in fact, the first hybrid system capable of outperforming such standard systems in recognition accuracy without discriminative training of the entire system. In experiments on the Resource Management 1000-word continuous speech recognition task, we achieved a relative error reduction of about 10% over a recognition system that was already among the best ever reported on this task.
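The reuse pattern described above can be sketched as follows (illustrative only; the network shape and all names are our assumptions): a small nonlinear transform is trained discriminatively on a context-independent system, then frozen and applied as a fixed preprocessor when the context-dependent Gaussian-mixture system is trained.

    import numpy as np

    class FeatureTransform:
        # One-hidden-layer nonlinear feature transform, applied as a frozen
        # preprocessor once its weights have been trained discriminatively.
        def __init__(self, W1, b1, W2, b2):
            self.W1, self.b1, self.W2, self.b2 = W1, b1, W2, b2

        def __call__(self, x):
            return np.tanh(x @ self.W1 + self.b1) @ self.W2 + self.b2

    # Pattern (hypothetical calls): train the transform on a low-complexity,
    # context-independent system, then reuse it unchanged:
    #   features_cd = transform(features)              # same frozen transform
    #   train_cd_gmm_hmm(features_cd)                  # standard ML training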
 
SP1.4

   
Incorporating Voice Onset Time to Improve Letter Recognition Accuracies
P. Niyogi, P. Ramesh  (Bell Labs, Lucent Technologies, USA)
We consider the possibility of incorporating distinctive features into a statistically based speech recognizer. We develop a two-pass strategy for recognition, with a standard HMM-based first pass followed by a second pass that performs an alternative analysis to extract class-specific features. For the voiced/voiceless distinction on stops in an alphabet recognition task, we show that a linguistically motivated acoustic feature, the voice onset time (VOT), exists, provides better separability than standard spectral measures, and can be automatically extracted from the signal, reducing error rates by 48.7% over state-of-the-art HMM systems.
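A crude, self-contained sketch of automatic VOT measurement in the spirit of the second pass (our own construction, not the authors' method): locate the burst as the largest jump in high-frequency energy, then find the first subsequent frame with strong periodicity in the pitch range; the VOT is the time between the two.

    import numpy as np

    def estimate_vot(x, sr, win=0.025, hop=0.005):
        w, h = int(win * sr), int(hop * sr)
        frames = [x[i:i + w] for i in range(0, len(x) - w, h)]
        # First-difference energy emphasizes high frequencies; its largest
        # frame-to-frame jump is taken as the burst onset.
        hf = np.array([np.sum(np.diff(f) ** 2) for f in frames])
        burst = int(np.argmax(np.diff(hf))) + 1

        def periodicity(f):
            ac = np.correlate(f, f, 'full')[len(f) - 1:]
            lo, hi = int(sr / 400), int(sr / 80)   # 80-400 Hz pitch range
            return ac[lo:hi].max() / (ac[0] + 1e-9)

        for i in range(burst, len(frames)):
            if periodicity(frames[i]) > 0.5:       # voicing onset found
                return (i - burst) * hop           # VOT in seconds
        return None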
 
SP1.5

   
On the Use of Normalized LPC Error Towards Better Large Vocabulary Speech Recognition Systems
R. Chengalvarayan  (Lucent Technologies, USA)
Linear prediction (LP) analysis is widely used in speech recognition to represent the short-time spectral envelope of speech. The prediction residual is usually ignored in LP-based speech recognition systems. In this study, the normalized residual error based on LP is introduced, and recognizer performance is further improved by adding this new feature along with its first- and second-order derivative parameters. The convergence of the training procedure based on the minimum classification error (MCE) approach is investigated, and experimental results on a city-name recognition task demonstrate an 8% string error rate reduction using the extended feature set compared with the conventional feature set.
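A minimal sketch of the normalized LP residual error as a per-frame feature (standard autocorrelation-method LP; the function name is ours):

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def normalized_lp_error(frame, order=10):
        # Autocorrelation method: solve R a = r for the LP coefficients a; then
        # E_p = R(0) - a . r is the prediction-error energy, normalized by R(0).
        r = np.correlate(frame, frame, 'full')[len(frame) - 1:][:order + 1]
        a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
        e = r[0] - a @ r[1:order + 1]
        return e / (r[0] + 1e-12)

    # Its first- and second-order time derivatives would then be appended,
    # as for any other frame-level feature.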
 
SP1.6

   
Use of Periodicity and Jitter as Speech Recognition Features
D. Thomson, R. Chengalvarayan  (Lucent Technologies, USA)
We investigate a class of features related to voicing parameters that indicate whether the vocal cords are vibrating. Features describing the voicing characteristics of speech signals are integrated with an existing 38-dimensional feature vector consisting of the first- and second-order time derivatives of the frame energy, and of the cepstral coefficients together with their first and second derivatives. HMM-based connected-digit recognition experiments comparing the traditional and extended feature sets show that voicing features and spectral information are complementary and that improved speech recognition performance is obtained by combining the two sources of information.
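Periodicity and jitter can be sketched as follows (illustrative definitions under our own assumptions: periodicity as the peak normalized autocorrelation in the pitch range, jitter as the normalized cycle-to-cycle period variation):

    import numpy as np

    def periodicity_and_period(frame, sr, fmin=80, fmax=400):
        ac = np.correlate(frame, frame, 'full')[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + int(np.argmax(ac[lo:hi]))
        return ac[lag] / (ac[0] + 1e-9), lag / sr   # voicing degree, period (s)

    def jitter(periods):
        # Mean absolute cycle-to-cycle change, relative to the mean period.
        p = np.asarray(periods, dtype=float)
        return float(np.mean(np.abs(np.diff(p))) / (np.mean(p) + 1e-9))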
 
SP1.7

   
Accent Type Recognition and Syntactic Boundary Detection of Japanese Using Statistical Modeling of Moraic Transitions of Fundamental Frequency Contours
K. Hirose, K. Iwano  (University of Tokyo, Japan)
Experiments on accent-type recognition and syntactic-boundary detection for Japanese speech were conducted based on the statistical modeling of voice fundamental frequency contours previously proposed by the authors. In this modeling, fundamental frequency contours are segmented into moraic units to generate moraic contours, which are then represented by discrete codes. After modeling the accent types and syntactic boundaries, recognition/detection was performed on the ATR speech corpus. For accent-type recognition, 4-mora words were used for training and testing, and recognition rates of around 74% were obtained in speaker-open experiments. For syntactic-boundary detection, the detectability of accent-phrase boundaries was tested on sentence speech. Although these experiments were conducted only under the closed condition, owing to the limited availability of the speech corpus, the results indicated the usefulness of separating the boundary model into two models according to whether or not the boundary is accompanied by a pause.
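Once moraic F0 contours are coded into discrete symbols, accent-type recognition reduces to scoring the code sequence against one discrete HMM per accent type and picking the best. A generic forward-algorithm sketch (the model parameters and names are placeholders, not from the paper):

    import numpy as np
    from scipy.special import logsumexp

    def log_forward(obs, log_pi, log_A, log_B):
        # Forward algorithm: log p(obs | model) for a discrete-output HMM.
        alpha = log_pi + log_B[:, obs[0]]
        for o in obs[1:]:
            alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[:, o]
        return logsumexp(alpha)

    def recognize_accent_type(obs, models):
        # models: {accent_type: (log_pi, log_A, log_B)}; obs: moraic F0 codes.
        return max(models, key=lambda k: log_forward(obs, *models[k]))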
 
SP1.8

   
A Novel Feature-Extraction for Speech Recognition Based on Multiple Acoustic-Feature Planes
T. Nitta  (Toshiba Corporation, Japan)
In this paper, the author incorporates functions of the auditory nerve system into the feature extractor of a speech recognizer. The functions include four types of well-known responses to sound stimuli: local spectral peaks in steady sound, ascending FM sound, descending FM sound, and sharply rising and falling sound. Each function is realized as a three-level differential operator applied to the time-spectrum pattern X(t,f) output by a 26-channel band-pass filter (BPF) bank. The resulting acoustic cue of the input speech, represented by multiple acoustic-feature planes (MAFP), is compressed using the KLT and then classified. In experiments on a Japanese E-set (12 consonantal parts of /Ci/) extracted from continuous speech, the MAFP significantly improved the error rate for unknown speakers, from 34.5% with X(t,f) and 29.6% with X(t,f)+dX(t,f) to 17.0%.
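A minimal sketch of feature planes of this kind (our own kernels, not the paper's exact operators): oriented derivative masks over the time-spectrum pattern respond to steady spectral peaks, ascending and descending FM, and abrupt onsets/offsets, and the stacked planes are then compressed with the KLT (i.e., PCA).

    import numpy as np
    from scipy.signal import convolve2d

    # Oriented 3-level derivative kernels over X(t, f); rows = time, cols = freq.
    K_ONSET = np.array([[-1, 0, 1]]).T                       # rise/fall in time
    K_PEAK  = np.array([[-1, 0, 1]])                         # edge along frequency
    K_UP    = np.array([[-1, 0, 0], [0, 0, 0], [0, 0, 1]])   # ascending FM
    K_DOWN  = np.array([[0, 0, -1], [0, 0, 0], [1, 0, 0]])   # descending FM

    def mafp(X):
        # Stack the four oriented-response planes of the time-spectrum pattern X.
        ks = (K_ONSET, K_PEAK, K_UP, K_DOWN)
        return np.stack([convolve2d(X, k, mode='same') for k in ks], axis=-1)

    def klt(vectors, dim):
        # KLT/PCA: project mean-removed vectors onto the top eigenvectors.
        X = vectors - vectors.mean(axis=0)
        w, V = np.linalg.eigh(np.cov(X, rowvar=False))
        return X @ V[:, np.argsort(w)[::-1][:dim]]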
 

 
