ICASSP '98 Main Page


Abstract -  SP17   


 
SP17.1

   
Classification of Speech Under Stress Based on Features Derived from the Nonlinear Teager Energy Operator
G. Zhou, J. Hansen, J. Kaiser  (Duke University, USA)
Studies have shown that distortion introduced by stress or emotion can severely reduce speech recognition accuracy. Techniques for detecting or assessing the presence of stress could help neutralize stressed speech and improve the robustness of speech recognition systems. Although some acoustic variables derived from linear speech production theory have been investigated as indicators of stress, they are not consistent. In this paper, three new features derived from the nonlinear Teager Energy Operator (TEO) are investigated for stress assessment and classification. It is believed that TEO-based features are better able to reflect the nonlinear airflow structure of speech production under adverse stressful conditions. The proposed features outperform stress classification using traditional pitch by +22.5% for the Normalized TEO Autocorrelation Envelope Area feature (TEO-Auto-Env), and by +28.8% for the TEO-based Pitch feature (TEO-Pitch). Overall neutral/stress classification rates are more consistent for the TEO-based features (TEO-Auto-Env: standard deviation = 5.15, TEO-Pitch: standard deviation = 7.83) than for pitch (standard deviation = 23.40). Also, evaluation results using actual emergency aircraft cockpit stressed speech from NATO show that TEO-Auto-Env works best for stress assessment.
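As a rough illustration of the operator these features build on, the sketch below computes the standard discrete-time TEO, psi[n] = x[n]^2 - x[n-1]*x[n+1]. It does not reproduce the TEO-Auto-Env or TEO-Pitch features of the paper; the function name `teager_energy` and the test signal are assumptions made only for this example.

```python
import math

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1].

    The output is two samples shorter than the input because the
    first and last samples have no neighbour on one side.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# Illustrative signal (assumption): a pure sinusoid A*cos(omega*n).
amplitude, omega = 2.0, 0.3
signal = [amplitude * math.cos(omega * n) for n in range(50)]

psi = teager_energy(signal)

# For a pure sinusoid the TEO output is the constant A^2 * sin(omega)^2,
# so it tracks both amplitude and frequency of the oscillation.
expected = amplitude ** 2 * math.sin(omega) ** 2
```

For a signal that is not a single sinusoid, the TEO output varies over time, which is the kind of structure the TEO-based features above are said to capture.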
 
SP17.2

   
Improved Robustness for Speech Recognition Under Noisy Conditions Using Correlated Parallel Model Combination
J. Hung, J. Shen, L. Lee  (Institute of Information Science, Academia Sinica, Taiwan, ROC)
The parallel model combination (PMC) technique has been shown to achieve very good performance for speech recognition under noisy conditions. In this approach, the speech signal and the noise are assumed uncorrelated during modeling. In this paper, a new correlated PMC is proposed by properly estimating and modeling the nonzero correlation between the speech signal and the noise. Preliminary experimental results show that this correlated PMC can provide significant improvements over the original PMC in terms of both the model differences and the recognition accuracies. Error rate reduction on the order of 14% can be achieved.
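To make the uncorrelated assumption concrete, the sketch below uses only the textbook identity for the power of a sum of two zero-mean signals. It is not the authors' estimation procedure, and the function `combined_power` and the correlation coefficient `rho` are hypothetical names chosen for this illustration.

```python
import math

def combined_power(speech_power, noise_power, rho=0.0):
    """Power of (speech + noise) for zero-mean signals.

    rho is the correlation coefficient between speech and noise.
    rho = 0 recovers the usual additive-power (uncorrelated) assumption.
    """
    cross_term = 2.0 * rho * math.sqrt(speech_power * noise_power)
    return speech_power + noise_power + cross_term

# Illustrative values (assumptions, not data from the paper).
uncorrelated = combined_power(4.0, 1.0)            # cross term dropped
correlated = combined_power(4.0, 1.0, rho=0.25)    # cross term kept
```

With these values the cross term changes the combined power from 5.0 to 6.0, which suggests why ignoring a nonzero correlation could bias the combined model.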
 
SP17.3

   
Multi-Resolution Cepstral Features for Phoneme Recognition Across Speech Sub-Bands
P. McCourt, S. Vaseghi, N. Harte  (Queen's University of Belfast, N. Ireland)
Multi-resolution sub-band cepstral features strive to exploit discriminative cues in localised regions of the spectral domain by supplementing the full bandwidth cepstral features with sub-band cepstral features derived from several levels of sub-band decomposition. Multi-resolution feature vectors, formed by concatenation of the sub-band cepstral features into an extended feature vector, are shown to yield better performance than conventional MFCCs for phoneme recognition on the TIMIT database. Possible strategies for the recombination of partial recognition scores from independent multi-resolution sub-band models are explored. By exploiting the sub-band variations in signal-to-noise ratio for linearly weighted recombination of the log likelihood probabilities, we obtained improved phoneme recognition performance in broadband noise compared to MFCC features. This is an advantage over a purely sub-band approach using nonlinear recombination, which is robust only to narrow-band noise.
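The linearly weighted recombination mentioned above might look roughly like the following sketch. The abstract does not give the weighting formula, so `snr_weights` (clip at 0 dB, then normalise) is purely an assumption, as are the phoneme labels and scores.

```python
def snr_weights(snr_db):
    """Hypothetical weighting: clip SNRs at 0 dB, then normalise to sum to 1."""
    clipped = [max(s, 0.0) for s in snr_db]
    total = sum(clipped)
    if total == 0.0:
        # No band has positive SNR: fall back to equal weights.
        return [1.0 / len(snr_db)] * len(snr_db)
    return [c / total for c in clipped]

def recombine(subband_loglik, weights):
    """Linearly weighted sum of per-band log likelihoods for each phoneme."""
    phonemes = subband_loglik[0].keys()
    return {
        p: sum(w * band[p] for w, band in zip(weights, subband_loglik))
        for p in phonemes
    }

# Made-up log likelihoods from two sub-band models (illustrative only).
loglik = [
    {"aa": -10.0, "iy": -14.0},   # band 1, high SNR
    {"aa": -20.0, "iy": -12.0},   # band 2, low SNR
]

weights = snr_weights([30.0, 10.0])
scores = recombine(loglik, weights)
best = max(scores, key=scores.get)

# Equal weighting, for comparison.
equal = recombine(loglik, [0.5, 0.5])
best_equal = max(equal, key=equal.get)
```

In this toy case the SNR weighting trusts the cleaner band and picks "aa", whereas equal weighting picks "iy".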
 
SP17.4

   
Improved Model Parameter Compensation Method for Noise-Robust Speech Recognition
Y. Chang, Y. Chung, S. Park  (LGIC, Korea)
In this paper we study model parameter compensation methods for noise-robust speech recognition based on CDHMM. First, we propose a modified PMC method in which the adjustment term in the model parameter adaptation is varied depending on the mixture components of the HMM to obtain a more reliable model. A state-dependent association factor that controls the average parameter variability of the Gaussian mixtures and the variability of the respective mixtures is used to find the final optimum model parameters. Second, we propose a re-estimation solution for environmental variables with additive noise and spectral tilt, based on the expectation-maximization (EM) algorithm in the cepstral domain. The approach is based on the vector Taylor series (VTS) approximation. In our experiments on speaker-independent isolated Korean word recognition, the modified PMC shows better performance than Gales' PMC, and the proposed VTS is superior to both of them.
 
SP17.5

   
A Fuzzy Logic-Based Speech Detection Algorithm for Communications in Noisy Environments
A. Cavallaro, F. Beritelli, S. Casale  (University of Catania, Italy)
In the field of mobile communications, correct Voice Activity Detection (VAD) is crucial for the perceived speech quality, the reduction of co-channel interference, and the power consumption of portable equipment. This paper shows that a valid alternative for dealing with the problem of activity decision is to use methodologies like fuzzy logic, which are suitable for problems requiring approximate rather than exact solutions, and which can be presented through descriptive or qualitative expressions. The proposed Fuzzy Voice Activity Detector (FVAD) uses the same set of parameters adopted by the VAD in Annex B to ITU-T G.729 and a set of six fuzzy rules automatically extracted through supervised learning. Objective and listening tests confirm a significant improvement over traditional methods, above all at low signal-to-noise ratios.
 
SP17.6

   
Subband Based Classification of Speech under Stress
R. Sarikaya, J. Gowdy  (Clemson University, USA)
This study proposes a new set of feature parameters based on subband analysis of the speech signal for classification of speech under stress. The new speech features are Scale Energy (SE), Autocorrelation-Scale-Energy (ACSE), Subband based cepstral parameters (SC), and Autocorrelation-SC (ACSC). The parameters' ability to capture different stress types is compared to widely used Mel-scale cepstrum based representations: Mel-frequency cepstral coefficients (MFCC) and Autocorrelation-Mel-scale (AC-Mel). Next, a feedforward neural network is formulated for speaker-dependent stress classification of 10 stress conditions: Angry, Clear, Cond50/70, Fast, Loud, Lombard, Neutral, Question, Slow, and Soft. The classification algorithm is evaluated using a previously established stressed speech database (SUSAS)[4]. Subband based features are shown to achieve +7.3% and +9.1% increases in the classification rates over the MFCC based parameters for ungrouped and grouped stress closed vocabulary test scenarios, respectively. Moreover, the average scores across the simulations of the new features are +8.6% and +13.6% higher than MFCC based features for the ungrouped and grouped stress test scenarios, respectively.
 
SP17.7

   
Separation of Spontaneous and Non-Spontaneous Speech
O. Kenny  (Defense Science & Technology Organization, Australia);   D. Nelson, J. Bodenschatz, H. McMonagle  (Department of Defense, Fort Meade, MD, USA)
This paper outlines and compares three methods for automatically classifying spontaneous and non-spontaneous speech and presents experimental results comparing the performance of the methods. All three methods are based on an analysis of the probability distributions of prosodic features extracted from the speech signal. The first method uses an expansion of the probability distribution in terms of the statistical moments. The second method is an application of a modified Hellinger's method, and the third method is based on a measure of the non-Gaussianity of the data.
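The abstract does not describe how Hellinger's method is modified, so the sketch below shows only the standard Hellinger distance between two discrete distributions, for example normalised histograms of a prosodic feature. The histograms and the nearest-reference decision rule are assumptions for illustration.

```python
import math

def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions.

    Both inputs must be non-negative and sum to 1.
    Returns 0 for identical distributions and 1 for disjoint ones.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Made-up reference histograms of some prosodic feature (illustrative only).
reference = {
    "spontaneous": [0.1, 0.2, 0.4, 0.3],
    "non-spontaneous": [0.4, 0.4, 0.1, 0.1],
}
observed = [0.15, 0.25, 0.35, 0.25]

# Hypothetical decision rule: choose the class with the closest reference.
distances = {label: hellinger(observed, hist) for label, hist in reference.items()}
decision = min(distances, key=distances.get)
```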
 
SP17.8

   
Robust Features Derived from Temporal Trajectory Filtering for Speech Recognition under the Corruption of Additive and Convolutional Noises
K. Yuo, H. Wang  (National Tsing Hua University, Taiwan, ROC)
This paper presents a novel method using robust features for speech recognition when the speech signal is corrupted by additive and convolutional noises. The method is conceptually simple and easy to implement. The additive noise and the convolutional noise are removed by temporal trajectory filtering in the autocorrelation domain and the cepstral domain, respectively. A multi-speaker isolated digit recognition task is conducted to demonstrate the effectiveness of these robust features. The cases of channel-filtered speech corrupted by additive white noise and by colored noise are tested. Experimental results show that significant improvements can be achieved compared with some traditional features.
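The abstract does not specify the trajectory filter. As a minimal sketch of the idea in the cepstral domain, the example below uses mean subtraction, the simplest high-pass filter on a temporal trajectory. A fixed channel is convolutional in time and therefore adds a constant offset to each cepstral coefficient, which the filter removes. The frame values and the `channel` offset are made up for illustration.

```python
def remove_trajectory_mean(frames):
    """Subtract the time average of each cepstral coefficient.

    frames is a list of feature vectors, one per time frame. Removing
    the mean of each coefficient's trajectory cancels any offset that
    is constant over time, such as a fixed channel response.
    """
    n_frames = len(frames)
    n_dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]
    return [[f[d] - means[d] for d in range(n_dims)] for f in frames]

# Made-up cepstral frames (3 frames, 2 coefficients each).
clean = [[1.0, 0.5], [2.0, -0.5], [3.0, 1.5]]

# Hypothetical fixed channel: a constant offset in the cepstral domain.
channel = [0.7, -0.2]
channel_filtered = [[c + h for c, h in zip(frame, channel)] for frame in clean]

clean_out = remove_trajectory_mean(clean)
filtered_out = remove_trajectory_mean(channel_filtered)
```

After filtering, the clean and channel-filtered features coincide up to rounding, so the recogniser sees the same input regardless of the fixed channel.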
 
