ICASSP '98 - Abstracts - Session SP22

SP22.1
Weighted Viterbi Algorithm and State Duration Modelling for Speech Recognition in Noise
N. Yoma, F. McInnes, M. Jack (University of Edinburgh, Scotland, UK)
A weighted Viterbi algorithm for hidden Markov models (HMMs) is proposed and applied in combination with spectral subtraction and cepstral mean normalization to cancel both additive and convolutional noise in speech recognition. The weighted Viterbi approach is compared with, and used in combination with, state duration modelling. The results presented in this paper show that a proper weight on the information provided by the static parameters can substantially reduce the error rate, and that the weighting procedure improves the robustness of the Viterbi algorithm more than the introduction of temporal constraints does, at a low computational load. Finally, it is shown that the weighted Viterbi algorithm in combination with temporal constraints leads to high recognition accuracy at moderate SNRs without the need for an accurate noise model.
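
The essential modification can be sketched compactly: inside the Viterbi recursion, each frame's emission log-likelihood is scaled by a confidence weight, so unreliable (noisy) frames contribute less to the accumulated path score. The sketch below is a minimal NumPy illustration; the per-frame weight vector `w` is a placeholder, since the paper's actual weighting of the static-parameter information is not specified in the abstract.

```python
import numpy as np

def weighted_viterbi(log_A, log_pi, log_B, w):
    """Viterbi decoding with per-frame weights on the emission scores.

    log_A  : (S, S) log transition matrix
    log_pi : (S,)   log initial state probabilities
    log_B  : (T, S) per-frame emission log-likelihoods log b_j(o_t)
    w      : (T,)   confidence weight per frame (assumed; w_t -> 0
                    discounts frames whose observations are unreliable)
    """
    T, S = log_B.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + w[0] * log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (S, S): from i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + w[t] * log_B[t]
    # Backtrack the best state sequence
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return path
```

With w[t] = 1 for all frames this reduces to the standard Viterbi pass, which makes the weighting easy to retrofit into an existing decoder.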

SP22.2
Transmissions and Transitions: A Study of Two Common Assumptions in Multi-Band ASR
N. Mirghafori, N. Morgan (ICSI & UC Berkeley, USA)
Is multi-band ASR inherently inferior to a full-band approach because phonetic information is lost when the frequency space is divided into sub-bands? Do the phonetic transitions in sub-bands occur at different times? The first statement is a common objection raised by critics of multi-band ASR, and the second a common assumption of multi-band researchers. This paper is dedicated to answering both questions. To study the first point, we calculate phonetic feature transmission for sub-bands. Not only do we fail to substantiate the above objection, but we observe the contrary. We confirm the second hypothesis by analyzing the phonetic transition lags in each sub-band. These results reinforce our view that multi-band speech analysis provides useful information for ASR, particularly when band merging takes place at the end state of a phonetic or syllabic model, allowing sub-bands to be independently time-aligned within the model.
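
Feature transmission of this kind is classically computed, in the Miller-Nicely tradition, as the relative information transmitted by a recognizer's confusion matrix. The abstract does not spell out the exact measure, so the sketch below is only one standard choice, shown under that assumption.

```python
import numpy as np

def relative_transmission(confusions):
    """Relative information transmission I(X;Y) / H(X) from a
    confusion matrix (rows: presented category, cols: response).
    """
    p = confusions / confusions.sum()
    px = p.sum(axis=1)                     # marginal over presented
    py = p.sum(axis=0)                     # marginal over responses
    nz = p > 0
    mi = np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log2(px[px > 0]))
    return mi / hx
```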

SP22.3
A Recombination Model for Multi-Band Speech Recognition
C. Cerisara, J. Haton, J. Mari, D. Fohr (Loria, France)
In this paper, we describe a continuous speech recognition system that uses the multi-band paradigm. This principle is based on the recombination of several independent sub-recognizers, each one assigned to a specific frequency band. The major issue in such systems is deciding at which times the recombination must be done. Our algorithm keeps each band completely independent of the others and uses their different solutions to resegment the initial sentence. Finally, the bands are merged synchronously according to this new segmentation. The whole system is too complex to be described entirely here, so in this paper we concentrate on the synchronous recombination part, which is achieved by a classifier. The system has been tested in clean and noisy environments and proved to be especially robust to noise.
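
The synchronous recombination step can be pictured as follows: once the common resegmentation is fixed, each band contributes a score for every candidate unit in every segment, and a classifier merges them. The sketch below substitutes a simple weighted linear combiner for the paper's classifier, which the abstract does not describe; the band weights are assumptions.

```python
import numpy as np

def recombine_segments(band_loglik, band_weights):
    """Synchronous recombination of independent sub-band recognizers.

    band_loglik  : (B, K, C) array - for each of B frequency bands and
                   K segments (from the common resegmentation), the
                   log-likelihood of each of C candidate units
    band_weights : (B,) reliability weight per band (assumed; the
                   paper trains a classifier for this merging step)

    Returns the index of the best candidate unit for each segment.
    """
    # Weighted sum of per-band scores, segment by segment
    merged = np.einsum("b,bkc->kc", band_weights, band_loglik)
    return merged.argmax(axis=1)
```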

SP22.4
Incorporating Information from Syllable-length Time Scales into Automatic Speech Recognition
S. Wu, B. Kingsbury, N. Morgan, S. Greenberg (UC Berkeley & ICSI, USA)
Including information distributed over intervals of syllabic duration (100-250 ms) may greatly improve the performance of automatic speech recognition (ASR) systems. ASR systems primarily use representations and recognition units covering phonetic durations (40-100 ms). Humans certainly use information at phonetic time scales, but results from psychoacoustics and psycholinguistics highlight the crucial role of the syllable, and of syllable-length intervals, in speech perception. We compare the performance of three ASR systems: a baseline system that uses phone-scale representations and units, an experimental system that uses a syllable-oriented front-end representation and syllabic units for recognition, and a third system that combines the phone-scale and syllable-scale recognizers by merging and rescoring N-best lists. With the combined recognition system, the word error rate on telephone-bandwidth continuous numbers drops from 6.8% to 5.5% on a clean test set, and from 27.8% to 19.6% on a reverberant test set, relative to the baseline phone-based system.
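
The combination step can be illustrated as pooling the N-best hypotheses of both recognizers and rescoring each with an interpolation of the two systems' scores. A minimal sketch, with the interpolation weight `alpha` and the score-floor convention as assumptions (the abstract does not give the exact rescoring scheme):

```python
def combine_nbest(phone_scores, syll_scores, alpha=0.5, floor=-1e9):
    """Merge and rescore N-best lists from two recognizers.

    phone_scores, syll_scores : dict mapping hypothesis string ->
        log score from the phone-scale / syllable-scale system
    alpha : interpolation weight between the systems (assumed value)
    floor : log score used when a system did not propose a hypothesis
    """
    hyps = set(phone_scores) | set(syll_scores)   # merged N-best list
    rescored = {
        h: alpha * phone_scores.get(h, floor)
           + (1.0 - alpha) * syll_scores.get(h, floor)
        for h in hyps
    }
    return max(rescored, key=rescored.get)        # best combined hypothesis
```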

SP22.5
Towards Speech Rate Independence in Large Vocabulary Continuous Speech Recognition
F. Martínez, D. Tapias, J. Alvarez (Telefónica Investigación y Desarrollo, Spain)
In this paper we present a new speech rate classifier (SRC) that is based directly on the dynamic coefficients of the feature vectors and is suitable for real-time use. We also report the study carried out to determine which speech parameters are best suited to the speech rate classification problem; in this study we analyse the correlation between several speech parameters and the average speech rate of the utterance. Finally, we report a compensation technique that is used together with the SRC. This technique provides a word error rate (WER) reduction of 64.1% for slow speech and a 32% reduction in the average WER.
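
Since the classifier is based directly on the dynamic coefficients, one plausible minimal form is to threshold the average magnitude of the delta features over the utterance. The statistic and the threshold values below are illustrative assumptions, not the paper's trained classifier.

```python
import numpy as np

def classify_speech_rate(delta_feats, slow_thr=0.8, fast_thr=1.4):
    """Classify the speech rate of an utterance from its delta features.

    delta_feats : (T, D) array of dynamic (delta) cepstral coefficients
    slow_thr, fast_thr : decision thresholds on the mean delta magnitude
        (illustrative values; these would be tuned on labelled data)

    Faster speech changes spectrally more quickly, so the average
    magnitude of the dynamic coefficients tends to grow with rate.
    """
    rate_stat = np.mean(np.linalg.norm(delta_feats, axis=1))
    if rate_stat < slow_thr:
        return "slow"
    if rate_stat > fast_thr:
        return "fast"
    return "normal"
```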

SP22.6
Combining Multiple Estimators of Speaking Rate
N. Morgan, E. Fosler-Lussier (ICSI/UC Berkeley, USA)
We report progress in the development of a measure of speaking rate that is computed from the acoustic signal. The newest form of our analysis incorporates multiple estimates of rate: besides the spectral moment of a full-band energy envelope that we have previously reported, we also use the pointwise correlation between pairs of compressed sub-band energy envelopes. The complete measure, called mrate, has been compared to a reference syllable rate derived from a manually transcribed subset of the Switchboard database. Its correlation with the transcribed syllable rate is significantly higher than that of our earlier measure; estimates are typically within 1-2 syllables/second of the reference syllable rate. We conclude by assessing the use of mrate as a detector for rapid speech.
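
The first of the component estimators, the spectral moment of the full-band energy envelope, can be sketched as the centroid of the envelope's low-frequency modulation spectrum. The frame rate and the 16 Hz analysis band below are assumptions; the full mrate measure additionally combines the sub-band correlation estimates.

```python
import numpy as np

def envelope_spectral_moment(energy, frame_rate=100.0, f_max=16.0):
    """First spectral moment of an energy envelope, in Hz.

    energy     : (T,) per-frame energy of the signal (the envelope)
    frame_rate : frames per second of the envelope (assumed 100 Hz)
    f_max      : upper edge of the modulation band examined (assumed);
                 syllable rates live in the low modulation range

    The centroid of the envelope's modulation spectrum rises as
    syllables come faster, so it serves as a crude rate estimate.
    """
    env = energy - np.mean(energy)            # remove the DC component
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / frame_rate)
    band = (freqs > 0) & (freqs <= f_max)
    return np.sum(freqs[band] * spec[band]) / np.sum(spec[band])
```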

SP22.7
A Recursive Feature Vector Normalization Approach for Robust Speech Recognition in Noise
O. Viikki (Nokia Research Center, Finland); D. Bye (Nokia Mobile Phones, Finland); K. Laurila (Nokia Research Center, Finland)
The acoustic mismatch between testing and training conditions is known to severely degrade the performance of speech recognition systems. Segmental feature vector normalization [8] was found to improve the noise robustness of MFCC feature vectors and to outperform other state-of-the-art noise compensation techniques in speaker-dependent recognition. The objective of feature vector normalization is to provide environment-independent parameter statistics in all noise conditions. In this paper, we propose a more efficient implementation of feature vector normalization in which the normalization coefficients are computed recursively. Speaker-dependent recognition experiments show that the recursive normalization approach obtains an overall error rate reduction of over 60%, compared with approximately 50% for the segmental method and 14% for Parallel Model Combination. Moreover, in the recursive case, this performance gain is obtained at the lowest implementation cost. In speaker-independent connected digit recognition, too, an error rate reduction of over 16% is obtained with the proposed feature vector normalization approach.
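
The recursive scheme can be pictured as exponentially weighted running estimates of the feature mean and variance, applied to each incoming vector with no utterance buffering. A minimal sketch; the forgetting factor and the variance normalization details are assumptions based on the abstract's description.

```python
import numpy as np

class RecursiveNormalizer:
    """Online feature vector normalization with recursive statistics.

    Each incoming feature vector is normalized with running mean and
    variance estimates that are updated frame by frame, so no utterance
    buffering is needed (unlike the segmental method).
    """

    def __init__(self, dim, alpha=0.995):
        self.alpha = alpha                  # forgetting factor (assumed)
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)

    def __call__(self, x):
        a = self.alpha
        self.mean = a * self.mean + (1.0 - a) * x
        self.var = a * self.var + (1.0 - a) * (x - self.mean) ** 2
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```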

SP22.8
Some Solutions to the Missing Feature Problem in Data Classification, with Application to Noise-Robust ASR
A. Morris, M. Cooke, P. Green (University of Sheffield, UK)
We address the theoretical and practical issues involved in ASR when some of the observation data for the target signal is masked by other signals. Techniques discussed range from simple missing data imputation to Bayesian optimal classification. The Bayesian approach allows prior knowledge to be incorporated naturally into the recognition process, thereby permitting us to go beyond the simple "integrate over missing data" or "marginals" approach reported elsewhere, which we show to be inadequate for dealing with realistic patterns of missing data. These techniques are formulated in the context of an HMM-based CSR system. The scheme is evaluated under both random and more realistic patterns of missing data, with speech from the DARPA RM corpus and noise from NOISEX. We find that a key problem in real-world recognition with missing data is that efficient ASR requires data vector components to be independent, and incomplete data cannot be orthogonalised in the usual way by projection. We show that using only spectral peaks can provide an effective solution to this problem.
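
The simple "marginals" approach that the paper takes as its starting point has a compact form: with diagonal-covariance Gaussian state models, integrating the density over the missing components amounts to dropping their terms. A minimal sketch (the paper's Bayesian extensions with priors go beyond this):

```python
import numpy as np

def marginal_log_likelihood(x, present, mean, var):
    """Gaussian log-likelihood with missing feature components.

    x       : (D,) observed feature vector (missing entries arbitrary)
    present : (D,) boolean mask, True where the component is reliable
    mean, var : (D,) diagonal Gaussian state parameters

    With a diagonal covariance, marginalizing the density over the
    missing dimensions just drops their terms from the sum.
    """
    m, v, xo = mean[present], var[present], x[present]
    return -0.5 * np.sum(np.log(2.0 * np.pi * v) + (xo - m) ** 2 / v)
```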

SP22.9
A Study of Prior Sensitivity For Bayesian Predictive Classification Based Robust Speech Recognition
Q. Huo (University of Hong Kong, P R China); C. Lee (Bell Labs, USA)
We previously introduced a new Bayesian predictive classification (BPC) approach to robust speech recognition and showed that BPC is capable of coping with many types of distortion. We also learned that the efficacy of the BPC algorithm is influenced by the appropriateness of the prior distribution for the mismatch being compensated. If the prior distribution fails to characterize the variability reflected in the model parameters, then BPC will not help much. In this paper, we show how knowledge and/or experience of the interaction between the speech signal and the possible mismatch can guide us to a better prior distribution that improves the performance of the BPC approach.
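
The BPC idea replaces the plug-in likelihood with a predictive likelihood averaged over a prior on the model parameters. For a scalar Gaussian observation model with a Gaussian prior on its mean, the integral is closed-form, which makes the role of the prior spread explicit; the parameterization below is an illustrative assumption, not the paper's full model.

```python
import numpy as np

def predictive_log_likelihood(x, mu0, sigma2, tau2):
    """Bayesian predictive log-likelihood for a scalar Gaussian model.

    Observation model: x ~ N(mu, sigma2), with prior mu ~ N(mu0, tau2).
    Integrating the likelihood over the prior gives another Gaussian
    with inflated variance:  x ~ N(mu0, sigma2 + tau2).

    tau2 encodes how much mismatch is expected in the mean; choosing
    it well is precisely the prior-sensitivity question studied here.
    """
    v = sigma2 + tau2
    return -0.5 * (np.log(2.0 * np.pi * v) + (x - mu0) ** 2 / v)
```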