ABSTRACT
A blind channel equalization method called signal bias removal (SBR) has been proposed and shown to be effective in compensating for channel effects in telephone speech recognition. However, we found that the SBR method does not work well when additive noise and multiplicative distortion are present at the same time. In this paper, we propose a new method, modified signal bias removal (MSBR), to overcome this shortcoming of the SBR method. Experiments conducted to evaluate the MSBR method show that it outperforms SBR in a telephone speech recognition system, whether or not additive noise is present.
ABSTRACT
It is known that incorporating the temporal information of state durations into an HMM can achieve higher recognition performance. However, when a speech signal is contaminated by ambient noise, a state may occupy too many or too few frames in the decoded state sequence, even if state durations are adopted in the models. This phenomenon severely reduces the effectiveness of state duration modeling techniques. To overcome this problem, a proportional alignment decoding (PAD) method combined with state duration statistics is proposed and shown experimentally to be effective when the speech signal is distorted by ambient noise. Instead of the Viterbi decoding algorithm, the PAD method is used for state decoding in the retraining phase of a conventional HMM, producing a new set of state duration statistics. This state duration alignment scheme is more effective at preventing a state from occupying too many or too few frames in the recognition phase.
ABSTRACT
In this paper, a fast PMC (Parallel Model Combination) noise adaptation method is proposed for a continuous-HMM-based speech recognizer. The proposed method directly reduces the number of PMC computations by introducing distribution composition based on the spatial relation of distributions. The proposed method is compared with the basic PMC algorithm in recognition accuracy and adaptation processing time on telephone speech. The results show that the proposed method saves around 65% (62.7%-70.9%) of the PMC computation with almost no degradation of recognition performance.
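As background on the mechanism this paper accelerates, the basic PMC step combines a clean-speech Gaussian with a noise Gaussian under the log-normal approximation. The following is a minimal sketch of that combination in the log-spectral domain with diagonal covariances; it illustrates standard PMC, not the paper's fast variant, and the function and argument names are hypothetical.

```python
import numpy as np

def pmc_combine(mu_speech, var_speech, mu_noise, var_noise, gain=1.0):
    """Combine clean-speech and noise Gaussians (log-spectral means and
    diagonal variances) using the log-normal approximation of PMC."""
    def to_linear(mu, var):
        # log-normal moments: linear-domain mean and variance
        m = np.exp(mu + var / 2.0)
        v = m**2 * (np.exp(var) - 1.0)
        return m, v

    ms, vs = to_linear(mu_speech, var_speech)
    mn, vn = to_linear(mu_noise, var_noise)

    # additive combination in the linear spectral domain
    m = gain * ms + mn
    v = gain**2 * vs + vn

    # map the combined moments back to the log domain
    var_noisy = np.log(v / m**2 + 1.0)
    mu_noisy = np.log(m) - var_noisy / 2.0
    return mu_noisy, var_noisy
```

A full PMC implementation would additionally map between cepstral and log-spectral domains with the (inverse) cosine transform before and after this step.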
ABSTRACT
This report proposes a recognition module for use in CSCW that suffers little degradation in recognition performance even when more than one person speaks at the same time or speakers are at a distance from the microphone. This is accomplished by controlling directionality using a microphone array and estimating the transmission characteristics from speakers to microphones. On the basis of an evaluation performed by word spotting from continuous speech, this module was found to raise the recognition rate by (1) 30% in an environment where two people are speaking at the same time, and (2) 15% when people speak at a distance of 160 cm from a microphone.
ABSTRACT
Finding robust and computationally simple methods is a crucial factor in the practical application of telephone speech recognition. In this paper, we propose a new channel compensation method that applies a RASTA-like band-pass filter to the mel-frequency cepstral coefficients for robust telephone speech recognition. Experiments show that the proposed method, compared with RASTA processing, reduces the computational complexity without losing performance, and that it also outperforms CMS and two-level CMS. We also verify that suppressing very low modulation frequencies is an effective approach for robust telephone speech recognition.
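To make the comparison concrete, the two baseline ideas here can be sketched simply: CMS removes the 0 Hz modulation component (a stationary channel) by subtracting the time average of each cepstral coefficient, while a RASTA-like filter suppresses very low modulation frequencies with a filter applied along each MFCC trajectory. The sketch below uses a generic first-order high-pass filter as an illustration, not the paper's exact filter; the function names and `alpha` value are assumptions.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Per-utterance CMS: subtract the time average of each cepstral
    coefficient, removing a stationary (0 Hz modulation) channel."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def highpass_cepstra(cepstra, alpha=0.97):
    """First-order high-pass IIR along time on each MFCC trajectory:
    y[t] = x[t] - x[t-1] + alpha * y[t-1].
    Suppresses very low modulation frequencies, RASTA-style."""
    y = np.zeros_like(cepstra)
    y[0] = cepstra[0]
    for t in range(1, len(cepstra)):
        y[t] = cepstra[t] - cepstra[t - 1] + alpha * y[t - 1]
    return y
```

Both operate on a (frames x coefficients) matrix of MFCCs; the filtering variant needs no utterance-level statistics, which is part of its appeal for online telephone applications.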
ABSTRACT
Input speech to speech recognition systems may be contaminated not only by various ambient noises but also by irrelevant sounds generated by users, such as coughing, tongue clicking, mouth noises, and certain out-of-task utterances. The authors have developed a speech detection method using the likelihood of partial sentences to detect task utterances in speech contaminated with these irrelevant sounds. This paper describes the new speech detection method and reports on a field trial of speech recognition systems equipped with it.
ABSTRACT
Field evaluations of automatic speech recognition (ASR) systems clearly demonstrate the importance of efficient rejection procedures for filtering out out-of-vocabulary tokens. High-performance speech recognition systems also require efficient speech detection. This paper presents an original framework for a global evaluation of speech recognition systems that allows the speech detection module of an ASR system to be tuned. A global evaluation makes it possible to measure the performance of the speech recognition system from the user's point of view and to identify the weak modules of an ASR system. Global evaluations are carried out on PSN (Public Switched Network) and GSM (Global System for Mobile communications) databases. On the PSN database, global evaluation is used to choose the best value for the speech detector threshold. The results also show that, for this optimal value, the rejection of out-of-vocabulary words is currently the main problem to be solved in building high-performance speech recognition systems for large public telecommunication applications. On the GSM database, global evaluation is used to assess the benefits of speech enhancement before speech detection. Results show that using spectral subtraction as the speech enhancement technique before detection drastically improves speech detection and, consequently, global speech recognition.
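The spectral subtraction enhancement mentioned above is a standard technique; a minimal magnitude-domain sketch is given below. It is an illustration of the generic method under common assumptions (a fixed noise magnitude estimate and a spectral floor), not the specific variant evaluated in the paper; the function name and `floor` parameter are hypothetical.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.01):
    """Magnitude-domain spectral subtraction: subtract a noise
    magnitude estimate from each frame's spectrum, flooring the
    result to avoid negative magnitudes."""
    clean = noisy_mag - noise_mag
    # spectral floor proportional to the noisy magnitude limits
    # musical-noise artifacts from over-subtraction
    return np.maximum(clean, floor * noisy_mag)
```

In a detection front end, `noise_mag` is typically estimated from frames judged to be speech-free (e.g. the start of the recording), and the enhanced magnitudes feed the speech detector.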
ABSTRACT
We describe new methods for speaker-independent, continuous Mandarin speech recognition based on the IBM HMM-based continuous speech recognition system (1-3). First, we treat tones in Mandarin as attributes of certain phonemes, instead of syllables. Second, instantaneous pitch is treated as a variable in the acoustic feature vector, in the same way as cepstra or energy. Third, by designing a set of word-segmentation rules to convert continuous Chinese text into segmented text, an effective trigram language model is trained (4). By applying these new methods, a speaker-independent, very-large-vocabulary continuous Mandarin dictation system is demonstrated. Decoding results show that its performance is similar to the best results for US English.
ABSTRACT
The task of automatically transcribing general audio data is very different from the tasks usually confronted by current automatic speech recognition systems. The general goal of our work is to determine the optimal training strategy for recognizing such data. Specifically, we have studied the effects of different speaking environments on a phonetic recognition task using data collected from a radio news program. We found that if a single recognizer is to be used, it is more effective to train it on a smaller amount of homogeneous, clean data. This approach yielded a decrease in phonetic recognition error rate of over 26% relative to a system trained with an equivalent amount of data containing a variety of speaking environments. We found that additional gains can be made with a multiple-recognizer system trained with environment-specific data. Overall, this approach yielded a decrease in error rate of nearly 2%, with the error rates of some individual speaking environments decreasing by over 7%.
ABSTRACT
This paper presents the first published results for automatic recognition of continuous Cantonese speech with a very large vocabulary. The size of the vocabulary covered by this system is about the same as that encountered in the Hong Kong local Chinese newspaper, Wen Hui Bao. The system covers 6335 Chinese characters, and a large number of Chinese words can be formed by combining these characters. The input to the system is the endpointed speech waveform of a sentence or phrase; the output is the Big5-coded Chinese characters. In developing the recognition system, we devised new methods for 1) construction of a continuous Cantonese speech database, 2) lexical tone recognition in continuous Cantonese speech, and 3) integration of lexical tone and base syllable recognition results. The speaker-dependent recognition rates for Chinese characters, base syllables, and lexical tones are 90.94%, 94.73%, and 69.7%, respectively.
ABSTRACT
We refer to environment e as some combination of speaker, handset, transmission channel, and background noise condition, and regard any practical situation of a speech recognizer as a mixture of environments. A speech recognizer may be trained on multi-environment data. It may also need to adapt the trained acoustic models to new conditions. How to train an HMM with multi-environment data, and from what seed model to start an adaptation, are two questions of great importance. We propose a new solution to speech recognition based on separate modeling of phonetic variation and environment variations, for both training and adaptation. The problem is formulated as a hidden Markov process, where we assume:
- speech x is generated by canonical distributions (independent of environmental factors);
- an unknown linear transformation W_e and a bias b_e, specific to environment e, are applied to x with probability P(e);
- x cannot be observed; what we observe is the outcome of the transformation, o = W_e x + b_e.
Under the maximum-likelihood (ML) criterion, by application of the EM algorithm and an extension of Baum's forward and backward variables and algorithm, we obtained a novel joint solution for the parameters of the canonical distributions, the transformations, and the biases. For special cases, on a noisy telephone speech database, the new formulation is compared with the per-utterance cepstral mean normalization (CMN) technique and shows more than 20% word error rate improvement.
ABSTRACT
The development and evaluation of large-vocabulary, speaker-independent continuous speech recognition systems have mainly been done for American English. In this paper we present the work done to date in developing a hybrid large-vocabulary, speaker-independent continuous speech recognition system for European Portuguese. Because a sufficiently large speech and text database for developing such a system was lacking, we started collecting a large database and at the same time began developing a baseline system based on a smaller one. On this baseline system we applied techniques for automatic segmentation and labeling, in parallel with the development of a basic lexicon and language model for Portuguese. In the last part of this paper we also present the first steps of our work on the new database.
ABSTRACT
This paper describes an approach to identifying the reasons that speech recognition errors occur. The algorithm presented requires an accurate word transcript of the utterances being analyzed. It places each error into one of the following categories: 1) out-of-vocabulary (OOV) word spoken, 2) search error, 3) homophone substitution, 4) language model overwhelming correct acoustics, 5) transcript/pronunciation problems, 6) confused acoustic models, or 7) miscellaneous/not possible to categorize. Some categorizations of errors can supply training data to automatic corrective training methods that refine acoustic models. Other errors supply language model and lexicon designers with examples that identify potential improvements. The algorithm is described, and results on the combined 1992-1995 evaluation test sets of the North American Business (NAB) [1] [2] [3] corpus using the SphinxII recognizer [4] are presented.
ABSTRACT
Predicting speech recognition performance in place of expensive recognition experiments is a very useful approach for the research and development of speech recognition systems. In this paper, we propose a method to predict speech recognition performance when using new test data and/or a new acoustic model. Performance prediction tests showed that the proposed method can accurately predict recognition performance, thus saving a large amount of computational resources.
ABSTRACT
Voice Activity Detectors (VADs) are widely used in speech technology applications where available transmission or storage capacity is limited (e.g. mobile, DCME) and must be utilised with maximum economy. Modern digital speech coding algorithms can provide toll-quality speech at bit-rates as low as 8 kbit/s (e.g. ITU-T G.729), and the use of a VAD can achieve further economy in average bit-rate. This paper presents a modified version of the GSM VAD, for use with the ITU-T 8 kbit/s speech coding algorithm CS-ACELP, which makes an active/inactive decision for every 10 ms coding frame. The performance of the proposed voice activity detector is compared with that of the GSM VAD in terms of VAD errors and subjective quality. Results indicate that the modified VAD has similar performance to the standardised GSM VAD while operating with G.729 parameters and coding frame size.
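For readers unfamiliar with the per-frame decision a VAD makes, the toy sketch below illustrates the simplest form: an energy threshold relative to a noise estimate. Real VADs such as the GSM VAD also use spectral and periodicity cues plus hangover smoothing; the function name and threshold are hypothetical choices, not taken from either standard.

```python
import numpy as np

def frame_vad(frames, noise_energy, threshold_db=6.0):
    """Toy energy-based VAD: mark a coding frame active when its
    energy exceeds the noise-energy estimate by a dB margin.
    `frames` is a (num_frames x samples_per_frame) array."""
    energies = np.sum(frames**2, axis=1)
    margin = 10.0 ** (threshold_db / 10.0)  # dB margin -> linear ratio
    return energies > noise_energy * margin
```

At 8 kHz sampling, a 10 ms coding frame corresponds to 80 samples per row of `frames`.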
ABSTRACT
We describe the development of an R&D recognizer for several Spanish applications, starting from an existing recognition system for American English and modest language-specific resources. The experiments emphasize achieving phonetic accuracy on telephone speech without vocabulary-specific training. We use our basic recognition engine, and simple grammar-building tools for predicting word sequences. Only the read sentences from two telephone speech corpora (Voice Across Hispanic America (VAHA) and a smaller TI corpus) are used for training. Word error rates (WER) of 1.9% on telephone service command phrases, 5.5% on telephone numbers, and 12% on continuously spoken sentences are achieved with the newly ported system.
ABSTRACT
Many speech applications, most prominently telephone directory assistance, require the recognition of proper names. However, the recognition of increasingly large sets of spoken names is difficult: besides technical limitations, very large recognition vocabularies contain many easily confused words or even homophones. Therefore, proper names are often spelled, or both spoken and spelled. In this paper we compare the performance of proper name recognition when a name is spoken only, spelled only, or both spoken and spelled. In the latter case, information about the same name is provided in two different representations. We address methods to exploit this redundancy and propose techniques to handle the recognition of large lists of spoken and spelled proper names.
ABSTRACT
This paper proposes a technique for compensating both static and dynamic parameters of a continuous mixture density HMM to make it robust to noise. The technique is based on cepstral parameter generation from the HMM using dynamic parameters. The generated cepstral vector sequences of speech and noise are combined to yield a noisy speech cepstral vector sequence, and the dynamic parameters are calculated from the obtained sequence. Model parameters for the noisy speech HMM are obtained using the statistics of the noisy speech parameter sequences. We use the mixture transition probability to estimate the parameters of the compensated model. Experimental results show the effectiveness of the proposed technique in noisy speech recognition.
ABSTRACT
In this paper we study the influence of a sub-band adaptive filtering speech enhancement method on speech recognition systems in multi-source noisy environments, using a speaker microphone and a noise-reference microphone. In extensive experiments, the recognition score of a speaker-independent isolated-word speech recognition system based on a continuous density HMM (CDHMM) was measured in the presence of real-life noises at various SNRs. In all experiments, the results show an improvement in the mean recognition score when the sub-band adaptive filtering LMS method is used in comparison with the full-band LMS method. This improvement increases when changing types of noise distort the speech signal.
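The full-band LMS baseline referred to above is the classic two-microphone adaptive noise canceller; a minimal sketch is given below (the sub-band variant applies the same update independently in each frequency band after analysis filtering). This is a generic illustration, not the paper's implementation; the function name, tap count, and step size are assumptions.

```python
import numpy as np

def lms_noise_canceller(primary, reference, n_taps=16, mu=0.01):
    """Full-band LMS adaptive noise canceller: the noise-reference
    signal is filtered to predict the noise in the primary
    (speech + noise) channel; the prediction error is the
    enhanced speech."""
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]  # recent reference samples
        y = w @ x                          # predicted noise component
        e = primary[n] - y                 # enhanced-speech sample
        w += 2 * mu * e * x                # LMS weight update
        out[n] = e
    return out
```

Splitting both channels into sub-bands lets each band use a step size matched to its local noise power, which is the usual motivation for the sub-band variant.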
ABSTRACT
To improve the robustness of speech recognition in additive noisy environments, an SVD-based space transformation approach is proposed. It is shown that with this approach, not only is the signal-to-noise ratio improved, but a significant reduction in recognition errors is also achieved. A multiple-model scheme based on the proposed method is developed, and it can provide high recognition rates over a large range of SNRs. Recognition experiments on a speaker-dependent mono-syllabic database with additive noise show that this new approach significantly outperforms LPC cepstrum, MFCC, and OSALPC cepstrum.
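The intuition behind SVD-based approaches of this kind is that speech concentrates in a low-rank subspace while additive noise spreads over all singular directions, so truncating small singular values suppresses noise. The sketch below shows generic signal-subspace truncation, not the paper's specific transformation; the function name and rank choice are assumptions.

```python
import numpy as np

def svd_denoise(frames, rank):
    """Keep only the top `rank` singular components of a matrix of
    speech frames, discarding the noise-dominated subspace."""
    u, s, vt = np.linalg.svd(frames, full_matrices=False)
    s[rank:] = 0.0          # zero out the small singular values
    return (u * s) @ vt     # reconstruct the rank-limited matrix
```

In practice the rank is chosen from the singular-value spectrum or the estimated SNR, and features are then computed from the reconstructed frames.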
ABSTRACT
A new robust algorithm for isolated word recognition in low-SNR environments is proposed. The algorithm, called WSP, is described here for left-to-right models with no skips. It is shown that the algorithm outperforms the conventional HMM in the SNR range of 5 to 20 dB, and the PMC algorithm in the range 0 to -9 dB.