ABSTRACT
A method of modelling accent-specific pronunciation variation is presented. Speech from an unseen accent group is phonetically transcribed so that pronunciation variations can be derived. These context-dependent variations are clustered in decision trees, which serve as a model of the pronunciation variation associated with the new accent group. The trees are then used to build a new pronunciation dictionary for use during recognition. Experiments based on the Wall Street Journal and WSJCAM0 corpora are presented for the recognition of American speakers with a British English recogniser. Both speaker-independent and speaker-dependent adaptation scenarios are considered, giving up to a 20% reduction in word error rate. A linguistic analysis of the pronunciation model is presented, and finally the technique is combined with maximum likelihood linear regression, a well-proven acoustic adaptation technique, yielding further improvement.
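As a rough illustration of the clustering step, the sketch below trains a decision tree to predict an accented surface phone from its canonical phone and one phone of context on each side, then rewrites a canonical pronunciation for the dictionary. The toy alignments, the context window, and the use of scikit-learn are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: context-dependent pronunciation-variation tree.
# Alignments map (left context, canonical phone, right context) to the
# surface phone observed in the new accent; "-" marks a deletion.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

alignments = [  # toy data, illustrative only
    ({"left": "sil", "phone": "t", "right": "aa"}, "t"),
    ({"left": "t", "phone": "aa", "right": "m"}, "ae"),
    ({"left": "aa", "phone": "m", "right": "sil"}, "m"),
    ({"left": "aa", "phone": "r", "right": "sil"}, "r"),
    ({"left": "aa", "phone": "r", "right": "t"}, "-"),
]

vec = DictVectorizer()
X = vec.fit_transform(ctx for ctx, _ in alignments)
y = [surface for _, surface in alignments]
tree = DecisionTreeClassifier().fit(X, y)

def accentize(phones):
    """Rewrite a canonical pronunciation into its predicted accented form."""
    padded = ["sil"] + phones + ["sil"]
    out = []
    for i, p in enumerate(phones, start=1):
        ctx = {"left": padded[i - 1], "phone": p, "right": padded[i + 1]}
        out.append(tree.predict(vec.transform([ctx]))[0])
    return [p for p in out if p != "-"]  # drop deleted phones

print(accentize(["t", "aa", "m"]))  # -> ['t', 'ae', 'm']
```

In the same spirit as the paper, each accentized pronunciation would be added as a new entry in the recognition dictionary rather than replacing the canonical one.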
ABSTRACT
In this paper, the acoustic and linguistic characteristics of children's speech are investigated in the context of automatic speech recognition. Acoustic variability is identified as a major hurdle in building high-performance ASR applications for children. A simple speaker normalization algorithm combining frequency warping and spectral shaping, introduced in [5], is shown to reduce acoustic variability and significantly improve recognition performance for child speakers (by 25–45%). Age-dependent acoustic modeling further reduces word error rate by 10%. Piece-wise linear and phoneme-dependent frequency warping algorithms are proposed for reducing the acoustic mismatch between the child and adult acoustic spaces.
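The sketch below shows one common form of piece-wise linear frequency warping: frequencies are scaled by a factor alpha up to a breakpoint, then mapped linearly so the Nyquist frequency stays fixed. The alpha value, breakpoint fraction, and sampling assumptions here are illustrative, not the paper's settings.

```python
import numpy as np

def piecewise_linear_warp(freqs_hz, alpha, f_nyq=8000.0, break_frac=0.85):
    """Scale frequencies by alpha below a breakpoint, then interpolate
    linearly so that f_nyq maps onto itself (no spectrum truncation)."""
    f = np.asarray(freqs_hz, dtype=float)
    fb = break_frac * f_nyq
    slope = (f_nyq - alpha * fb) / (f_nyq - fb)
    return np.where(f <= fb, alpha * f, alpha * fb + slope * (f - fb))

# Warping the mel filterbank centre frequencies before feature extraction
# shifts a child's formant scale toward that of the adult-trained models.
centres = np.linspace(100.0, 7900.0, 24)
warped = piecewise_linear_warp(centres, alpha=0.88)
```

In practice a per-speaker alpha is typically chosen by a maximum-likelihood grid search over a small range of candidate warping factors.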
ABSTRACT
This paper deals with the problem of non-native accents in speech recognition. Reference tests were performed using whole-word and sub-word models trained either on a native accent or on a pool of native and non-native accents. The results seem to indicate that the use of phonetic transcriptions for each specific accent may improve recognition scores with sub-word models. A data-driven process is used to derive transcription lattices. The recognition scores thus obtained were encouraging.
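As a much-simplified sketch of the data-driven idea (a variant list rather than a full transcription lattice), one can run a phone recognizer over accented utterances of a word and keep the most frequent surface forms as alternative dictionary transcriptions. The toy observations and the relative-frequency cutoff are illustrative assumptions.

```python
# Keep surface forms that occur often enough as dictionary variants.
from collections import Counter

def derive_variants(observed, min_rel_freq=0.2):
    counts = Counter(tuple(p) for p in observed)
    total = sum(counts.values())
    return [list(v) for v, c in counts.most_common() if c / total >= min_rel_freq]

observed = [["z", "eh", "r", "ow"], ["z", "eh", "r", "ow"],
            ["z", "ih", "r", "ow"], ["s", "eh", "r", "ow"]]
print(derive_variants(observed))  # accepted variants for this word's entry
```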
ABSTRACT
In spontaneous conversational speech there is a large amount of variability due to accents, speaking styles and speaking rates (also known as the speaking mode) [3]. Because current recognition systems usually use only a relatively small number of pronunciation variants for the words in their dictionaries, the amount of variability that can be modeled is limited. Increasing the number of variants per dictionary entry is the obvious solution. Unfortunately, this also increases the confusability between dictionary entries, and thus often leads to an actual decrease in performance. In this paper we present a framework for speaking-mode-dependent pronunciation modeling. The probability of encountering a pronunciation variant is defined as a function of the speaking style. This probability function is learned with decision trees from rule-generated pronunciation variants as observed in the Switchboard corpus. The framework is successfully applied to significantly improve the performance of our state-of-the-art Janus Recognition Toolkit Switchboard recognizer.
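A minimal sketch of the learned probability function follows: a decision tree estimates how likely a rule-generated variant is to be realized, given a crude speaking-mode feature (speaking rate) and the identity of the generating rule. The data, feature set, and thresholds are illustrative assumptions, not the paper's Switchboard statistics.

```python
# Sketch: P(variant realized | speaking rate, rewrite rule) via a tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Rows: [speaking rate in phones/sec, id of the generating rewrite rule]
X = np.array([[ 9.5, 0], [14.2, 0], [15.0, 0],
              [10.1, 1], [16.3, 1], [ 8.8, 1]])
y = np.array([0, 1, 1, 0, 1, 0])  # 1 = reduced variant seen in alignment

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Probability that rule 0's variant fires in fast speech; such scores can
# weight the variant's dictionary entry at decoding time.
print(tree.predict_proba([[15.5, 0]])[0, 1])
```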
ABSTRACT
Speech disfluencies (filled pauses, repetitions, repairs, and false starts) are pervasive in spontaneous speech. The ability to detect and correct disfluencies automatically is important for effective natural language understanding, as well as for improving speech models in general. Previous approaches to disfluency detection have relied heavily on lexical information, which makes them less applicable when word recognition is unreliable. We have developed a disfluency detection method using decision tree classifiers that use only local, automatically extracted prosodic features. Because the model does not rely on lexical information, it remains applicable even when word recognition is unreliable. The model performed significantly better than chance at detecting four disfluency types. It also outperformed a language model in the detection of false starts, given the correct transcription. Combining the prosody model with a specialized language model improved accuracy over either model alone for the detection of false starts. These results suggest that a prosody-only model can aid the automatic detection of disfluencies in spontaneous speech.
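The sketch below shows the shape of such a prosody-only classifier: a decision tree over a small vector of automatically extractable prosodic features at each word boundary, with no lexical input. The feature choice and toy values are illustrative; the paper's exact feature set is not reproduced here.

```python
# Sketch: prosody-only disfluency detection with a decision tree.
from sklearn.tree import DecisionTreeClassifier

# One row per word boundary: [pause duration (ms), F0 slope, duration ratio]
X = [[450, -0.40, 1.8], [ 20,  0.05, 0.9], [600, -0.55, 2.1],
     [ 10,  0.10, 1.0], [380, -0.30, 1.6], [ 30,  0.00, 1.1]]
y = ["repair", "fluent", "false_start", "fluent", "filled_pause", "fluent"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[500, -0.45, 1.9]]))  # classification needs no words
```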
ABSTRACT
This paper presents a new training approach for improving recognition of speech produced under emotional and environmental stress. The proposed approach trains a speech recognizer with synthetically generated stressed speech, produced for each stress condition using the stress perturbation models formulated in [4, 1]. These perturbation models statistically characterize the parameter variations under angry, loud, and Lombard-effect conditions, and were employed in an analysis-synthesis scheme to generate stressed synthetic speech from isolated neutral speech. Two training approaches employing the synthetically generated stressed speech are presented: a speaker-independent and a speaker-adaptive training method. Both approaches outperform neutral-trained recognizers when tested with angry, loud, and Lombard-effect speech.
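As a crude illustration of the perturbation idea, the sketch below applies two effects associated with loud and Lombard speech to a neutral waveform: an overall energy boost and an upward spectral tilt. The gain and tilt values are illustrative assumptions and stand in for, rather than reproduce, the statistical perturbation models of [4, 1].

```python
# Sketch: generate "stressed" training speech by perturbing neutral speech.
import numpy as np
from scipy.signal import lfilter

def stress_perturb(x, gain_db=6.0, tilt=0.6):
    """Boost energy and tilt the spectrum upward (pre-emphasis)."""
    y = lfilter([1.0, -tilt], [1.0], x)   # y[n] = x[n] - tilt * x[n-1]
    return y * 10.0 ** (gain_db / 20.0)   # overall energy boost

neutral = np.random.randn(16000)          # stand-in for 1 s of neutral speech
stressed = stress_perturb(neutral)        # added to the training pool
```

Training then proceeds as usual, with the synthetic stressed tokens pooled with (speaker-independent) or adapted to (speaker-adaptive) the neutral data.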