Isolated Word Recognition

ICSLP'98 Proceedings

Improving Accuracy of Telephony-based, Speaker-Independent Speech Recognition

Authors:

Daniel Azzopardi, BT Labs (U.K.)
Shahram Semnani, BT Labs (U.K.)
Ben Milner, BT Labs (U.K.)
Richard Wiseman, BT Labs (U.K.)

Page (NA) Paper number 543

Abstract:

A combination of techniques for increasing recognition accuracy has been developed for an automated corporate directory system with 120,000 entries. Using a traditional recogniser, an accuracy of around 60% was previously obtained for both a 156 town name task and a 1108 road name task. The techniques presented in this paper comprise front-end modifications, context dependent models, an improved lexicon and noise modelling. Together they increased recognition accuracy to around 90%.

SL980543.PDF (From Author) SL980543.PDF (Rasterized)


Rejection in Speech Recognition Systems with Limited Training

Authors:

Aruna Bayya, Rockwell Semiconductor Systems, Newport Beach, CA (USA)

Page (NA) Paper number 572

Abstract:

In this paper, we propose a new rejection criterion applicable specifically to limited-training speech recognition systems such as Speaker-Dependent (SD) recognition systems. The new criterion uses confidence measures as well as speaker-specific out-of-vocabulary (OOV) models. The OOV models are created from the same training data that is available to create the in-vocabulary (IV) word models. We describe the method for creating these speaker-specific OOV models from limited training data. We also define a fairly robust confidence measure to reject the OOV words. The results presented in this paper demonstrate the effectiveness of the new criterion in an SD recognition task under various conditions.

SL980572.PDF (From Author) SL980572.PDF (Rasterized)


A Four Layer Sharing HMM System For Very Large Vocabulary Isolated Word Recognition

Authors:

Ruxin Chen, SONY Research Labs, San Jose (USA)
Miyuki Tanaka, SONY Research Labs, San Jose (USA)
Duanpei Wu, SONY Research Labs, San Jose (USA)
Lex Olorenshaw, SONY Research Labs, San Jose (USA)
Mariscela Amador, SONY Research Labs, San Jose (USA)

Page (NA) Paper number 583

Abstract:

This paper reports on a large vocabulary, speaker independent, isolated word recognizer targeting 50,000 words. The system supports a unique four-layer sharing structure for either continuous HMM or discrete HMM. Evaluation is performed using a dictionary of 5000 US city names, a dictionary of the 5000 most frequent English words, a dictionary of 50,000 English words, and the 110,000 word CMU English dictionary. For these dictionaries, recognition accuracy ranges from 90% to 93% for the top 3 results.
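The abstract reports accuracy "for the top 3 results". As a minimal sketch of how such a top-N figure is usually computed, assuming each test utterance yields a ranked list of word hypotheses (the function name, data layout and sample city names below are illustrative, not taken from the paper):

```python
def top_n_accuracy(ranked_hypotheses, references, n=3):
    """Fraction of utterances whose reference word is among the first n hypotheses.

    ranked_hypotheses: list of lists, best hypothesis first (assumed layout).
    references: list of the correct words, one per utterance.
    """
    hits = sum(
        1
        for hyps, ref in zip(ranked_hypotheses, references)
        if ref in hyps[:n]
    )
    return hits / len(references)


# Hypothetical recogniser output for three utterances.
ranked = [
    ["boston", "austin", "houston"],
    ["dallas", "dulles", "dalton"],
    ["reno", "fresno", "provo"],
]
refs = ["austin", "dallas", "provo"]
top3 = top_n_accuracy(ranked, refs, n=3)
top1 = top_n_accuracy(ranked, refs, n=1)
```

A top-3 score counts an utterance as correct whenever the reference appears anywhere in the first three hypotheses, so it is always at least as high as the top-1 score on the same data.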

SL980583.PDF (From Author) SL980583.PDF (Rasterized)


A Comparative Study Of Hybrid Modelling Techniques For Improved Telephone Speech Recognition

Authors:

Rathinavelu Chengalvarayan, Lucent Technologies (USA)

Page (NA) Paper number 22

Abstract:

This paper presents a new technique for modelling heterogeneous data sources such as speech signals received via distinctly different channels. Such data arise when an automatic speech recognition system is deployed in wireless telephony, where highly heterogeneous channels coexist and interoperate. The key problem is that a simple model may become inadequate to describe the diversity of the signal accurately, resulting in unsatisfactory recognition performance. To cope with this problem, different hybrid modelling techniques are proposed and investigated in this paper by intelligently combining models from two different environments, wireline and wireless.

SL980022.PDF (From Author) SL980022.PDF (Rasterized)


Smoothing and Tying for Korean Flexible Vocabulary Isolated Word Recognition

Authors:

Jae-Seung Choi, LG Corporate Institute of Technology (Korea)
Jong-Seok Lee, LG Corporate Institute of Technology (Korea)
Hee-Youn Lee, LG Corporate Institute of Technology (Korea)

Page (NA) Paper number 623

Abstract:

For large vocabulary recognition systems, as well as for flexible vocabulary applications using hidden Markov models (HMMs), parameter smoothing and tying have been used to increase the reliability of models. This paper describes bottom-up and top-down clustering techniques for state level tying. This paper also describes a method of applying parameter smoothing to the clustered states and covariance matrix of the semicontinuous hidden Markov model (SCHMM). We applied the co-occurrence smoothing method (CSM) for senone smoothing. We present a new parameter smoothing method and apply it to the distribution of the discrete hidden Markov model (DHMM) in the training procedure. A new model composition method for unseen triphone modeling in bottom-up clustering is also proposed and compared with the traditional context-independent model backing-off method.

SL980623.PDF (From Author) SL980623.PDF (Rasterized)


Recent Work on a Preselection Module for a Flexible Large Vocabulary Speech Recognition System in Telephone Environment

Authors:

Javier Ferreiros, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Javier Macías-Guarasa, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Ascensión Gallardo, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
José Colás, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Ricardo Córdoba, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
José Manuel Pardo, Grupo de Tecnología del Habla. Departamento de Ingeniería Electrónica. ETSI Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Luis Villarrubia, Telefonica Investigacion y Desarrollo (Spain)

Page (NA) Paper number 987

Abstract:

At ICSLP'96 we presented a flexible, large vocabulary, speaker independent, isolated-word preselection system in a telephone environment, using a two stage, bottom-up strategy. We achieved reasonable performance in large and very large vocabulary tasks, ranging from 1200 to 10000 words. In this paper, we describe recent studies we have carried out on the system, aimed in two directions: handling non-speech sounds in the speech signal (we consider lip, respiration and click noises); and making the preselection lists dynamic in length, to reduce the computational load on average. In the first case, we want to model non-speech sounds, as these effects are crucial in real-life situations, leading to wrong endpointing and increased error rates. In the second, we are interested in integrating any available system parameter to calculate the preselection list length to use, and have applied both parametric and non-parametric methods.

SL980987.PDF (From Author) SL980987.PDF (Rasterized)


A Study of Noise Robustness for Speaker Independent Speech Recognition Method Using Phoneme Similarity Vector

Authors:

Masakatsu Hoshimi, Matsushita Research Institute Tokyo, Inc., Tohoku University (Japan)
Maki Yamada, Matsushita Research Institute Tokyo, Inc. (Japan)
Katsuyuki Niyada, Central Research Labs., Matsushita Electric Industrial Co., Ltd. (Japan)
Shozo Makino, Tohoku University (Japan)

Page (NA) Paper number 257

Abstract:

As an input method for rapidly spreading small portable information devices, speaker independent speech recognition technology that can be embedded on a single DSP is now urgently needed. We have reported a speech recognition method using a phoneme similarity vector as the feature vector, which is quite robust to reduced precision of the feature parameters. We have also developed a recognition board with a single DSP, which handles a 100-word vocabulary using only the internal memory of the DSP [1][2]. In this report, we propose a new technique that makes our recognition method more robust: a newly introduced noise standard template, used together with the traditional phoneme standard templates for calculating the phoneme similarity vector, realizes precise word-spotting. When the newly proposed noise robustness method was tested on speech from 50 subjects over a 100 isolated word vocabulary, a recognition accuracy of 94.7% was obtained under various noisy environments.

SL980257.PDF (From Author) SL980257.PDF (Rasterized)


Classification of Taiwanese Tones Based on Pitch and Energy Movements

Authors:

Fran H.L. Jian, Dept. Linguistics, University of Reading (U.K.)

Page (NA) Paper number 146

Abstract:

This paper addresses the difficulties associated with automatically distinguishing the seven Taiwanese tones. The tone recogniser is an essential component of any automatic speech recognition system customised for tone languages such as Taiwanese. We show that it is difficult to distinguish between the Taiwanese tones employing the fundamental frequency contours alone, and that the task is simplified by employing energy contour features in addition to the fundamental frequency features. To allow energy to be accommodated into the classification model, an energy-contour feature extraction approach is presented. The proposed approach is inspired by the ADSR model employed in musical instrument synthesis, where the envelopes of complex sounds are modeled employing only a few parameters. Our experiments demonstrate that the inclusion of energy into the recognition model allows the seven Taiwanese tones to be discriminated successfully. The paper also presents acoustical measurements of the fundamental frequency and energy features described

SL980146.PDF (From Author) SL980146.PDF (Rasterized)


Phoneme-Based Recognition for the Norwegian SpeechDat(II) Database

Authors:

Finn Tore Johansen, Telenor R&D (Norway)

Page (NA) Paper number 889

Abstract:

This paper presents results from a number of flexible vocabulary recognition experiments on the Norwegian SpeechDat(II) database. A common phoneme-based recogniser design procedure is tested on five different tasks, and for five different training sets. Results verify that reasonably accurate recognisers can be built with the database, using standard HMM techniques. They also quantify the importance of training set selection for small and medium vocabulary tasks.

SL980889.PDF (From Author) SL980889.PDF (Rasterized)


Robust Feature Extraction for Alphabet Recognition

Authors:

Montri Karnjanadecha, Old Dominion University (USA)
Stephen A. Zahorian, Old Dominion University (USA)

Page (NA) Paper number 1110

Abstract:

Spectral/temporal segment features are adapted for isolated word recognition and tested with the entire English alphabet set using Hidden Markov Models. The ISOLET database from OGI and the HTK toolkit from Cambridge University were used to test our feature extraction technique. With our feature set we were able to achieve 97.3% recognition accuracy on test data with one pass using a whole word based recognizer. Gaussian noise was also added to evaluate the robustness of the feature set. We were able to obtain recognition accuracies of 49.6% and 84.3% at SNRs of -10 dB and 0 dB, respectively. Linear discriminant analysis was also applied to the initial feature set for a number of feature configurations and noise levels but, generally, the performance was not improved. We conclude that the initial feature computations used are both very efficient (best results obtained with 50 total features) and robust in the presence of noise.
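The robustness test above depends on adding Gaussian noise at a fixed signal-to-noise ratio. The abstract does not say how the noise was scaled, so the following is only a sketch of the standard approach, in which the noise variance is set to the signal power divided by 10^(SNR/10). All function names and the test tone are assumptions for illustration.

```python
import math
import random


def signal_power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise at the requested SNR in dB."""
    rng = random.Random(seed)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power(signal) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """SNR actually realised, computed from the clean and noisy signals."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(signal_power(clean) / signal_power(noise))


# Illustrative input: one second of a 440 Hz tone sampled at 8 kHz.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(8000)]
noisy_0db = add_gaussian_noise(clean, 0.0)
noisy_minus10db = add_gaussian_noise(clean, -10.0)
```

At 0 dB the noise carries as much power as the signal, and at -10 dB it carries ten times as much, which gives a sense of how severe the two reported conditions are.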

SL981110.PDF (From Author) SL981110.PDF (Rasterized)


Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network

Authors:

Hisashi Kawai, KDD R&D Laboratories Inc. (Japan)
Norio Higuchi, KDD R&D Laboratories Inc. (Japan)

Page (NA) Paper number 694

Abstract:

This paper describes experimental results on whole word HMM-based speech recognition of connected digits in Japanese collected through the telephone network. The training data comprises 756860 digits uttered by 1963 speakers, while the testing data comprises 304212 digits uttered by 852 speakers. The best performance was a word error rate of 0.42% for known length strings obtained using context dependent models. The word error rate was measured as a function of the training data size. The result showed that at least 3302 samples per speaker and 344 speakers are necessary and sufficient for context independent training. Error analysis was conducted on a fraction of the population bearing the major part of recognition errors. The results suggested that such speakers arise not simply from speaker characteristics but from a combination of speaker characteristics and environmental conditions.

SL980694.PDF (From Author) SL980694.PDF (Rasterized)


Improving the Speaker-Dependency of Subword-Unit-Based Isolated Word Recognition

Authors:

Takuya Koizumi, Dept. of Information Science, Fukui University (Japan)
Shuji Taniguchi, Dept. of Information Science, Fukui University (Japan)
Kazuhiro Kohtoh, Dept. of Information Science, Fukui University (Japan)

Page (NA) Paper number 51

Abstract:

This paper deals with a subword-unit-based isolated word recognition system with enhanced speaker-independency. The subword is defined as a part of a word whose central portion has rather stationary or time-invariant short-time spectra, while the portions near its ends have rapidly varying short-time spectra. In this system each isolated word is decomposed into a sequence of subwords, each of which is identified by means of a particular semi-continuous hidden Markov model that is named a subword HMM. Each isolated word is recognized by a particular set of concatenated subword HMMs that is designated as a word HMM. Subword boundaries within a word are detected by finding peaks of the magnitude of delta cepstral vectors obtained from the word. The system attains average word recognition rates over 87% for a number of Japanese words uttered by ten native male speakers.
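The boundary detection step described above, finding peaks in the magnitude of the delta cepstral vectors, can be sketched as follows. The abstract does not give the delta window or the peak-picking rule, so this sketch assumes a two-frame central difference and a simple thresholded local maximum; the function names and the threshold value are hypothetical.

```python
import math


def delta_magnitudes(cepstra):
    """Per-frame magnitude of the delta cepstrum.

    cepstra: list of frames, each a list of cepstral coefficients.
    The delta is a central difference (c[t+1] - c[t-1]) / 2, with the
    first and last frames repeated at the edges (an assumed convention).
    """
    n = len(cepstra)
    mags = []
    for t in range(n):
        prev = cepstra[max(t - 1, 0)]
        nxt = cepstra[min(t + 1, n - 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        mags.append(math.sqrt(sum(d * d for d in delta)))
    return mags


def find_boundaries(cepstra, threshold):
    """Frame indices where the delta magnitude has a local peak above threshold."""
    mags = delta_magnitudes(cepstra)
    return [
        t
        for t in range(1, len(mags) - 1)
        if mags[t] >= threshold and mags[t] > mags[t - 1] and mags[t] >= mags[t + 1]
    ]


# Toy one-coefficient "cepstra": a steady region, a transition, another steady region.
frames = [[0.0], [0.0], [0.0], [0.5], [1.0], [1.0], [1.0]]
boundaries = find_boundaries(frames, threshold=0.3)
```

In the toy input the spectrum changes fastest at frame 3, so that frame is returned as the single boundary, while the stationary frames on either side correspond to the subword centres.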

SL980051.PDF (From Author) SL980051.PDF (Scanned)


Speaker Independent Speech Recognition Method using Constrained Time Alignment near Phoneme Discriminative Frame

Authors:

Tomohiro Konuma, Matsushita Research Institute Tokyo, Inc. (Japan)
Tetsu Suzuki, Matsushita Research Institute Tokyo, Inc. (Japan)
Maki Yamada, Matsushita Research Institute Tokyo, Inc. (Japan)
Yoshio Ohno, Matsushita Research Institute Tokyo, Inc. (Japan)
Masakatsu Hoshimi, Matsushita Research Institute Tokyo, Inc., Tohoku University (Japan)
Katsuyuki Niyada, Matsushita Electric Industrial Co., Ltd. (Japan)

Page (NA) Paper number 198

Abstract:

We present constrained time alignment acoustic models based on phonetic knowledge and a speaker independent speech recognition method using the proposed models. Japanese syllable and isolated word recognition experiments show that the models are robust to intra- and inter-speaker variability such as acoustic diversity. Furthermore, word recognition tests under conditions such as noisy environments and endpoint-free matching demonstrate the feasibility of the proposed models.

SL980198.PDF (From Author) SL980198.PDF (Rasterized)


A Nonstationary Autoregressive HMM With Gain Adaptation For Speech Recognition

Authors:

Ki Yong Lee, Soongsil University (Korea)
Joohun Lee, Dong-Ah Broadcasting College (Korea)

Page (NA) Paper number 408

Abstract:

In this paper, a time domain approach for speech recognition is developed. The nonstationary autoregressive (AR) hidden Markov model (HMM) with gain contour is proposed for modeling the statistical characteristics of the speech signal. The parameters of the nonstationary AR model are modeled by a polynomial function expressed as a linear combination of M known basis functions. In the proposed model, the speech signal is blocked into fixed-length frames of samples, and each frame is modeled by a nonstationary AR model controlled by Markov switching sequences. Given the HMM parameter set of the speech, a gain-adapted recognition algorithm is developed for speech recognition.

SL980408.PDF (From Author) SL980408.PDF (Rasterized)


A Large-Vocabulary Taiwanese (MIN-NAN) Multi-Syllabic Word Recognition System Based Upon Right-Context-Dependent Phones with State Clustering by Acoustic Decision Tree

Authors:

Ren-yuan Lyu, Chang Gung University (Taiwan)
Yuang-jin Chiang, National Tsing Hua University (Taiwan)
Wen-ping Hsieh, National Tsing Hua University (Taiwan)

Page (NA) Paper number 80

Abstract:

In this paper, we apply context dependent phonetic modeling to the task of large vocabulary (20 thousand words) Taiwanese multi-syllabic word recognition. Considering the phonetic characteristics of Taiwanese, right context dependent (RCD) phones are used instead of general tri-phones. The RCDs are further clustered at the sub-phone or state level using a decision tree with a set of context-split questions specially designed for Taiwanese speech according to acoustic/phonetic knowledge. For the speaker dependent case, a 7.18% word error rate is achieved. A real-time prototype system implemented on a Pentium-II personal computer running MS-Windows 95/NT is also shown to validate the approaches proposed here.

SL980080.PDF (From Author) SL980080.PDF (Rasterized)


Speech Recognition Based on the Distance Calculation Between Intermediate Phonetic Code Sequences in Symbolic Domain

Authors:

Kazuyo Tanaka, Electrotechnical Laboratory (Japan)
Hiroaki Kojima, Electrotechnical Laboratory (Japan)

Page (NA) Paper number 966

Abstract:

This paper proposes a speech recognition method as an alternative to the conventional sample-based statistical methods, which are characterized by the need for large amounts of training speech data. To avoid this type of heavy processing, the proposed method employs an intermediate phonetic code system and the calculation of distance between phonetic code sequences in the symbolic domain. It achieves high efficiency when compared with direct processing of acoustic correlates, although some deterioration is to be expected in recognition scores. We first describe the distance calculation method and present specific procedures for obtaining the intermediate code sequence from input utterances and for spotting words using the calculation of distance in the symbolic domain. Preliminary experiments were conducted on isolated word recognition and phrase spotting in continuous speech. Word recognition results indicate that the recognition scores obtained by the proposed method are comparable to those of ordinary phone-HMM-based speech recognition.
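The abstract does not define its distance between phonetic code sequences. As a stand-in to illustrate the general idea of matching in the symbolic domain, the sketch below uses a weighted edit distance (Levenshtein distance with a pluggable substitution cost) and picks the lexicon entry nearest to the decoded code sequence. The function names, costs and toy lexicon are assumptions, not the authors' method.

```python
def code_distance(a, b, sub_cost=None, indel_cost=1.0):
    """Weighted edit distance between two sequences of phonetic codes."""
    if sub_cost is None:
        # Default: unit cost for any mismatch. A real system would likely
        # use a cost table reflecting acoustic similarity between codes.
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            cur.append(
                min(
                    prev[j] + indel_cost,          # delete x from a
                    cur[j - 1] + indel_cost,       # insert y into a
                    prev[j - 1] + sub_cost(x, y),  # substitute or match
                )
            )
        prev = cur
    return prev[-1]


def recognise(codes, lexicon):
    """Return the lexicon word whose code sequence is closest to codes."""
    return min(lexicon, key=lambda word: code_distance(codes, lexicon[word]))


# Toy lexicon: letters stand in for intermediate phonetic codes.
lexicon = {"tokyo": list("tokyo"), "kyoto": list("kyoto")}
best = recognise(list("tokyu"), lexicon)
```

Because the comparison runs over short symbol strings rather than frame-level acoustic scores, each lexicon entry costs only a small dynamic-programming table, which is consistent with the efficiency argument made in the abstract.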

SL980966.PDF (From Author) SL980966.PDF (Rasterized)


High Accuracy Chinese Speech Recognition Approach with Chinese Input Technology for Telecommunication Use

Authors:

York Chung-Ho Yang, Matsushita Electric Institute of Technology, Taipei (Taiwan)
June-Jei Kuo, Matsushita Electric Institute of Technology, Taipei (Taiwan)

Page (NA) Paper number 505

Abstract:

Phoneme-oriented input with synchronised auto-revision for the pictographic nature of written Chinese and Chinese speech recognition are two subjects not often brought together in the same article, nor even in the same proposition. This paper explores the growing relations between these two entities and, in particular, investigates what is found in integrating Chinese phonetic input and its auto-revision methods (Kuo; 1986, 1987, 1995, 1996) with Chinese isolated word and continuous speech recognition for portable devices such as mobile telephones. A Chinese phonetic input method with a synchronised auto-revision approach, integrated with a small, high recognition rate Chinese speech recognition kernel for single-chip DSP applications, is introduced in this paper. Chinese phrase taxonomy has been defined and the definition is ready to be obtained from the system's dictionary.

SL980505.PDF (From Author) SL980505.PDF (Rasterized)
