Language and Speaker Identification


Robust Spoken Language Identification using Large Vocabulary Speech Recognition

Authors:

James L. Hieronymus, Bell Labs, Murray Hill, NJ (U.S.A.)
Shubha Kadambe, Atlantic Aerospace, Greenbelt, MD (U.S.A.)

Volume 2, Page 1111

Abstract:

This paper describes a robust, task-independent spoken Language Identification (LID) system that uses a Large Vocabulary Continuous Speech Recognition (LVCSR) module for each language to choose the most likely language spoken. The acoustic analysis applies cepstral mean removal to mel-scale cepstral coefficients to compensate for different input channels. The system has been trained on five languages (English, German, Japanese, Mandarin Chinese and Spanish) using a subset of the Oregon Graduate Institute 11-language database. Without the robust front end, the five-language results show 88% correct recognition for 50-second utterances without confidence measures and 98% correct with confidence measures. For 10-second utterances, the recognition rate is 81% correct without confidence measures and 93% correct with confidence measures, again without the robust front end. Adding the robust front end improves the recognition rate by approximately 3% on the short utterances and 1% on the long utterances. The best performance has been obtained for systems trained on phonetically hand-labeled speech.
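
As an illustration of the channel compensation step mentioned in the abstract, the following is a minimal sketch of cepstral mean removal. The abstract gives no implementation details, so the function name `cepstral_mean_removal` and the toy frame values are assumptions made here. The sketch relies on a standard property: a fixed linear channel appears as an additive constant in the cepstral domain, so subtracting the per-utterance mean of each coefficient cancels it.

```python
from typing import List


def cepstral_mean_removal(frames: List[List[float]]) -> List[List[float]]:
    """Subtract the per-coefficient mean over the utterance from every frame.

    `frames` is a list of cepstral vectors, one per analysis frame
    (hypothetical layout chosen for this sketch).
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each cepstral coefficient across all frames of the utterance.
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - means[k] for k in range(dim)] for f in frames]


# Toy data: the same utterance seen through two channels. The second
# channel adds a constant offset to every cepstral vector.
clean = [[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]]
offset = [0.5, -2.0]
shifted = [[c + o for c, o in zip(frame, offset)] for frame in clean]

normalized_clean = cepstral_mean_removal(clean)
normalized_shifted = cepstral_mean_removal(shifted)
```

After normalization the two versions of the utterance are numerically identical up to rounding, which is the sense in which the front end becomes less sensitive to the input channel.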

ic971111.pdf




Double Bigram-Decoding in Phonotactic Language Identification

Authors:

Jirí Navrátil, Technical University of Ilmenau (Germany)
Werner Zühlke, Technical University of Ilmenau (Germany)

Volume 2, Page 1115

Abstract:

This paper describes a phonotactic language identification system that employs a multilingual phone recognizer with multiple language-dependent grammars to tokenize the spoken signal into several phone streams. For each stream, an independent set of language models computes the language scores, which are subsequently processed by two classification stages. Thus, the system acquires information from both the original-label and the decoded-phone statistics. A discriminative weighting method is applied in the second stage to better distinguish between similar languages. A modified language-bigram model, the so-called skip-gram, is introduced; it exploits a wider phonotactic context without increasing the estimation costs of a standard bigram. Measured on the NIST'95 evaluation set, the described system outperforms the state-of-the-art phonotactic components that use multiple recognizers and is, at the same time, less computationally expensive.
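
The skip-gram idea can be sketched as follows. This is only one plausible reading of the abstract: it assumes the skip-gram conditions each phone on the phone a fixed number of positions back (here `gap=1`, i.e. one phone is skipped), and it uses add-one smoothing. The helper names `train`, `score`, the toy phone strings, and the smoothing choice are all assumptions of this sketch, not details taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Model = Dict[Tuple[str, str], float]


def train(phones: Sequence[str], gap: int, vocab: List[str]) -> Model:
    """Estimate log P(current | context) where `gap` phones lie between
    context and current. gap=0 gives an ordinary bigram."""
    step = gap + 1
    pairs = Counter(zip(phones, phones[step:]))
    context = Counter()
    for (prev, _), n in pairs.items():
        context[prev] += n
    v = len(vocab)
    # Add-one smoothing (an assumption of this sketch).
    return {
        (p, c): math.log((pairs[(p, c)] + 1) / (context[p] + v))
        for p in vocab
        for c in vocab
    }


def score(phones: Sequence[str], model: Model, gap: int) -> float:
    """Log-likelihood of a phone stream under a skip-gram model."""
    step = gap + 1
    return sum(model[pair] for pair in zip(phones, phones[step:]))


vocab = ["a", "b", "c"]
# Toy phone streams standing in for two languages.
model_x = train("a b a b a b a b".split(), gap=1, vocab=vocab)
model_y = train("a b c a b c a b c".split(), gap=1, vocab=vocab)

test_stream = "a b a b a b".split()
score_x = score(test_stream, model_x, gap=1)
score_y = score(test_stream, model_y, gap=1)
best = "X" if score_x > score_y else "Y"
```

In this sketch the skip-gram table has the same number of entries as a standard bigram table (one per ordered phone pair), while the pairs it counts span a wider window of the phone stream.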

ic971115.pdf




Random Walk Theory Applied to Language Identification

Authors:

Etienne Marcheret, RPI (U.S.A.)
Michael I. Savic, RPI (U.S.A.)

Volume 2, Page 1119

Abstract:

In this paper we discuss the most recent evaluation of the RPI language identification system by the National Institute of Standards and Technology (NIST). This system is based on an acousto-phonetic approach in which the phonemes present in a language are identified by a hidden semi-Markov model (HSMM). The HSMM was also developed at RPI. Knowledge of these phonemes provides us with the necessary probabilistic framework for classifier design. The classifier used in this system is designed in such a way that the language-specific scores generated during an evaluation form a random walk. Random walk theory has extensive applications in ecology, metallurgy, chemistry and physics. Until recently, random walk theory has been used primarily as a tool for measuring the territory covered by a diffusing particle. We now show that random walk theory can be used to effectively design a language identification system.

ic971119.pdf




Frequency Characteristics of Foreign Accented Speech

Authors:

Levent M. Arslan, Duke University (U.S.A.)
John H.L. Hansen, Duke University (U.S.A.)

Volume 2, Page 1123

Abstract:

In this study, the frequency characteristics of foreign-accented speech are investigated. Experiments are conducted to discover the relative significance of different resonant frequencies and frequency bands in terms of their accent discrimination ability. It is shown that the second and third formants are more important than other resonant frequencies. A filter bank analysis of accented speech supports this statement: the 1500-2500 Hz range was shown to be the most significant frequency range in discriminating accented speech. Based on these results, a new frequency scale is proposed in place of the commonly used Mel scale to extract the cepstrum coefficients from the speech signal. The proposed scale results in better performance for the problems of accent classification and language identification.

ic971123.pdf




A Study on Improving Decisions in Closed Set Speaker Identification

Authors:

Mubeccel Demirekler, ODTU (Turkey)
Afsar Saranli, ODTU (Turkey)

Volume 2, Page 1127

Abstract:

In this study, closed-set, text-independent speaker identification is considered and the problem of improving the reliability of the decisions made by available algorithms is addressed. The work presented here is based on the idea of combining evidence from different algorithms or decision strategies to improve recognition performance and reliability. For this purpose, the models generated by a single algorithm for 17 speakers from the SPIDRE database are considered, and a matrix of speaker-to-model fitness values is processed by two different decision strategies. Ideas from the Mathematical Theory of Evidence are applied to combine the decisions produced by these two strategies to generate a better decision on the speaker identity. The combined decision shows an improved degree of correctness, suggesting a promising way of combining the decisions from partially successful algorithms.
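
To make the evidence-combination idea concrete, the following sketch applies Dempster's rule of combination, the basic combination rule of the Mathematical Theory of Evidence. The abstract does not state which rule or which mass assignments the authors use, so the rule choice, the function name `dempster_combine`, the three speaker labels, and the mass values are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def dempster_combine(m1: Mass, m2: Mass) -> Mass:
    """Combine two basic mass assignments with Dempster's rule."""
    combined: Mass = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                # Mass that the two sources assign to incompatible sets.
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Renormalize so the remaining masses sum to one.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


# Hypothetical closed set of three speakers.
everyone = frozenset({"s1", "s2", "s3"})
s1 = frozenset({"s1"})
s2 = frozenset({"s2"})

# Strategy 1 favours s1 and leaves the rest of its belief uncommitted.
strategy_1: Mass = {s1: 0.6, everyone: 0.4}
# Strategy 2 mostly favours s1 but gives some support to s2.
strategy_2: Mass = {s1: 0.5, s2: 0.3, everyone: 0.2}

fused = dempster_combine(strategy_1, strategy_2)
decision = max((s for s in fused if len(s) == 1), key=lambda s: fused[s])
```

With these toy values the fused belief in `s1` is higher than the belief either strategy assigned on its own, which illustrates how two partially reliable decisions can reinforce each other.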

ic971127.pdf




The Use Of Harmonic Features In Speaker Recognition

Authors:

Bojan Imperl, University of Maribor (Slovenia)
Zdravko Kacic, University of Maribor (Slovenia)
Bogomir Horvat, University of Maribor (Slovenia)

Volume 2, Page 1131

Abstract:

This paper introduces Harmonic features, which are based on the harmonic decomposition of the Hildebrand-Prony line spectrum. The Hildebrand-Prony method of spectral analysis was applied because of its high resolution and accuracy. Comparative tests with LP and LP-cepstral features were made with 50 speakers from the Slovene database SNABI (isolated-word corpus) and 50 speakers from the German database BAS Siemens 100 (sentence utterances). On both databases the advantages of the Harmonic features were most evident for speaker identification. For speaker verification, the Harmonic features performed better on the SNABI database and as well as the LP-cepstral features on the BAS Siemens 100 database.

ic971131.pdf




An Approach to Speaker Identification Using Multiple Classifiers

Authors:

Vlasta Radová, University of West Bohemia (Czech Republic)
Josef Psutka, University of West Bohemia (Czech Republic)

Volume 2, Page 1135

Abstract:

This paper addresses the speaker identification problem. The attributes representing the voice of a particular speaker are obtained from very short segments of the speech waveform, each corresponding to only one pitch period of a vowel. The patterns formed from the samples of a pitch-period waveform are either matched in the time domain using a nonlinear time warping method known as dynamic time warping (DTW), or converted into cepstral coefficients and compared using the cepstral distance measure. Since an uttered speech signal usually contains many vowels, techniques that combine both various classifiers and multiple classifier outputs are considered in the decision-making process. Experiments performed with one hundred speakers are described at the end of the paper.
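
The time-domain matching step can be illustrated with a textbook dynamic time warping distance between two sample sequences. This is a generic sketch, not the authors' implementation: the absolute-difference local cost, the unconstrained warping path, the function name `dtw_distance`, and the toy waveforms are assumptions made here.

```python
from typing import Dict, Sequence


def dtw_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Classic DTW: minimum cumulative cost of aligning x with y."""
    inf = float("inf")
    n, m = len(x), len(y)
    # d[i][j] = best cost of aligning the first i samples of x
    # with the first j samples of y.
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


# Hypothetical single-pitch-period templates for two speakers.
templates: Dict[str, Sequence[float]] = {
    "speaker_a": [0.0, 1.0, 2.0, 1.0, 0.0],
    "speaker_b": [0.0, 2.0, 0.0, 2.0, 0.0],
}

# A time-stretched version of speaker_a's pitch period.
probe = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0]

distances = {name: dtw_distance(t, probe) for name, t in templates.items()}
identified = min(distances, key=distances.get)
```

Because DTW may repeat samples along the alignment path, the stretched probe matches the `speaker_a` template exactly, while it cannot be aligned with `speaker_b` at zero cost. A practical system would combine many such per-vowel decisions, as the abstract indicates.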

ic971135.pdf
