Sub-Band Based Speaker Verification Using Dynamic Recombination Weights
Authors:
Perasiriyan Sivakumaran, University of Hertfordshire (U.K.)
Page (NA), paper number 1055
Abstract: The concept of splitting the entire frequency domain into sub-bands and processing the spectra in these bands independently between consecutive recombination stages to generate a final score has recently been proposed for speech recognition. Some aspects of this technique have also been studied for speaker recognition. A critical remaining problem in this approach, however, is the determination of appropriate recombination weights. This paper presents a new method for generating these weights for sub-band based speaker verification. The approach is based on the use of background speaker models and aims to reduce the effect of any mismatch between the band-limited segments of the test utterance and the corresponding sections of the target speaker model. The paper also discusses a problem generally associated with sub-band cepstral features and outlines a possible solution.
Multimedia files:
0734_01.WAV (was 734_1.wav) | source sound | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_02.PDF (was 734_2.jpg) | source spectrogram | JPEG image | created with lview on Win95
0734_03.WAV (was 734_3.wav) | target sound | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_04.PDF (was 734_4.jpg) | target spectrogram | JPEG image | created with lview on Win95
0734_05.WAV (was 734_5.wav) | noise source HMM transformation | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_06.PDF (was 734_6.jpg) | source noise HMM transformation spectrogram | JPEG image | created with lview on Win95
0734_07.WAV (was 734_7.wav) | random background noise HMM transformation | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_08.PDF (was 734_8.jpg) | random background noise HMM transformation spectrogram | JPEG image | created with lview on Win95
0734_09.WAV (was 734_9.wav) | original sound | WAV audio | created with Cool Edit on Win95
0734_10.PDF (was 734_10.jpg) | original spectrogram | JPEG image | created with lview on Win95
0734_11.WAV (was 734_11.wav) | harmonics subtraction sound | WAV audio, 8 kHz, 16-bit mono ADPCM | created with Cool Edit on Win95
0734_12.PDF (was 734_12.jpg) | harmonics subtraction spectrogram | JPEG image | created with lview on Win95
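To make the recombination step of the abstract above concrete, here is a minimal Python sketch. The softmax weighting over per-band likelihood ratios is a generic stand-in for the background-model-derived weights (the abstract does not give the exact formula), so read it as one plausible realization rather than the authors' method.

```python
import numpy as np

def recombine_subband_scores(target_scores, background_scores):
    """Combine per-band target log-likelihoods into one verification score.

    target_scores[b]     : log-likelihood of the test segment in band b
                           under the claimed speaker's model
    background_scores[b] : matching log-likelihood under a background model

    Bands where the target model fits much worse than the background
    (a likely train/test mismatch) receive lower weights.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    background_scores = np.asarray(background_scores, dtype=float)
    # Per-band log-likelihood ratios; large negative values flag mismatch.
    llr = target_scores - background_scores
    # Softmax over the ratios gives normalized, dynamically chosen weights
    # (one heuristic choice; the paper derives its own weighting).
    weights = np.exp(llr - llr.max())
    weights /= weights.sum()
    return float(np.dot(weights, llr))

# Example: 4 sub-bands, band 2 badly mismatched.
print(recombine_subband_scores([-10.0, -9.5, -30.0, -11.0],
                               [-12.0, -11.0, -13.0, -12.5]))
```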
Mike Lincoln, University of East Anglia (U.K.)
Stephen Cox, University of East Anglia (U.K.)
Simon Ringland, British Telecom Laboratories (U.K.)
The ability to automatically identify a speaker's accent would be very useful for a speech recognition system, as it would enable the system to use both a pronunciation dictionary and speech models specific to that accent, techniques which have been shown to improve accuracy. Here, we describe some experiments in unsupervised accent classification. Two techniques have been investigated to classify British- and American-accented speech: an acoustic approach, in which we analyse a speaker's pattern of usage of the distributions in the recogniser to decide on his most probable accent, and a high-level approach, in which we use a phonotactic model for classification of the accent. Results show that both techniques give excellent performance on this task, which is maintained when testing is done on an independent dataset.
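As a rough illustration of the phonotactic approach, one can score a decoded phone string under per-accent bigram models and pick the best. The sketch below is an assumption-laden toy: the phone strings, smoothing scheme and two-accent setup are invented for the example, not taken from the paper.

```python
import math
from collections import defaultdict

def train_bigram(phone_sequences, smoothing=1.0):
    """Phone-bigram model with additive smoothing (illustrative only)."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
            vocab.update((a, b))
    def logprob(a, b):
        total = sum(counts[a].values()) + smoothing * len(vocab)
        return math.log((counts[a][b] + smoothing) / total)
    return logprob

def classify_accent(phones, models):
    """Pick the accent whose bigram model scores the phone string best."""
    def score(logprob):
        return sum(logprob(a, b) for a, b in zip(phones, phones[1:]))
    return max(models, key=lambda accent: score(models[accent]))

# Toy usage with hypothetical phone strings.
models = {
    "British": train_bigram([list("pataka"), list("tapaka")]),
    "American": train_bigram([list("akatap"), list("akapat")]),
}
print(classify_accent(list("pataka"), models))  # -> British
```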
Dominik R. Dersch, University of Sydney, Department of Electrical Engineering (Australia)
Christopher Cleirigh, University of Sydney, Department of Electrical Engineering (Australia)
Julie Vonwiller, University of Sydney, Department of Electrical Engineering (Australia)
We analyse and compare a low-dimensional linguistic representation of vowels with high-dimensional prototypical vowel templates derived from native Australian English speakers. To simplify the problem, the study is restricted to a group of short and long vowels. In the low-dimensional linguistic representation, a vowel is represented by the horizontal and vertical position of the part of the tongue involved in the key articulation of that vowel, e.g., high or low and front or back. To this is added lip posture, spread or rounded. For comparison, we perform a multidimensional scaling transformation of high-dimensional vowel clusters derived from speech samples. We further performed the same analysis on Lebanese- and Vietnamese-accented English to investigate how differences due to accent affect such a representation.
J.A. du Preez, University of Stellenbosch (South Africa)
D.M. Weber, University of Stellenbosch (South Africa)
We present automatic language recognition results using high-order hidden Markov models (HMMs) and the recently developed ORder rEDucing (ORED) and Fast Incremental Training (FIT) HMM algorithms. We demonstrate the efficiency and accuracy of mixed-order and fixed-order HMMs that model pseudo-phoneme context and duration, compared with conventional approaches. For a two-language problem, we show that a third-order FIT-trained HMM gives a test-set accuracy of 97.4%, compared to 89.7% for a conventionally trained third-order HMM. A first-order model achieved 82.1% accuracy on the same problem.
Marcos Faúndez-Zanuy, Escola Universitaria Politecnica de Mataro (Spain)
Daniel Rodríguez-Porcheron, Universidad Politecnica de Catalunya (Spain)
This paper discusses the usefulness of the residual signal for speaker recognition. It is shown that combining a measure defined over LPCC coefficients with a measure defined over the energy of the residual signal yields an improvement over the classical method, which considers only the LPCC coefficients. If the residual signal is obtained from a linear prediction analysis, the improvement is 2.63% (the error rate drops from 6.31% to 3.68%), and if it is computed with a nonlinear predictive model based on neural nets, the improvement is 3.68%.
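For readers unfamiliar with the residual signal, the sketch below derives it from standard LPC analysis (autocorrelation method, solved directly for clarity). The fusion with the LPCC measure is not reproduced, and the frame content and predictor order are arbitrary choices for the example.

```python
import numpy as np

def lpc_residual(frame, order=12):
    """LPC analysis by the autocorrelation method; returns the residual.

    A minimal sketch: the normal equations are solved directly rather
    than with Levinson-Durbin, which is fine for illustration.
    """
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])       # predictor coefficients
    pred = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
    return frame - pred                           # prediction error signal

# Residual energy of a toy frame; in the paper a measure over this
# energy is fused with an LPCC-based measure.
frame = np.sin(0.3 * np.arange(240)) + 0.01 * np.random.randn(240)
residual = lpc_residual(frame)
print(float(np.sum(residual ** 2)))
```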
Yong Gu, Vocalis Ltd. (U.K.)
Trevor Thomas, Vocalis Ltd. (U.K.)
This paper presents an HMM-based speaker verification system which was implemented for a field trial. One of the challenges in moving HMMs from speech recognition to speaker verification is to understand the HMM score variation and to define a proper measurement which is comparable across speech samples. In this paper we define two basic verification measurements, a qualifier-based measurement and a competition-based measurement, and examine score normalisation approaches using them. This leads to some useful theoretical differentiation between the cohort-model and world-model approaches used for HMM score normalisation. We adopted a world-model method for score normalisation in the system. The adaptive variance flooring technique is also implemented in the system. The paper presents evaluation results of the implementation.
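A minimal sketch of world-model score normalisation in the spirit described here, with a toy diagonal Gaussian standing in for the HMMs; the model parameters, frame data and zero threshold are placeholders, not the paper's values.

```python
import numpy as np

class DiagGaussian:
    """Toy stand-in for a speaker model exposing a log-likelihood."""
    def __init__(self, mean, var):
        self.mean = np.asarray(mean, dtype=float)
        self.var = np.asarray(var, dtype=float)
    def log_likelihood(self, x):
        d = np.asarray(x, dtype=float) - self.mean
        return float(-0.5 * np.sum(d * d / self.var
                                   + np.log(2 * np.pi * self.var)))

def normalized_score(frames, target, world):
    """Average frame log-likelihood ratio against a world model, which
    makes scores comparable across utterances of different lengths."""
    return float(np.mean([target.log_likelihood(f) - world.log_likelihood(f)
                          for f in frames]))

target = DiagGaussian([0.0, 0.0], [1.0, 1.0])
world = DiagGaussian([0.5, -0.5], [2.0, 2.0])
frames = np.random.randn(50, 2)               # toy feature frames
print(normalized_score(frames, target, world) >= 0.0)  # accept the claim?
```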
Javier Hernando, Polytechnical University of Catalonia (Spain)
Climent Nadeu, Polytechnical University of Catalonia (Spain)
The spectral parameters that result from filtering the frequency sequence of log mel-scaled filter-bank energies with a first- or second-order FIR filter have proved to be competitive for speech recognition. Recently, the authors have shown that this frequency filtering can approximately equalize the cepstrum variance, enhancing the oscillations of the spectral envelope curve that are most effective for discrimination between speakers. Even better speaker identification results than with mel-cepstrum were observed on the TIMIT database, especially when white noise was added. In this paper, the hybridization of linear prediction and filter-bank spectral analysis, using either the cepstral transformation or the alternative frequency filtering, is explored for speaker verification. This combination, which had been shown to outperform the conventional techniques in clean and noisy word recognition, has yielded good text-dependent speaker verification results on the new speaker-oriented telephone-line POLYCOST database.
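A sketch of the frequency-filtering idea: the FIR filter runs along the band index within each frame, not along time. The taps below correspond to one commonly cited choice, H(z) = z - z^{-1}; the filters actually used by the authors may differ.

```python
import numpy as np

def frequency_filter(log_fbank, taps=(1.0, 0.0, -1.0)):
    """Filter the *frequency* sequence of log filter-bank energies.

    For each frame, the vector of log mel energies is convolved across
    the band index with a short FIR filter (here a centered difference,
    i.e. a slope across neighbouring bands). The result is used directly
    as a feature vector instead of the cepstrum.
    """
    log_fbank = np.atleast_2d(np.asarray(log_fbank, dtype=float))
    out = np.empty_like(log_fbank)
    for t, frame in enumerate(log_fbank):
        # 'same' keeps one coefficient per band; edges are zero-padded.
        out[t] = np.convolve(frame, taps, mode="same")
    return out

# One frame of 8 toy log mel energies -> 8 frequency-filtered features.
print(frequency_filter(np.log([1, 2, 4, 8, 8, 4, 2, 1])))
```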
Qin Jin, Tsinghua University (China)
Luo Si, Tsinghua University (China)
Qixiu Hu, Tsinghua University (China)
This paper describes a high-performance text-independent speaker identification system. The system includes two subsystems: a closed-set speaker identification system and an open-set speaker identification system. In the implementation we introduce an advanced VQ method and a new distance estimation algorithm called BCDM (Based on Codes Distribution Method). In closed-set identification, the correct recognition rate is 98.5% with 50 speakers in the training set. In open-set identification, the equal error rate is 5% with 40 speakers in the training set.
Hiroshi Kido, Faculty of Engineering, Utsunomiya University and National Research Institute of Police Science (Japan)
Hideki Kasuya, Faculty of Engineering, Utsunomiya University (Japan)
As a first step toward the development of a "speech montage system", this paper attempts to derive a core set of Japanese epithets which are commonly used in everyday life to represent voice quality features associated with talker individuality. Perceptual experiments were conducted in which subjects were asked to evaluate sentence utterances recorded from a variety of male speakers in terms of 25 epithets which had been derived in another experiment [1] as indicative of voice quality relevant to talker individuality. The evaluation scores were subjected to a statistical clustering analysis, which showed that the 25 epithets could be grouped into eight categories for male subjects and seven for female subjects. These categories were basically the same as those obtained in the previous experiment [1], where subjects were required to evaluate their own voices with the same set of 25 epithets. The agreement between the results of the two experiments supports the reliability of the core epithet categories for representing voice quality associated with talker individuality.
Ji-Hwan Kim, Korea Advanced Institute of Science and Technology (Korea)
Gil-Jin Jang, Korea Advanced Institute of Science and Technology (Korea)
Seong-Jin Yun, Korea Advanced Institute of Science and Technology (Korea)
Yung Hwan Oh, Korea Advanced Institute of Science and Technology (Korea)
Log likelihood ratio normalisation and scoring methods have been studied by many researchers and have improved the performance of speaker identification systems. However, these approaches have disadvantages: the distorted speech segments that are recognised differ from speaker to speaker, and the background model used in log likelihood ratio normalisation changes from segment to segment even for the same speaker. This paper presents two techniques: first, candidate selection based on significance testing, which designs the background speaker model more accurately; and second, a scoring method which uses the same distorted speech segments for every speaker. We perform a number of experiments on the SPIDRE database.
Yuko Kinoshita, Department of Linguistics (Faculty of Arts) and Japan Centre (Faculty of Asian studies), Australian National University (Australia)
This paper explores non-contemporaneous within-speaker variation in a Japanese male speaker, focusing on the difference between speech styles, viz. natural speech and read-out speech. Recordings made under forensic conditions are mostly of natural speech. The suspect's recordings used for comparison, however, are sometimes read-out rather than natural speech, in order to obtain phonological conditions similar to those of the original criminal speech. This paper aims to examine the validity of such a procedure.
Filipp Korkmazskiy, Lucent Technologies, Bell Laboratories (USA)
Biing-Hwang Juang, Lucent Technologies, Bell Laboratories (USA)
In this paper, we propose a procedure for training a pronunciation network with criteria consistent with the optimality objectives of speech recognition systems. In particular, we describe a framework for using maximum likelihood (ML) and minimum classification error (MCE) criteria for pronunciation network optimization. The ML criterion is used to obtain an optimal structure for the pronunciation network based on statistically derived phonological rules. Discrimination among different pronunciation networks is achieved by weighting the pronunciation networks, optimized by applying the MCE criterion. Experimental results demonstrate improvements in speech recognition accuracy after applying statistically derived phonological rules. It is shown that the impact of the pronunciation network weighting on recognition performance is determined by the size of the recognition vocabulary.
Arne Kjell Foldvik, Department of Linguistics, NTNU (Norway)
Knut Kvale, Telenor R&D (Norway)
Traditional dialect maps are based on data from carefully selected informants, which usually results in clear-cut dialect borders (isoglosses), with a given dialect characteristic present on one side of the isogloss and absent on the other. We illustrate some of the problems and pitfalls of using dialect maps for ASR by comparing results from traditional dialect research with investigations of the Norwegian part of the European SpeechDat database, centred on the two main types of /r/ pronunciation. Our analysis shows that traditional dialect maps and surveys may be of limited use in ASR. The extent to which the Norwegian findings have parallels in other countries will depend on two main factors: dialect allegiance versus a national standard pronunciation, and the extent to which the population is sedentary or mobile. Results from traditional dialect research may therefore be more useful for ASR in languages other than Norwegian.
Youn-Jeong Kyung, KAIST (Korea)
Hwang-Soo Lee, KAIST, SK-telecom (Korea)
The acoustic properties that differentiate voices are difficult to separate from signal traits that reflect the identity of the sounds. There are two sources of variation among speakers: (1) differences in vocal cords and vocal tract shape, and (2) differences in speaking style. The latter includes variation both in the target vocal tract positions for phonemes and in dynamic aspects of speech, such as speaking rate. However, most parameters and features capture only the former. In this paper, we propose the use of a prosodic feature that represents the micro-prosody of utterances and show that it is robust in noisy environments. We also propose a combined model that uses both the spectral feature and the prosodic feature. In our experiments, this model provides robust speaker recognition in noisy environments.
Yoik Cheng, The Chinese University of Hong Kong (China)
Hong C. Leung, The Chinese University of Hong Kong (China)
This paper describes the use of speech fundamental frequency (F0) for speaker verification. Both Chinese and English are included in this study, with Chinese representing a tonal language and English a non-tonal language. An HMM-based speaker verification system has been developed, using features based on cepstral coefficients and the F0 contour. Four different techniques have been investigated in our experiments on the YOHO database and a similar Chinese speech database. It has been found that the pitch information reduces the equal error rate (EER) by 40.5% and 33.9% for Cantonese and English, respectively, suggesting that pitch information is important for speaker verification, and more so for tonal languages. We have also found that the pitch information is even more effective when represented in the log domain, resulting in an EER of 2.28% for Cantonese; this corresponds to a 54% reduction of the EER.
Weijie Liu, Laboratory for Information Technology, NTT Data Corporation (Japan)
Toshihiro Isobe, Laboratory for Information Technology, NTT Data Corporation (Japan)
Naoki Mukawa, Laboratory for Information Technology, NTT Data Corporation (Japan)
Score normalization has become necessary for speaker verification systems, but general principles leading to optimum performance are lacking. In this paper, theoretical analyses of optimum normalization are given. In light of these analyses, four existing methods, based on the likelihood ratio, cohorts, the a posteriori probability and pooled cohorts, are investigated. The performance of these methods in verification with known impostors, their robustness to different impostors, and the separability of the optimal threshold from the impostor model are discussed on the basis of experiments with a database of 100 speakers.
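To make the cohort-based variant concrete, here is a minimal sketch; averaging the top-N cohort scores is one common scheme among those compared in work like this, and the numbers below are invented.

```python
import numpy as np

def cohort_normalized_score(claim_score, cohort_scores, top_n=5):
    """Cohort normalization: subtract the mean score of the closest
    competing (cohort) speakers from the claimed speaker's score.

    claim_score   : log-likelihood of the utterance under the claimed model
    cohort_scores : log-likelihoods under the cohort speakers' models
    top_n         : how many of the best-scoring cohorts to average
                    (a common choice; papers compare several schemes)
    """
    best = sorted(cohort_scores, reverse=True)[:top_n]
    return claim_score - float(np.mean(best))

# The same utterance scored against the claimed model and 8 cohorts.
print(cohort_normalized_score(-95.0, [-110.0, -99.0, -120.0, -101.0,
                                      -130.0, -98.0, -125.0, -115.0]))
```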
Harvey Lloyd-Thomas, Ensigma (U.K.)
Eluned S. Parris, Ensigma (U.K.)
Jeremy H. Wright, AT&T (USA)
Recurrent phone substrings that are characteristic of a language are a promising technique for language recognition. In previous work on language recognition, building anti-models to normalise the scores from acoustic phone models for target languages was shown to reduce the equal error rate (EER) by a third. Recurrent substrings and anti-models have now been applied alongside three other techniques (bigrams, usefulness and frequency histograms) to the NIST 1996 Language Recognition Evaluation, using data from the CALLFRIEND and OGI databases for training. By fusing the scores from the different techniques with a multi-layer perceptron, the EER on the NIST data can be reduced further.
Konstantin P. Markov, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)
In speaker recognition, when cepstral coefficients are calculated from LPC analysis parameters, the LPC residual and pitch are usually ignored. This paper describes an approach that integrates the pitch and the LPC residual with the LPC cepstrum in a Gaussian mixture model based speaker recognition system. The pitch is represented as the logarithm of F0 and the LPC residual as an MFCC vector. A second aim of this research is to verify whether the correlation between the different information sources is useful for speaker recognition. The results show that adding the pitch gives a significant improvement only when the correlation between the pitch and the cepstral coefficients is used. Adding only the LPC residual also gives a significant improvement, but using its correlation with the cepstral coefficients has little additional effect. The best results achieved are a 98.5% speaker identification rate and a 0.21% speaker verification equal error rate, compared to 97.0% and 1.07% for the baseline system.
Konstantin P. Markov, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)
In this paper, we present a new discriminative training method for Gaussian mixture models (GMMs) and its application to text-independent speaker recognition. The objective of this method is to maximize the frame-level normalized likelihoods of the training data, which is why we call it Maximum Normalized Likelihood Estimation (MNLE). In contrast to other discriminative algorithms, the objective function is optimized using a modified Expectation-Maximization (EM) algorithm, which greatly simplifies the training procedure. Evaluation experiments using both clean and telephone speech showed improved recognition rates compared to speaker models trained by Maximum Likelihood Estimation (MLE), especially when the mismatch between the training and testing conditions is significant.
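Read literally, "frame-level normalized likelihood" suggests an objective of the following shape; this is a plausible reconstruction under that reading, not a formula quoted from the paper.

```latex
% Normalized-likelihood objective for target speaker s_0 among S speakers,
% summed over the T training frames x_t (reconstructed, not verbatim):
\[
  J(\lambda_{s_0}) \;=\; \sum_{t=1}^{T}
    \log \frac{p(\mathbf{x}_t \mid \lambda_{s_0})}
              {\sum_{s=1}^{S} p(\mathbf{x}_t \mid \lambda_{s})}
\]
% MNLE maximizes J over \lambda_{s_0} with a modified EM algorithm.
```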
Driss Matrouf, LIMSI/CNRS (France)
Martine Adda-Decker, LIMSI/CNRS (France)
Lori F. Lamel, LIMSI/CNRS (France)
Jean-Luc Gauvain, LIMSI/CNRS (France)
In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus. Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection.
Enric Monte, UPC (Spain)
Ramón Arqué, UPC (Spain)
Xavier Miró, UPC (Spain)
In speaker recognition systems based on VQ, each speaker is normally assigned a codebook, and classification is done by means of a distortion distance of the utterance computed with each codebook. In [1] we proposed a system which, instead of one codebook per speaker, uses a single codebook for all speakers plus one histogram per speaker. This histogram is the occupancy rate of each codeword for a given speaker, i.e., the probability that the speaker utters the information related to the codeword, so the normalized histogram approximates the pdf of each speaker. In this paper we present an exhaustive study of different measures for comparing histograms: Kullback-Leibler divergence, the log-difference of each probability, a geometrical distance, and the Euclidean distance. We have also studied exhaustively the properties of the system for each distance in the presence of noise (white and colored) and for different parameterizations: LPC, MFCC, LPC-Cepstrum-OSA (one-sided autocorrelation sequence) and LPC-Cepstrum (cepstrum with/without liftering). As the number of experimental combinations was high, the conclusions were drawn after an analysis of variance (ANOVA) and t-tests. Thus conclusions, with significance levels, can be drawn about the differences and interactions between kind of distance, parameterization, kind of noise and level of noise.
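Three of the four histogram measures are unambiguous and easy to state in code (the "geometrical distance" could mean several things, so it is omitted); the codebook size and histograms below are toy values.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two normalized histograms."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def log_difference(p, q, eps=1e-12):
    """Sum of absolute log-probability differences per codeword."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(np.abs(np.log(p / q))))

def euclidean(p, q):
    """Plain Euclidean distance between the histograms."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

# Occupancy histograms of a shared 4-codeword codebook for a test
# utterance and two enrolled speakers; pick the closest speaker.
test = np.array([0.1, 0.4, 0.3, 0.2])
speakers = {"A": np.array([0.05, 0.45, 0.3, 0.2]),
            "B": np.array([0.4, 0.1, 0.1, 0.4])}
print(min(speakers, key=lambda s: kl_divergence(test, speakers[s])))
```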
Asunción Moreno, Universitat Politécnica de Catalunya (Spain)
José B. Mariño, Universitat Politécnica de Catalunya (Spain)
It is well known that canonical Spanish, the 'central' dialectal variant of Spain known as Castilian, can be transcribed by rules. This paper deals with automatic grapheme-to-phoneme transcription rules for several Spanish dialects of Latin America. Spanish is a language spoken by more than 300 million people; it has a wide geographical dispersion compared with other languages and has historically been influenced by many native languages. In this paper the authors extend the Castilian transcription rules to a set of different dialectal variants of Latin America. Transcriptions are based on SAMPA symbols. The paper identifies sounds that do not appear in Castilian, extends the accepted SAMPA symbols for Spanish (Castilian) to the different dialectal variants, describes the rules necessary to implement automatic orthographic-to-phonetic transcription in several dialectal Spanish variants, and shows some quantitative results on dialectal differences.
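A deliberately tiny grapheme-to-phoneme fragment in SAMPA showing how dialect-dependent rules can be layered on a common Castilian rule base. The rules below are simplified textbook examples (e.g. "seseo": Castilian /T/ for soft c/z is realized as /s/ across most of Latin America), not the paper's rule set.

```python
def transcribe(word, dialect="castilian"):
    """Toy letter-to-SAMPA transcriber with two dialect-sensitive rules."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "c" and nxt in "ei":
            out.append("T" if dialect == "castilian" else "s")  # seseo
        elif ch == "z":
            out.append("T" if dialect == "castilian" else "s")  # seseo
        elif ch == "l" and nxt == "l":
            # "yeismo": /L/ merges with a palatal glide in many varieties
            out.append("L" if dialect == "castilian" else "jj")
            i += 1                     # consume the second "l"
        elif ch == "h":
            pass                       # orthographic h is silent
        else:
            out.append(ch)             # naive fallback: letter = phone
        i += 1
    return " ".join(out)

print(transcribe("cena", "castilian"))        # T e n a
print(transcribe("cena", "latin_american"))   # s e n a
print(transcribe("llave", "castilian"))       # L a v e
print(transcribe("llave", "latin_american"))  # jj a v e
```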
Mieko Muramatsu, Fukushima Medical University/Reading University (U.K.)
L1 transfer may explain prosodic errors in an L2. Several comparative studies of Japanese English prosody have been conducted. However, only oral reading texts have been used, and little attention has been paid to the effect of differences in the L1 dialect, especially "accentless" Japanese dialects. This preliminary study investigates the differences in prosodic L1 transfer to English between speakers of the Fukushima dialect (an accentless dialect) and the Tokyo dialect (an accented dialect) in declarative sentences and yes-no questions. A two-way communicative task was selected to induce natural utterances. The fundamental frequency of three female voices from each dialect group was measured at twenty equally spaced points of observation. The major finding is that there do appear to be dialectal differences in L1 transfer of prosody. However, this preliminary study is not conclusive, and a more comprehensive investigation will be necessary.
Hideki Noda, Kyushu Institute of Technology (Japan)
Katsuya Harada, Kyushu Institute of Technology (Japan)
Eiji Kawaguchi, Kyushu Institute of Technology (Japan)
Hidefumi Sawai, Communications Research Laboratory (Japan)
This paper is concerned with speaker verification (SV) using the sequential probability ratio test (SPRT). In the SPRT, input samples are usually assumed to be i.i.d. samples from a probability density function because an on-line probability computation is required. Feature vectors used in speech processing obviously do not satisfy this assumption, and therefore the correlation between successive feature vectors has not been considered in conventional SV using the SPRT. The correlation can be modeled by a hidden Markov model (HMM), but unfortunately the HMM cannot be applied directly to the SPRT because of the statistical dependence of the input samples. This paper proposes a method of HMM probability computation using the mean field approximation to resolve this problem, in which the probability of the whole input sequence is nominally represented as the product of the probabilities of each sample, as if the input samples were independent of each other.
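For context, here is Wald's SPRT in its standard form as used for verification. The paper's actual contribution, the mean-field approximation that lets correlated HMM frame scores feed this test, is not reproduced; the frame scores in the example are invented.

```python
import math

def sprt_verify(llr_per_frame, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test for speaker verification.

    llr_per_frame : iterable of per-frame log-likelihood ratios
                    log p(x_t | target) - log p(x_t | impostor)
    alpha, beta   : target false-acceptance / false-rejection rates

    Frames are consumed one at a time; the test stops as soon as the
    cumulative ratio crosses one of Wald's two thresholds.
    """
    upper = math.log((1 - beta) / alpha)   # accept the target claim
    lower = math.log(beta / (1 - alpha))   # reject as an impostor
    total, n = 0.0, 0
    for n, llr in enumerate(llr_per_frame, start=1):
        total += llr
        if total >= upper:
            return "accept", n
        if total <= lower:
            return "reject", n
    return "undecided", n

print(sprt_verify([0.8, 1.1, 0.9, 1.3, 1.0, 0.7]))  # -> ('accept', 5)
```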
Javier Ortega-García, Universidad Politecnica de Madrid (Spain)
Santiago Cruz-Llanas, Universidad Politecnica de Madrid (Spain)
Joaquin González-Rodríguez, Universidad Politecnica de Madrid (Spain)
With regard to speaker identity under forensic conditions, several factors of variability must be taken into account, such as peculiar intra-speaker variability, forced intra-speaker variability and channel-dependent external influences. Automatic speaker verification experiments have been carried out using the large 'AHUMADA' speech database in Spanish, which contains several recording sessions and channels and includes different tasks for 100 male speakers. Owing to the inherently non-cooperative nature of speakers in forensic applications, only text-independent recognizers are used. A GMM-based verification system is used to obtain quantitative results. Maximum likelihood estimation of the models is performed, and LPC cepstra with delta and delta-delta LPCC are used at the parameterization stage. With this baseline verification system, we intend to determine how some of the variability sources included in 'AHUMADA' affect speaker identification. Results covering the influence of speaking rate, single- and multi-session training, cross-channel testing, and kind of speech (read vs. spontaneous) are presented with likelihood-domain normalization applied.
Thilo Pfau, Institute for Human-Machine-Communication, Technical University of Munich (Germany)
Guenther Ruske, Institute for Human-Machine-Communication, Technical University of Munich (Germany)
This paper deals with the problem of building hidden Markov models (HMMs) suitable for fast speech. First, an automatic procedure is presented that splits the speech material into different categories according to speaking rate. The problem of the sparseness of the data available for estimating HMMs for fast speech is then discussed, followed by a comparison of different methods to overcome it. The main emphasis is on robust re-estimation techniques such as maximum a posteriori estimation (MAP), as well as on methods that reduce the variability of the speech signal and thereby allow the number of HMM parameters to be reduced; vocal tract length normalization (VTLN) is chosen for this purpose. Finally, various combinations of the methods discussed are compared on the basis of word error rates for fast speech. The best method (MAP combined with VTLN) reduces the error rate by 10% relative to the baseline system.
Tuan Pham, Faculty of Information Sciences & Engineering, University of Canberra (Australia)
Michael Wagner, Faculty of Information Sciences & Engineering, University of Canberra (Australia)
A nonlinear probabilistic relaxation labeling scheme for speaker identification is presented in this paper. This relaxation scheme, an iterative and parallel process, offers a flexible and effective framework for dealing with the uncertainty inherent in the labeling of speech feature vectors. Basic concepts and formulations of the relaxation algorithms are outlined. We then discuss how to apply the relaxation scheme to the labeling of speech feature vectors for the speaker identification task. The implementation is tested on the commercial speech corpus TI46. Across several codebook sizes, the results obtained with the proposed approach are more favorable than those of the conventional VQ (vector quantization) based method.
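A sketch of the generic nonlinear relaxation update (Rosenfeld-Hummel-Zucker style) applied to a chain of frames; the compatibility coefficients, neighbourhood structure and toy data are assumptions for illustration, and the paper's exact support function may differ.

```python
import numpy as np

def relaxation_labeling(p, compat, iterations=10):
    """Nonlinear probabilistic relaxation over a chain of objects.

    p      : (n_objects, n_labels) initial label probabilities
             (here: frames x candidate speaker labels)
    compat : (n_labels, n_labels) compatibility coefficients in [-1, 1]
             between labels of neighbouring objects

    Each iteration nudges every object's label distribution toward
    labels compatible with its neighbours' current beliefs, then
    renormalizes so each row stays a probability distribution.
    """
    p = np.asarray(p, dtype=float).copy()
    compat = np.asarray(compat, dtype=float)
    n = len(p)
    for _ in range(iterations):
        q = np.zeros_like(p)
        for i in range(n):
            # Support q_i: average compatibility with chain neighbours.
            neigh = [j for j in (i - 1, i + 1) if 0 <= j < n]
            q[i] = np.mean([compat @ p[j] for j in neigh], axis=0)
        p *= (1.0 + q)
        p /= p.sum(axis=1, keepdims=True)
    return p

# Three frames, two candidate speakers; compatibility favours agreement.
p0 = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
compat = [[0.5, -0.5], [-0.5, 0.5]]
print(relaxation_labeling(p0, compat))
```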
Leandro Rodríguez-Liñares, University of Vigo (Spain)
Carmen García-Mateo, University of Vigo (Spain)
In this paper we present a novel technique for combining a speaker verification system with an utterance verification system in a telephone-based speaker authentication system. Speaker verification consists of accepting or rejecting the claimed identity of a speaker by processing samples of his/her voice. Usually, these systems are based on HMMs that try to represent the characteristics of the talkers' vocal tracts. Utterance verification systems make use of a set of speaker-independent speech models to recognize a certain utterance and decide whether a speaker has uttered it or not. If the utterances are passwords, this can be used for identity verification. Until now, the two techniques have been used separately. This paper focuses on the problem of how to combine these two sources of information: a new architecture is presented that joins an utterance verification system and a speaker verification system in order to improve performance in a text-dependent speaker verification task.
Phil Rose, Department of Linguistics (Arts), Australian National University (Australia)
A forensic phonetic experiment is described which investigates the nature of non-contemporaneous within-speaker variation for six similar-sounding speakers. Between 8 and 10 intonationally varying tokens of the naturally produced single word utterance hello were elicited from six similar-sounding adult Australian males in two repeats separated by a reading of the "rainbow" passage. Both repeats are compared with a single batch of intonationally varying hello tokens recorded at least one year earlier. Within-speaker variation is quantified by ANOVA on mean non-contemporaneous differences and Scheffe's F for centre frequencies of the first 4 formants at 7 well-defined points in the word. Values for non-contemporaneous within-speaker between-token differences are also given, and their contribution to a Bayesian Likelihood Ratio is exemplified.
Astrid Schmidt-Nielsen, U.S. Naval Research Laboratory (USA)
Thomas H. Crystal, IDA Center for Communications Research (USA)
An experiment compared the speaker recognition performance of human listeners with that of computer algorithms/systems. Listening protocols were developed analogous to the procedures used in the algorithm evaluation run by the U.S. National Institute of Standards and Technology (NIST), and the same telephone conversation data were used. For "same number" testing with three-second samples, listener panels and the best algorithm had the same equal-error rate (EER) of 8%, and listeners were better than typical algorithms. For "different number" testing, EERs increased, but humans had a 40% lower equal-error rate. Other observations on human listening performance and robustness to "degradations" were made.
Stefan Slomka, Speech Laboratory, Queensland University of Technology (Australia)
Sridha Sridharan, Speech Laboratory, Queensland University of Technology (Australia)
Vinod Chandran, Speech Laboratory, Queensland University of Technology (Australia)
Input-level and output-level fusion methods are compared for fusing mel-frequency cepstral coefficients (MFCCs) with their corresponding delta coefficients. A 49-speaker subset of the King database is used under wideband and telephone conditions. The best input-level fusion system is more computationally complex than the output-level fusion system. Both input and output fusion systems were able to outperform the best purely MFCC-based system on wideband data; on King telephone data, only the output-level fusion system outperformed it. Further experiments using NIST'96 data under matched and mismatched conditions were also performed. Provided it was well tuned, the output-level fused system always outperformed the input-level fused system under all experimental conditions.
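The two fusion modes are easy to contrast in a few lines. The feature dimensions, delta computation and 0.5 placeholder weight below are assumptions for the example; the tuning of that weight is what "well tuned" refers to in practice.

```python
import numpy as np

def input_level_fusion(mfcc, delta):
    """Input-level fusion: concatenate MFCCs with their deltas so a
    single classifier sees one higher-dimensional feature vector."""
    return np.concatenate([mfcc, delta], axis=-1)

def output_level_fusion(score_mfcc, score_delta, weight=0.5):
    """Output-level fusion: run separate classifiers on each stream
    and combine their scores; `weight` is tuned on held-out data."""
    return weight * score_mfcc + (1.0 - weight) * score_delta

frames_mfcc = np.random.randn(100, 12)    # toy 12-dim MFCC frames
frames_delta = np.diff(frames_mfcc, axis=0, prepend=frames_mfcc[:1])
print(input_level_fusion(frames_mfcc, frames_delta).shape)  # (100, 24)
print(output_level_fusion(-1.2, -0.8))
```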
Hagen Soltau, Interactive Systems Laboratories, University of Karlsruhe (Germany), Carnegie Mellon University (USA) (Germany)
Alex Waibel, Interactive Systems Laboratories, University of Karlsruhe (Germany), Carnegie Mellon University (USA) (Germany)
Since we cannot rule out that speech recognizers will sometimes fail, it is important to examine how users react to recognition errors. In correction situations, speaking style becomes more accentuated in order to disambiguate the original mistake. We examine the effect of this speaking style on speech recognition performance. Our results indicate that hyperarticulation effects occur in correction situations and decrease word accuracy significantly.
Nuala C. Ward, Alcatel Australia (Australia)
Dominik R. Dersch, Department of Electrical Engineering, University of Sydney (Australia)
This paper presents a neural-network-inspired approach to speaker recognition using speaker models constructed from full data sets. A similarity measure between data sets is used for text-independent speaker identification and verification. In order to reduce the computational effort of calculating the similarity measure, a fuzzy vector quantisation procedure is applied. This method has previously been applied successfully to a database of 108 Australian English speakers. The purpose of this paper is to apply it to a larger benchmark database of 630 speakers (the TIMIT database). Using the full 630-speaker database, an accuracy of 98.2% (one test sentence) and 99.7% (two test sentences) was achieved for text-independent speaker identification. On a 462-speaker subset of the database, a 98.5% successful acceptance rate and a 96.9% successful rejection rate were achieved for text-independent speaker verification.
Lisa R. Yanguas, M.I.T. Lincoln Laboratory (USA)
Gerald C. O'Leary, M.I.T. Lincoln Laboratory (USA)
Marc A. Zissman, M.I.T. Lincoln Laboratory (USA)
In this paper we exploit linguistic knowledge to aid automatic dialect identification in Spanish. Segments of extemporaneous Cuban and Peruvian Spanish dialect data from the Miami Corpus were analyzed, and 49 linguistic features that occur at different rates in the two dialects were identified and hand-labelled. We evaluate the expected performance of the dialect detection system based on a theoretical model and compute the system's actual performance. Using a Gaussian classifier, we show that a subset of the 49 originally identified features achieves nearly perfect performance in discriminating between the two dialects. We compare these results with those from an automatic recognition system (PRLM-P). We then test this system in the limited domain of read digits from 0 through 10, using an orthographic transcription and hand-marked data for phone extraction and alignment. Initial experiments on phone-level segments show that phone duration and energy computations prove discriminatory for dialect identification.
Yiying Zhang, Department of Computer Science, Tsinghua University (China)
Xiaoyan Zhu, Department of Computer Science, Tsinghua University (China)
In this paper a new text-independent speaker verification method is proposed, based on likelihood score normalization and a global speaker model. The global speaker model is built to represent the universal features of speech and environment and is used to normalize the likelihood score. As a result, equal error rates are decreased significantly, the verification procedure is accelerated, and system adaptability is improved. Two possible ways of building the global speaker model, one of which can meet real-time requirements, are also suggested and discussed. Experiments demonstrate the effectiveness of this novel verification method and its improvement over the conventional method and other normalization methods.