Sub-Band Based Speaker Verification Using Dynamic Recombination Weights
Authors:
Perasiriyan Sivakumaran, University of Hertfordshire (U.K.)
Page (NA), paper number 1055
Abstract: The concept of splitting the entire frequency domain into sub-bands and processing the spectra in these bands independently between consecutive recombination stages to generate a final score has recently been proposed for speech recognition. Some aspects of this technique have also been studied for speaker recognition. A critical remaining problem in this approach, however, is the determination of appropriate recombination weights. This paper presents a new method for generating these weights for sub-band based speaker verification. The approach is based on the use of background speaker models and aims to reduce the effect of any mismatch between the band-limited segments of the test utterance and the corresponding sections of the target speaker model. The paper also discusses a problem generally associated with sub-band cepstral features and outlines a possible solution.
Multimedia files:
0734_01.WAV (was 734_1.wav) | source sound | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_02.PDF (was 734_2.jpg) | source spectrogram | JPEG image | created with lview on Win95
0734_03.WAV (was 734_3.wav) | target sound | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_04.PDF (was 734_4.jpg) | target spectrogram | JPEG image | created with lview on Win95
0734_05.WAV (was 734_5.wav) | noise source HMM transformation | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_06.PDF (was 734_6.jpg) | source noise HMM transformation spectrogram | JPEG image | created with lview on Win95
0734_07.WAV (was 734_7.wav) | random background noise HMM transformation | WAV audio, 8 kHz, 16-bit mono ADPCM | created with cool96 on Win95
0734_08.PDF (was 734_8.jpg) | random background noise HMM transformation spectrogram | JPEG image | created with lview on Win95
0734_09.WAV (was 734_9.wav) | original sound | WAV audio | created with Cool Edit on Win95
0734_10.PDF (was 734_10.jpg) | original spectrogram | JPEG image | created with lview on Win95
0734_11.WAV (was 734_11.wav) | harmonics subtraction sound | WAV audio, 8 kHz, 16-bit mono ADPCM | created with Cool Edit on Win95
0734_12.PDF (was 734_12.jpg) | harmonics subtraction spectrogram | JPEG image | created with lview on Win95
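To make the recombination step of the abstract above concrete, here is a minimal Python sketch. The softmax weighting over per-band likelihood ratios is a generic stand-in for the background-model-derived weights (the abstract does not give the exact formula), so read it as one plausible realization rather than the authors' method.

```python
import numpy as np

def recombine_subband_scores(target_scores, background_scores):
    """Combine per-band target log-likelihoods into one verification score.

    target_scores[b]     : log-likelihood of the test segment in band b
                           under the claimed speaker's model
    background_scores[b] : matching log-likelihood under a background model

    Bands where the target model fits much worse than the background
    (a likely train/test mismatch) receive lower weights.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    background_scores = np.asarray(background_scores, dtype=float)
    # Per-band log-likelihood ratios; large negative values flag mismatch.
    llr = target_scores - background_scores
    # Softmax over the ratios gives normalized, dynamically chosen weights
    # (one heuristic choice; the paper derives its own weighting).
    weights = np.exp(llr - llr.max())
    weights /= weights.sum()
    return float(np.dot(weights, llr))

# Example: 4 sub-bands, band 2 badly mismatched.
print(recombine_subband_scores([-10.0, -9.5, -30.0, -11.0],
                               [-12.0, -11.0, -13.0, -12.5]))
```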
Mike Lincoln, University of East Anglia (U.K.)
Stephen Cox, University of East Anglia (U.K.)
Simon Ringland, British Telecom Laboratories (U.K.)
The ability to automatically identify a speaker's accent would be very useful for a speech recognition system, as it would enable the system to use both a pronunciation dictionary and speech models specific to that accent, techniques which have been shown to improve accuracy. Here, we describe some experiments in unsupervised accent classification. Two techniques have been investigated to classify British- and American-accented speech: an acoustic approach, in which we analyse a speaker's pattern of usage of the distributions in the recogniser to decide on his most probable accent, and a high-level approach, in which we use a phonotactic model for classification of the accent. Results show that both techniques give excellent performance on this task, which is maintained when testing is done on an independent dataset.
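As a rough illustration of the phonotactic approach, one can score a decoded phone string under per-accent bigram models and pick the best. The sketch below is an assumption-laden toy: the phone strings, smoothing scheme and two-accent setup are invented for the example, not taken from the paper.

```python
import math
from collections import defaultdict

def train_bigram(phone_sequences, smoothing=1.0):
    """Phone-bigram model with additive smoothing (illustrative only)."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in phone_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
            vocab.update((a, b))
    def logprob(a, b):
        total = sum(counts[a].values()) + smoothing * len(vocab)
        return math.log((counts[a][b] + smoothing) / total)
    return logprob

def classify_accent(phones, models):
    """Pick the accent whose bigram model scores the phone string best."""
    def score(logprob):
        return sum(logprob(a, b) for a, b in zip(phones, phones[1:]))
    return max(models, key=lambda accent: score(models[accent]))

# Toy usage with hypothetical phone strings.
models = {
    "British": train_bigram([list("pataka"), list("tapaka")]),
    "American": train_bigram([list("akatap"), list("akapat")]),
}
print(classify_accent(list("pataka"), models))  # -> British
```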
Dominik R. Dersch, University of Sydney, Department of Electrical Engineering (Australia)
Christopher Cleirigh, University of Sydney, Department of Electrical Engineering (Australia)
Julie Vonwiller, University of Sydney, Department of Electrical Engineering (Australia)
We analyse and compare a low-dimensional linguistic representation of vowels with high-dimensional prototypical vowel templates derived from native Australian English speakers. To simplify the problem, the study is restricted to a group of short and long vowels. In the low-dimensional linguistic representation, a vowel is represented by the horizontal and vertical position of the part of the tongue involved in the key articulation of that vowel, e.g., high or low and front or back. To this is added lip posture, spread or rounded. For comparison, we perform a multidimensional scaling transformation of high-dimensional vowel clusters derived from speech samples. We further performed the same analysis on Lebanese- and Vietnamese-accented English to investigate how differences due to accent affect such a representation.
J.A. du Preez, University of Stellenbosch (South Africa)
D.M. Weber, University of Stellenbosch (South Africa)
We present automatic language recognition results using high-order hidden Markov models (HMMs) and the recently developed ORder rEDucing (ORED) and Fast Incremental Training (FIT) HMM algorithms. We demonstrate the efficiency and accuracy of mixed-order and fixed-order HMMs that model pseudo-phoneme context and duration, compared with conventional approaches. For a two-language problem, we show that a third-order FIT-trained HMM gives a test-set accuracy of 97.4%, compared to 89.7% for a conventionally trained third-order HMM. A first-order model achieved 82.1% accuracy on the same problem.
Marcos Faúndez-Zanuy, Escola Universitaria Politecnica de Mataro (Spain)
Daniel Rodríguez-Porcheron, Universidad Politecnica de Catalunya (Spain)
This paper discusses the usefulness of the residual signal for speaker recognition. It is shown that combining a measure defined over LPCC coefficients with a measure defined over the energy of the residual signal yields an improvement over the classical method, which considers only the LPCC coefficients. If the residual signal is obtained from a linear prediction analysis, the improvement is 2.63% (the error rate drops from 6.31% to 3.68%), and if it is computed with a nonlinear predictive model based on neural nets, the improvement is 3.68%.
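For readers unfamiliar with the residual signal, the sketch below derives it from standard LPC analysis (autocorrelation method, solved directly for clarity). The fusion with the LPCC measure is not reproduced, and the frame content and predictor order are arbitrary choices for the example.

```python
import numpy as np

def lpc_residual(frame, order=12):
    """LPC analysis by the autocorrelation method; returns the residual.

    A minimal sketch: the normal equations are solved directly rather
    than with Levinson-Durbin, which is fine for illustration.
    """
    frame = np.asarray(frame, dtype=float)
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])       # predictor coefficients
    pred = np.convolve(frame, np.concatenate(([0.0], a)))[:len(frame)]
    return frame - pred                           # prediction error signal

# Residual energy of a toy frame; in the paper a measure over this
# energy is fused with an LPCC-based measure.
frame = np.sin(0.3 * np.arange(240)) + 0.01 * np.random.randn(240)
residual = lpc_residual(frame)
print(float(np.sum(residual ** 2)))
```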
Yong Gu, Vocalis Ltd. (U.K.)
Trevor Thomas, Vocalis Ltd. (U.K.)
This paper presents an HMM-based speaker verification system which was implemented for a field trial. One of the challenges in moving HMMs from speech recognition to speaker verification is to understand the HMM score variation and to define a proper measurement which is comparable across speech samples. In this paper we define two basic verification measurements, a qualifier-based measurement and a competition-based measurement, and examine score normalisation approaches using them. This leads to some useful theoretical differentiation between the cohort-model and world-model approaches used for HMM score normalisation. We adopted a world-model method for score normalisation in the system. The adaptive variance flooring technique is also implemented in the system. The paper presents evaluation results of the implementation.
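A minimal sketch of world-model score normalisation in the spirit described here, with a toy diagonal Gaussian standing in for the HMMs; the model parameters, frame data and zero threshold are placeholders, not the paper's values.

```python
import numpy as np

class DiagGaussian:
    """Toy stand-in for a speaker model exposing a log-likelihood."""
    def __init__(self, mean, var):
        self.mean = np.asarray(mean, dtype=float)
        self.var = np.asarray(var, dtype=float)
    def log_likelihood(self, x):
        d = np.asarray(x, dtype=float) - self.mean
        return float(-0.5 * np.sum(d * d / self.var
                                   + np.log(2 * np.pi * self.var)))

def normalized_score(frames, target, world):
    """Average frame log-likelihood ratio against a world model, which
    makes scores comparable across utterances of different lengths."""
    return float(np.mean([target.log_likelihood(f) - world.log_likelihood(f)
                          for f in frames]))

target = DiagGaussian([0.0, 0.0], [1.0, 1.0])
world = DiagGaussian([0.5, -0.5], [2.0, 2.0])
frames = np.random.randn(50, 2)               # toy feature frames
print(normalized_score(frames, target, world) >= 0.0)  # accept the claim?
```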
Javier Hernando, Polytechnical University of Catalonia (Spain)
Climent Nadeu, Polytechnical University of Catalonia (Spain)
The spectral parameters that result from filtering the frequency sequence of log mel-scaled filter-bank energies with a first- or second-order FIR filter have proved to be competitive for speech recognition. Recently, the authors have shown that this frequency filtering can approximately equalize the cepstrum variance, enhancing the oscillations of the spectral envelope curve that are most effective for discrimination between speakers. Even better speaker identification results than with mel-cepstrum were observed on the TIMIT database, especially when white noise was added. In this paper, the hybridization of linear prediction and filter-bank spectral analysis, using either the cepstral transformation or the alternative frequency filtering, is explored for speaker verification. This combination, which had been shown to outperform the conventional techniques in clean and noisy word recognition, has yielded good text-dependent speaker verification results on the new speaker-oriented telephone-line POLYCOST database.
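A sketch of the frequency-filtering idea: the FIR filter runs along the band index within each frame, not along time. The taps below correspond to one commonly cited choice, H(z) = z - z^{-1}; the filters actually used by the authors may differ.

```python
import numpy as np

def frequency_filter(log_fbank, taps=(1.0, 0.0, -1.0)):
    """Filter the *frequency* sequence of log filter-bank energies.

    For each frame, the vector of log mel energies is convolved across
    the band index with a short FIR filter (here a centered difference,
    i.e. a slope across neighbouring bands). The result is used directly
    as a feature vector instead of the cepstrum.
    """
    log_fbank = np.atleast_2d(np.asarray(log_fbank, dtype=float))
    out = np.empty_like(log_fbank)
    for t, frame in enumerate(log_fbank):
        # 'same' keeps one coefficient per band; edges are zero-padded.
        out[t] = np.convolve(frame, taps, mode="same")
    return out

# One frame of 8 toy log mel energies -> 8 frequency-filtered features.
print(frequency_filter(np.log([1, 2, 4, 8, 8, 4, 2, 1])))
```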
Qin Jin, Tsinghua University (China)
Luo Si, Tsinghua University (China)
Qixiu Hu, Tsinghua University (China)
This paper describes a high-performance text-independent speaker identification system. The system includes two subsystems: a closed-set speaker identification system and an open-set speaker identification system. In the implementation we introduce an advanced VQ method and a new distance estimation algorithm called BCDM (Based on Codes Distribution Method). In closed-set identification, the correct recognition rate is 98.5% with 50 speakers in the training set. In open-set identification, the equal error rate is 5% with 40 speakers in the training set.
Hiroshi Kido, Faculty of Engineering, Utsunomiya University and National Research Institute of Police Science (Japan)
Hideki Kasuya, Faculty of Engineering, Utsunomiya University (Japan)
As a first step toward the development of a "speech montage system", this paper attempts to derive a core set of Japanese epithets which are commonly used in everyday life to represent voice quality features associated with talker individuality. Perceptual experiments were conducted in which subjects were asked to evaluate sentence utterances recorded from a variety of male speakers in terms of 25 epithets which had been derived in another experiment [1] as indicative of voice quality relevant to talker individuality. The evaluation scores were subjected to a statistical clustering analysis, which showed that the 25 epithets could be grouped into eight categories for male subjects and seven for female subjects. These categories were basically the same as those obtained in the previous experiment [1], where subjects were required to evaluate their own voices with the same set of 25 epithets. The agreement between the results of the two experiments supports the reliability of the core epithet categories for representing voice quality associated with talker individuality.
Ji-Hwan Kim, Korea Advanced Institute of Science and Technology (Korea)
Gil-Jin Jang, Korea Advanced Institute of Science and Technology (Korea)
Seong-Jin Yun, Korea Advanced Institute of Science and Technology (Korea)
Yung Hwan Oh, Korea Advanced Institute of Science and Technology (Korea)
Log likelihood ratio normalisation and scoring methods have been studied by many researchers and have improved the performance of speaker identification systems. However, these approaches have disadvantages: the distorted speech segments that are recognised differ from speaker to speaker, and the background model used in log likelihood ratio normalisation changes from segment to segment even for the same speaker. This paper presents two techniques: first, candidate selection based on significance testing, which designs the background speaker model more accurately; and second, a scoring method which uses the same distorted speech segments for every speaker. We perform a number of experiments on the SPIDRE database.
Yuko Kinoshita, Department of Linguistics (Faculty of Arts) and Japan Centre (Faculty of Asian studies), Australian National University (Australia)
This paper explores non-contemporaneous within-speaker variation in a Japanese male speaker, focusing on the difference between speech styles, viz. natural speech and read-out speech. Recordings made under forensic conditions are mostly of natural speech. The suspect's recordings used for comparison, however, are sometimes read-out rather than natural speech, in order to obtain phonological conditions similar to those of the original criminal speech. This paper aims to examine the validity of such a procedure.
Filipp Korkmazskiy, Lucent Technologies, Bell Laboratories (USA)
Biing-Hwang Juang, Lucent Technologies, Bell Laboratories (USA)
In this paper, we propose a procedure for training a pronunciation network with criteria consistent with the optimality objectives of speech recognition systems. In particular, we describe a framework for using maximum likelihood (ML) and minimum classification error (MCE) criteria for pronunciation network optimization. The ML criterion is used to obtain an optimal structure for the pronunciation network based on statistically derived phonological rules. Discrimination among different pronunciation networks is achieved by weighting the pronunciation networks, optimized by applying the MCE criterion. Experimental results demonstrate improvements in speech recognition accuracy after applying statistically derived phonological rules. It is shown that the impact of the pronunciation network weighting on recognition performance is determined by the size of the recognition vocabulary.
Arne Kjell Foldvik, Department of Linguistics, NTNU (Norway)
Knut Kvale, Telenor R&D (Norway)
Traditional dialect maps are based on data from carefully selected informants, which usually results in clear-cut dialect borders (isoglosses), with a given dialect characteristic present on one side of the isogloss and absent on the other. We illustrate some of the problems and pitfalls of using dialect maps for ASR by comparing results from traditional dialect research with investigations of the Norwegian part of the European SpeechDat database, centred on the two main types of /r/ pronunciation. Our analysis shows that traditional dialect maps and surveys may be of limited use in ASR. The extent to which the Norwegian findings have parallels in other countries will depend on two main factors: dialect allegiance versus a national standard pronunciation, and the extent to which the population is sedentary or mobile. Results from traditional dialect research may therefore be more useful for ASR in languages other than Norwegian.
Youn-Jeong Kyung, KAIST (Korea)
Hwang-Soo Lee, KAIST, SK-telecom (Korea)
The acoustic properties that differentiate voices are difficult to separate from signal traits that reflect the identity of the sounds. There are two sources of variation among speakers: (1) differences in vocal cords and vocal tract shape, and (2) differences in speaking style. The latter includes variation both in the target vocal tract positions for phonemes and in dynamic aspects of speech, such as speaking rate. However, most parameters and features capture only the former. In this paper, we propose the use of a prosodic feature that represents the micro-prosody of utterances and show that it is robust in noisy environments. We also propose a combined model that uses both the spectral feature and the prosodic feature. In our experiments, this model provides robust speaker recognition in noisy environments.
Yoik Cheng, The Chinese University of Hong Kong (China)
Hong C. Leung, The Chinese University of Hong Kong (China)
This paper describes the use of speech fundamental frequency (F0) for speaker verification. Both Chinese and English are included in this study, with Chinese representing a tonal language and English a non-tonal language. An HMM-based speaker verification system has been developed, using features based on cepstral coefficients and the F0 contour. Four different techniques have been investigated in our experiments on the YOHO database and a similar Chinese speech database. It has been found that the pitch information reduces the equal error rate (EER) by 40.5% and 33.9% for Cantonese and English, respectively, suggesting that pitch information is important for speaker verification, and more so for tonal languages. We have also found that the pitch information is even more effective when represented in the log domain, resulting in an EER of 2.28% for Cantonese; this corresponds to a 54% reduction of the EER.
Weijie Liu, Laboratory for Information Technology, NTT Data Corporation (Japan)
Toshihiro Isobe, Laboratory for Information Technology, NTT Data Corporation (Japan)
Naoki Mukawa, Laboratory for Information Technology, NTT Data Corporation (Japan)
Score normalization has become necessary for speaker verification systems, but general principles leading to optimum performance are lacking. In this paper, theoretical analyses of optimum normalization are given. In light of these analyses, four existing methods, based on the likelihood ratio, cohorts, the a posteriori probability and pooled cohorts, are investigated. The performance of these methods in verification with known impostors, their robustness to different impostors, and the separability of the optimal threshold from the impostor model are discussed on the basis of experiments with a database of 100 speakers.
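To make the cohort-based variant concrete, here is a minimal sketch; averaging the top-N cohort scores is one common scheme among those compared in work like this, and the numbers below are invented.

```python
import numpy as np

def cohort_normalized_score(claim_score, cohort_scores, top_n=5):
    """Cohort normalization: subtract the mean score of the closest
    competing (cohort) speakers from the claimed speaker's score.

    claim_score   : log-likelihood of the utterance under the claimed model
    cohort_scores : log-likelihoods under the cohort speakers' models
    top_n         : how many of the best-scoring cohorts to average
                    (a common choice; papers compare several schemes)
    """
    best = sorted(cohort_scores, reverse=True)[:top_n]
    return claim_score - float(np.mean(best))

# The same utterance scored against the claimed model and 8 cohorts.
print(cohort_normalized_score(-95.0, [-110.0, -99.0, -120.0, -101.0,
                                      -130.0, -98.0, -125.0, -115.0]))
```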
Harvey Lloyd-Thomas, Ensigma (U.K.)
Eluned S. Parris, Ensigma (U.K.)
Jeremy H. Wright, AT&T (USA)
Recurrent phone substrings that are characteristic of a language are a promising technique for language recognition. In previous work on language recognition, building anti-models to normalise the scores from acoustic phone models for target languages was shown to reduce the equal error rate (EER) by a third. Recurrent substrings and anti-models have now been applied alongside three other techniques (bigrams, usefulness and frequency histograms) to the NIST 1996 Language Recognition Evaluation, using data from the CALLFRIEND and OGI databases for training. By fusing the scores from the different techniques with a multi-layer perceptron, the EER on the NIST data can be reduced further.
Konstantin P. Markov, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)
In speaker recognition, when cepstral coefficients are calculated from LPC analysis parameters, the LPC residual and pitch are usually ignored. This paper describes an approach that integrates the pitch and the LPC residual with the LPC cepstrum in a Gaussian mixture model based speaker recognition system. The pitch is represented as the logarithm of F0 and the LPC residual as an MFCC vector. A second aim of this research is to verify whether the correlation between the different information sources is useful for speaker recognition. The results show that adding the pitch gives a significant improvement only when the correlation between the pitch and the cepstral coefficients is used. Adding only the LPC residual also gives a significant improvement, but using its correlation with the cepstral coefficients has little additional effect. The best results achieved are a 98.5% speaker identification rate and a 0.21% speaker verification equal error rate, compared to 97.0% and 1.07% for the baseline system.
Konstantin P. Markov, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)
In this paper, we present a new discriminative training method for Gaussian mixture models (GMMs) and its application to text-independent speaker recognition. The objective of this method is to maximize the frame-level normalized likelihoods of the training data, which is why we call it Maximum Normalized Likelihood Estimation (MNLE). In contrast to other discriminative algorithms, the objective function is optimized using a modified Expectation-Maximization (EM) algorithm, which greatly simplifies the training procedure. Evaluation experiments using both clean and telephone speech showed improved recognition rates compared to speaker models trained by Maximum Likelihood Estimation (MLE), especially when the mismatch between the training and testing conditions is significant.
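Read literally, "frame-level normalized likelihood" suggests an objective of the following shape; this is a plausible reconstruction under that reading, not a formula quoted from the paper.

```latex
% Normalized-likelihood objective for target speaker s_0 among S speakers,
% summed over the T training frames x_t (reconstructed, not verbatim):
\[
  J(\lambda_{s_0}) \;=\; \sum_{t=1}^{T}
    \log \frac{p(\mathbf{x}_t \mid \lambda_{s_0})}
              {\sum_{s=1}^{S} p(\mathbf{x}_t \mid \lambda_{s})}
\]
% MNLE maximizes J over \lambda_{s_0} with a modified EM algorithm.
```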
Driss Matrouf, LIMSI/CNRS (France)
Martine Adda-Decker, LIMSI/CNRS (France)
Lori F. Lamel, LIMSI/CNRS (France)
Jean-Luc Gauvain, LIMSI/CNRS (France)
In this paper we explore the use of lexical information for language identification (LID). Our reference LID system uses language-dependent acoustic phone models and phone-based bigram language models. For each language, lexical information is introduced by augmenting the phone vocabulary with the N most frequent words in the training data. Combined phone and word bigram models are used to provide linguistic constraints during acoustic decoding. Experiments were carried out on a 4-language telephone speech corpus. Using lexical information achieves a relative error reduction of about 20% on spontaneous and read speech compared to the reference phone-based system. Identification rates of 92%, 96% and 99% are achieved for spontaneous, read and task-specific speech segments respectively, with prior speech detection.
Enric Monte, UPC (Spain)
Ramón Arqué, UPC (Spain)
Xavier Miró, UPC (Spain)
In speaker recognition systems based on VQ, each speaker is normally assigned a codebook, and classification is done by means of a distortion distance of the utterance computed with each codebook. In [1] we proposed a system which, instead of one codebook per speaker, uses a single codebook for all speakers plus one histogram per speaker. This histogram is the occupancy rate of each codeword for a given speaker, i.e., the probability that the speaker utters the information related to the codeword, so the normalized histogram approximates the pdf of each speaker. In this paper we present an exhaustive study of different measures for comparing histograms: Kullback-Leibler divergence, the log-difference of each probability, a geometrical distance, and the Euclidean distance. We have also studied exhaustively the properties of the system for each distance in the presence of noise (white and colored) and for different parameterizations: LPC, MFCC, LPC-Cepstrum-OSA (one-sided autocorrelation sequence) and LPC-Cepstrum (cepstrum with/without liftering). As the number of experimental combinations was high, the conclusions were drawn after an analysis of variance (ANOVA) and t-tests. Thus conclusions, with significance levels, can be drawn about the differences and interactions between kind of distance, parameterization, kind of noise and level of noise.
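Three of the four histogram measures are unambiguous and easy to state in code (the "geometrical distance" could mean several things, so it is omitted); the codebook size and histograms below are toy values.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two normalized histograms."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def log_difference(p, q, eps=1e-12):
    """Sum of absolute log-probability differences per codeword."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    return float(np.sum(np.abs(np.log(p / q))))

def euclidean(p, q):
    """Plain Euclidean distance between the histograms."""
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

# Occupancy histograms of a shared 4-codeword codebook for a test
# utterance and two enrolled speakers; pick the closest speaker.
test = np.array([0.1, 0.4, 0.3, 0.2])
speakers = {"A": np.array([0.05, 0.45, 0.3, 0.2]),
            "B": np.array([0.4, 0.1, 0.1, 0.4])}
print(min(speakers, key=lambda s: kl_divergence(test, speakers[s])))
```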
Asunción Moreno, Universitat Politécnica de Catalunya (Spain)
José B. Mariño, Universitat Politécnica de Catalunya (Spain)
It is well known that canonical Spanish, the 'central' dialectal variant of Spain known as Castilian, can be transcribed by rules. This paper deals with automatic grapheme-to-phoneme transcription rules for several Spanish dialects of Latin America. Spanish is a language spoken by more than 300 million people; it has a wide geographical dispersion compared with other languages and has historically been influenced by many native languages. In this paper the authors extend the Castilian transcription rules to a set of different dialectal variants of Latin America. Transcriptions are based on SAMPA symbols. The paper identifies sounds that do not appear in Castilian, extends the accepted SAMPA symbols for Spanish (Castilian) to the different dialectal variants, describes the rules necessary to implement automatic orthographic-to-phonetic transcription in several dialectal Spanish variants, and shows some quantitative results on dialectal differences.
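A deliberately tiny grapheme-to-phoneme fragment in SAMPA showing how dialect-dependent rules can be layered on a common Castilian rule base. The rules below are simplified textbook examples (e.g. "seseo": Castilian /T/ for soft c/z is realized as /s/ across most of Latin America), not the paper's rule set.

```python
def transcribe(word, dialect="castilian"):
    """Toy letter-to-SAMPA transcriber with two dialect-sensitive rules."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "c" and nxt in "ei":
            out.append("T" if dialect == "castilian" else "s")  # seseo
        elif ch == "z":
            out.append("T" if dialect == "castilian" else "s")  # seseo
        elif ch == "l" and nxt == "l":
            # "yeismo": /L/ merges with a palatal glide in many varieties
            out.append("L" if dialect == "castilian" else "jj")
            i += 1                     # consume the second "l"
        elif ch == "h":
            pass                       # orthographic h is silent
        else:
            out.append(ch)             # naive fallback: letter = phone
        i += 1
    return " ".join(out)

print(transcribe("cena", "castilian"))        # T e n a
print(transcribe("cena", "latin_american"))   # s e n a
print(transcribe("llave", "castilian"))       # L a v e
print(transcribe("llave", "latin_american"))  # jj a v e
```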
Mieko Muramatsu, Fukushima Medical University/Reading University (U.K.)
L1 transfer may explain prosodic errors in an L2. Several comparative studies of Japanese English prosody have been conducted. However, only oral reading texts have been used, and little attention has been paid to the effect of differences in the L1 dialect, especially "accentless" Japanese dialects. This preliminary study investigates the differences in prosodic L1 transfer to English between speakers of the Fukushima dialect (an accentless dialect) and the Tokyo dialect (an accented dialect) in declarative sentences and yes-no questions. A two-way communicative task was selected to induce natural utterances. The fundamental frequency of three female voices from each dialect group was measured at twenty equally spaced points of observation. The major finding is that there do appear to be dialectal differences in L1 transfer of prosody. However, this preliminary study is not conclusive, and a more comprehensive investigation will be necessary.
Hideki Noda, Kyushu Institute of Technology (Japan)
Katsuya Harada, Kyushu Institute of Technology (Japan)
Eiji Kawaguchi, Kyushu Institute of Technology (Japan)
Hidefumi Sawai, Communications Research Laboratory (Japan)
This paper is concerned with speaker verification (SV) using the sequential probability ratio test (SPRT). In the SPRT, input samples are usually assumed to be i.i.d. samples from a probability density function because an on-line probability computation is required. Feature vectors used in speech processing obviously do not satisfy this assumption, and therefore the correlation between successive feature vectors has not been considered in conventional SV using the SPRT. The correlation can be modeled by a hidden Markov model (HMM), but unfortunately the HMM cannot be applied directly to the SPRT because of the statistical dependence of the input samples. This paper proposes a method of HMM probability computation using the mean field approximation to resolve this problem, in which the probability of the whole input sequence is nominally represented as the product of the probabilities of each sample, as if the input samples were independent of each other.
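For context, here is Wald's SPRT in its standard form as used for verification. The paper's actual contribution, the mean-field approximation that lets correlated HMM frame scores feed this test, is not reproduced; the frame scores in the example are invented.

```python
import math

def sprt_verify(llr_per_frame, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test for speaker verification.

    llr_per_frame : iterable of per-frame log-likelihood ratios
                    log p(x_t | target) - log p(x_t | impostor)
    alpha, beta   : target false-acceptance / false-rejection rates

    Frames are consumed one at a time; the test stops as soon as the
    cumulative ratio crosses one of Wald's two thresholds.
    """
    upper = math.log((1 - beta) / alpha)   # accept the target claim
    lower = math.log(beta / (1 - alpha))   # reject as an impostor
    total, n = 0.0, 0
    for n, llr in enumerate(llr_per_frame, start=1):
        total += llr
        if total >= upper:
            return "accept", n
        if total <= lower:
            return "reject", n
    return "undecided", n

print(sprt_verify([0.8, 1.1, 0.9, 1.3, 1.0, 0.7]))  # -> ('accept', 5)
```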
Javier Ortega-García, Universidad Politecnica de Madrid (Spain)
Santiago Cruz-Llanas, Universidad Politecnica de Madrid (Spain)
Joaquin González-Rodríguez, Universidad Politecnica de Madrid (Spain)
With regard to speaker identity under forensic conditions, several factors of variability must be taken into account, such as peculiar intra-speaker variability, forced intra-speaker variability and channel-dependent external influences. Automatic speaker verification experiments have been carried out using the large 'AHUMADA' speech database in Spanish, which contains several recording sessions and channels and includes different tasks for 100 male speakers. Owing to the inherently non-cooperative nature of speakers in forensic applications, only text-independent recognizers are used. A GMM-based verification system is used to obtain quantitative results. Maximum likelihood estimation of the models is performed, and LPC cepstra with delta and delta-delta LPCC are used at the parameterization stage. With this baseline verification system, we intend to determine how some of the variability sources included in 'AHUMADA' affect speaker identification. Results covering the influence of speaking rate, single- and multi-session training, cross-channel testing, and kind of speech (read vs. spontaneous) are presented with likelihood-domain normalization applied.
Thilo Pfau, Institute for Human-Machine-Communication, Technical University of Munich (Germany)
Guenther Ruske, Institute for Human-Machine-Communication, Technical University of Munich (Germany)
This paper deals with the problem of building hidden Markov models (HMMs) suitable for fast speech. First, an automatic procedure is presented that splits the speech material into different categories according to speaking rate. The problem of the sparseness of the data available for estimating HMMs for fast speech is then discussed, followed by a comparison of different methods to overcome it. The main emphasis is on robust re-estimation techniques such as maximum a posteriori estimation (MAP), as well as on methods that reduce the variability of the speech signal and thereby allow the number of HMM parameters to be reduced; vocal tract length normalization (VTLN) is chosen for this purpose. Finally, various combinations of the methods discussed are compared on the basis of word error rates for fast speech. The best method (MAP combined with VTLN) reduces the error rate by 10% relative to the baseline system.
Tuan Pham, Faculty of Information Sciences & Engineering, University of Canberra (Australia)
Michael Wagner, Faculty of Information Sciences & Engineering, University of Canberra (Australia)
A nonlinear probabilistic relaxation labeling scheme for speaker identification is presented in this paper. This relaxation scheme, an iterative and parallel process, offers a flexible and effective framework for dealing with the uncertainty inherent in the labeling of speech feature vectors. Basic concepts and formulations of the relaxation algorithms are outlined. We then discuss how to apply the relaxation scheme to the labeling of speech feature vectors for the speaker identification task. The implementation is tested on the commercial speech corpus TI46. Across several codebook sizes, the results obtained with the proposed approach are more favorable than those of the conventional VQ (vector quantization) based method.
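A sketch of the generic nonlinear relaxation update (Rosenfeld-Hummel-Zucker style) applied to a chain of frames; the compatibility coefficients, neighbourhood structure and toy data are assumptions for illustration, and the paper's exact support function may differ.

```python
import numpy as np

def relaxation_labeling(p, compat, iterations=10):
    """Nonlinear probabilistic relaxation over a chain of objects.

    p      : (n_objects, n_labels) initial label probabilities
             (here: frames x candidate speaker labels)
    compat : (n_labels, n_labels) compatibility coefficients in [-1, 1]
             between labels of neighbouring objects

    Each iteration nudges every object's label distribution toward
    labels compatible with its neighbours' current beliefs, then
    renormalizes so each row stays a probability distribution.
    """
    p = np.asarray(p, dtype=float).copy()
    compat = np.asarray(compat, dtype=float)
    n = len(p)
    for _ in range(iterations):
        q = np.zeros_like(p)
        for i in range(n):
            # Support q_i: average compatibility with chain neighbours.
            neigh = [j for j in (i - 1, i + 1) if 0 <= j < n]
            q[i] = np.mean([compat @ p[j] for j in neigh], axis=0)
        p *= (1.0 + q)
        p /= p.sum(axis=1, keepdims=True)
    return p

# Three frames, two candidate speakers; compatibility favours agreement.
p0 = [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]
compat = [[0.5, -0.5], [-0.5, 0.5]]
print(relaxation_labeling(p0, compat))
```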
Leandro Rodríguez-Liñares, University of Vigo (Spain)
Carmen García-Mateo, University of Vigo (Spain)
In this paper we present a novel technique for combining a speaker verification system with an utterance verification system in a telephone-based speaker authentication system. Speaker verification consists of accepting or rejecting the claimed identity of a speaker by processing samples of his/her voice. Usually, these systems are based on HMMs that try to represent the characteristics of the talkers' vocal tracts. Utterance verification systems make use of a set of speaker-independent speech models to recognize a certain utterance and decide whether a speaker has uttered it or not. If the utterances are passwords, this can be used for identity verification. Until now, the two techniques have been used separately. This paper focuses on the problem of how to combine these two sources of information: a new architecture is presented that joins an utterance verification system and a speaker verification system in order to improve performance in a text-dependent speaker verification task.
Phil Rose, Department of Linguistics (Arts), Australian National University (Australia)
A forensic phonetic experiment is described which investigates the nature of non-contemporaneous within-speaker variation for six similar-sounding speakers. Between 8 and 10 intonationally varying tokens of the naturally produced single word utterance hello were elicited from six similar-sounding adult Australian males in two repeats separated by a reading of the "rainbow" passage. Both repeats are compared with a single batch of intonationally varying hello tokens recorded at least one year earlier. Within-speaker variation is quantified by ANOVA on mean non-contemporaneous differences and Scheffe's F for centre frequencies of the first 4 formants at 7 well-defined points in the word. Values for non-contemporaneous within-speaker between-token differences are also given, and their contribution to a Bayesian Likelihood Ratio is exemplified.
Astrid Schmidt-Nielsen, U.S. Naval Research Laboratory (USA)
Thomas H. Crystal, IDA Center for Communications Research (USA)
An experiment compared the speaker recognition performance of human listeners with that of computer algorithms/systems. Listening protocols were developed analogous to the procedures used in the algorithm evaluation run by the U.S. National Institute of Standards and Technology (NIST), and the same telephone conversation data were used. For "same number" testing with three-second samples, listener panels and the best algorithm had the same equal-error rate (EER) of 8%, and listeners were better than typical algorithms. For "different number" testing, EERs increased, but humans had a 40% lower equal-error rate. Other observations on human listening performance and robustness to "degradations" were made.
Stefan Slomka, Speech Laboratory, Queensland University of Technology (Australia)
Sridha Sridharan, Speech Laboratory, Queensland University of Technology (Australia)
Vinod Chandran, Speech Laboratory, Queensland University of Technology (Australia)
Input-level and output-level fusion methods are compared for fusing mel-frequency cepstral coefficients (MFCCs) with their corresponding delta coefficients. A 49-speaker subset of the King database is used under wideband and telephone conditions. The best input-level fusion system is more computationally complex than the output-level fusion system. Both input and output fusion systems were able to outperform the best purely MFCC-based system on wideband data; on King telephone data, only the output-level fusion system outperformed it. Further experiments using NIST'96 data under matched and mismatched conditions were also performed. Provided it was well tuned, the output-level fused system always outperformed the input-level fused system under all experimental conditions.
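The two fusion modes are easy to contrast in a few lines. The feature dimensions, delta computation and 0.5 placeholder weight below are assumptions for the example; the tuning of that weight is what "well tuned" refers to in practice.

```python
import numpy as np

def input_level_fusion(mfcc, delta):
    """Input-level fusion: concatenate MFCCs with their deltas so a
    single classifier sees one higher-dimensional feature vector."""
    return np.concatenate([mfcc, delta], axis=-1)

def output_level_fusion(score_mfcc, score_delta, weight=0.5):
    """Output-level fusion: run separate classifiers on each stream
    and combine their scores; `weight` is tuned on held-out data."""
    return weight * score_mfcc + (1.0 - weight) * score_delta

frames_mfcc = np.random.randn(100, 12)    # toy 12-dim MFCC frames
frames_delta = np.diff(frames_mfcc, axis=0, prepend=frames_mfcc[:1])
print(input_level_fusion(frames_mfcc, frames_delta).shape)  # (100, 24)
print(output_level_fusion(-1.2, -0.8))
```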
Hagen Soltau, Interactive Systems Laboratories, University of Karlsruhe (Germany), Carnegie Mellon University (USA) (Germany)
Alex Waibel, Interactive Systems Laboratories, University of Karlsruhe (Germany), Carnegie Mellon University (USA) (Germany)
Since we cannot rule out that speech recognizers will sometimes fail, it is important to examine how users react to recognition errors. In correction situations, speaking style becomes more accentuated in order to disambiguate the original mistake. We examine the effect of this speaking style on speech recognition performance. Our results indicate that hyperarticulation effects occur in correction situations and decrease word accuracy significantly.
Nuala C. Ward, Alcatel Australia (Australia)
Dominik R. Dersch, Department of Electrical Engineering, University of Sydney (Australia)
This paper presents a neural-network-inspired approach to speaker recognition using speaker models constructed from full data sets. A similarity measure between data sets is used for text-independent speaker identification and verification. In order to reduce the computational effort of calculating the similarity measure, a fuzzy vector quantisation procedure is applied. This method has previously been applied successfully to a database of 108 Australian English speakers. The purpose of this paper is to apply it to a larger benchmark database of 630 speakers (the TIMIT database). Using the full 630-speaker database, an accuracy of 98.2% (one test sentence) and 99.7% (two test sentences) was achieved for text-independent speaker identification. On a 462-speaker subset of the database, a 98.5% successful acceptance rate and a 96.9% successful rejection rate were achieved for text-independent speaker verification.
Lisa R. Yanguas, M.I.T. Lincoln Laboratory (USA)
Gerald C. O'Leary, M.I.T. Lincoln Laboratory (USA)
Marc A. Zissman, M.I.T. Lincoln Laboratory (USA)
In this paper we exploit linguistic knowledge to aid automatic dialect identification in Spanish. Segments of extemporaneous Cuban and Peruvian Spanish dialect data from the Miami Corpus were analyzed, and 49 linguistic features that occur at different rates in the two dialects were identified and hand-labelled. We evaluate the expected performance of the dialect detection system based on a theoretical model and compute the system's actual performance. Using a Gaussian classifier, we show that a subset of the 49 originally identified features achieves nearly perfect performance in discriminating between the two dialects. We compare these results with those from an automatic recognition system (PRLM-P). We then test this system in the limited domain of read digits from 0 through 10, using an orthographic transcription and hand-marked data for phone extraction and alignment. Initial experiments on phone-level segments show that phone duration and energy computations prove discriminatory for dialect identification.
Yiying Zhang, Department of Computer Science, Tsinghua University (China)
Xiaoyan Zhu, Department of Computer Science, Tsinghua University (China)
In this paper a new text-independent speaker verification method is proposed, based on likelihood score normalization and a global speaker model. The global speaker model is built to represent the universal features of speech and environment and is used to normalize the likelihood score. As a result, equal error rates are decreased significantly, the verification procedure is accelerated, and system adaptability is improved. Two possible ways of building the global speaker model, one of which can meet real-time requirements, are also suggested and discussed. Experiments demonstrate the effectiveness of this novel verification method and its improvement over the conventional method and other normalization methods.