ABSTRACT
In this paper, a hybrid network based on the combination of Radial Basis Function Networks (RBFNs) and Gaussian Mixture Models (GMMs) is proposed and applied to speaker recognition. The hybrid network is hierarchical: a GMM is built for each speaker and an RBFN is built for each group of speakers. The GMMs and RBFNs are trained independently. The RBFNs serve as a first-stage coarse classifier and the GMMs as the final classifier. For each RBFN, only the top few candidates are passed on to the final classification stage. The hybrid system is applied to speaker recognition on the SPIDRE database. Experiments were carried out to choose appropriate structures and parameters for the RBFNs and GMMs. With the RBFN stage, about 40% of the speakers were excluded without degrading performance. When the speaker sets that the GMMs confuse most are grouped into RBFNs, the RBFN stage improves GMM performance further.
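A minimal sketch of the coarse-to-fine idea described above (not the authors' implementation): an RBFN trained per speaker group prunes the candidate list, and the per-speaker GMMs score only the surviving candidates. The model interfaces, the candidate count N_CANDIDATES, and the feature shapes are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

N_CANDIDATES = 5  # how many top RBFN candidates reach the GMM stage (assumed value)

def identify(features, rbfn, speaker_gmms):
    """features: (T, D) array of frame vectors for one utterance.
    rbfn: coarse classifier exposing predict_scores(features) -> {speaker_id: score}
          (assumed interface).
    speaker_gmms: dict mapping speaker_id to a fitted GaussianMixture."""
    # Stage 1: coarse RBFN scoring, keep only the best few candidates.
    coarse = rbfn.predict_scores(features)
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:N_CANDIDATES]

    # Stage 2: fine GMM scoring restricted to the shortlisted speakers.
    best_id, best_ll = None, -np.inf
    for spk in candidates:
        ll = speaker_gmms[spk].score(features)  # mean log-likelihood per frame
        if ll > best_ll:
            best_id, best_ll = spk, ll
    return best_id
```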
ABSTRACT
The Gaussian mixture speaker model (GMM) is usually trained with the expectation-maximization (EM) algorithm to maximize the likelihood (ML) of observation data from an individual class. A GMM trained under the ML criterion has weak discriminative power when used as a classifier. In this paper, a discriminative training procedure is proposed to fine-tune the parameters of the GMMs. The goal of the training is to reduce the number of misclassified vector groups. Since a vector group can be thought of as being derived from a short sentence, this training procedure optimizes speaker identification performance more directly. Although the algorithm itself is based on a heuristic idea, it works well on many practical problems, and training is very fast. In an evaluation experiment on the YOHO database, when each speaker is modeled with 8 mixtures, the identification rate increases from 83.8% to 92.4% after applying this discriminative training algorithm.
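For orientation, here is a minimal sketch of the segment-level error count that the discriminative step aims to reduce; the heuristic parameter update itself is not shown. The group length and the model interface are assumptions.

```python
from sklearn.mixture import GaussianMixture

GROUP_LEN = 100  # frames per "vector group", roughly a short sentence (assumed value)

def misclassified_groups(utterance, true_spk, speaker_gmms: dict[str, GaussianMixture]):
    """Split an utterance (T, D) into fixed-length vector groups and count how
    many groups a maximum-likelihood GMM classifier assigns to the wrong speaker."""
    errors = 0
    for start in range(0, len(utterance) - GROUP_LEN + 1, GROUP_LEN):
        group = utterance[start:start + GROUP_LEN]
        # Score the whole group against every speaker model (mean log-likelihood).
        scores = {spk: gmm.score(group) for spk, gmm in speaker_gmms.items()}
        if max(scores, key=scores.get) != true_spk:
            errors += 1
    return errors
```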
ABSTRACT
This paper compares two approaches to background model representation for a text-independent speaker verification task using Gaussian mixture models. We compare speaker-dependent background speaker sets with the use of a universal, speaker-independent background model (UBM). For the UBM, we describe how Bayesian adaptation can be used to derive claimant speaker models, providing a structure that leads to significant computational savings during recognition. Experiments are conducted on the 1996 NIST Speaker Recognition Evaluation corpus and clearly show that a system using a UBM and Bayesian adaptation of claimant models outperforms both speaker-dependent background sets and the UBM with independently trained claimant models. In addition, we describe the creation and use of a telephone handset-type detector and a procedure called hnorm, which yield further, large improvements in verification performance, especially under the difficult mismatched-handset conditions. This is believed to be the first application of a handset-type detector and explicit handset-type normalization to the speaker verification task.
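A sketch of one common form of Bayesian (MAP) adaptation in GMM-UBM systems: the claimant model is derived from the UBM by adapting the mixture means toward the enrollment data, weighted by a relevance factor. The relevance factor value is an assumption, and the paper's exact adaptation formulas may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm: GaussianMixture, X: np.ndarray, relevance: float = 16.0):
    """Return MAP-adapted means for a claimant model given enrollment frames X (T, D)."""
    resp = ubm.predict_proba(X)                           # (T, M) mixture posteriors
    n_m = resp.sum(axis=0)                                # (M,) soft counts per mixture
    ex_m = resp.T @ X / np.maximum(n_m[:, None], 1e-10)   # (M, D) data-dependent means
    alpha = n_m / (n_m + relevance)                       # (M,) adaptation coefficients
    # Interpolate between the data-dependent estimate and the UBM prior mean.
    return alpha[:, None] * ex_m + (1.0 - alpha[:, None]) * ubm.means_
```

Because every claimant model shares the UBM's mixture structure, only the mixtures that score highest under the UBM need to be re-evaluated for each claimant at test time, which is the source of the computational savings mentioned above.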
ABSTRACT
New methods for speaker verification that address the problems of limited training data and an unknown telephone channel are presented. We describe a system for studying the feasibility of telephone-based voice signatures for electronic documents that uses speaker verification with a fixed test phrase but very limited data for training speaker models. We examine three methods for speaker verification that address these characteristics in different ways: text-independent mixture models, a broad phonetic category model that has some of the properties of both text-dependent and text-independent approaches, and a text-dependent approach based on speaker adaptation. The speaker-adaptive approach is shown to perform significantly better when the training and test channel conditions are mismatched, resulting in better overall performance across all conditions.
ABSTRACT
This paper summarizes the main results from the Speaker Verification (SV) research pursued so far in the CAVE project. Different state-of-the-art SV algorithms were implemented in a common HMM framework and compared on two databases: YOHO (office environment speech) and SESP (telephone speech). This paper is concerned with the different design issues for LR-HMM-based SV algorithms that emerged from our investigations and led to our current SV system, which delivers Equal Error Rates below 0.5% on a very realistic telephone speech database.
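A minimal sketch of the likelihood-ratio (LR) decision rule underlying such HMM-based verifiers. The model interface (a score method returning a per-frame log-likelihood) and the default threshold are assumptions; in practice the threshold is tuned, for example to the equal-error-rate operating point.

```python
def verify(features, claimant_model, background_model, threshold=0.0):
    """Accept the claimed identity if the per-frame log-likelihood ratio of the
    claimant model over the background model exceeds the decision threshold."""
    llr = claimant_model.score(features) - background_model.score(features)
    return llr > threshold
```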
ABSTRACT
In this paper we investigate the impact of the signal and channel coding in GSM cellular telephone networks on the performance of Speaker Verification (SV) systems. In this study only the effects of the codec are investigated. This is done by transcoding the signals in an existing speech corpus, recorded in the fixed network, to GSM. We compared the text-dependent SV performance of systems trained with A-law speech and tested with A-law and GSM speech, as well as systems trained and tested with GSM speech. All SV systems compared were based on continuous-density Gaussian mixture HMMs, differing in acoustic resolution. We compared several parameter representations derived from FFT- and LPC-based spectral estimates. It is shown that (and why) LPC-based estimates are to be preferred. It is also shown that it pays to extend the analysis bandwidth to the full 4 kHz offered by the digital telephone network. The major conclusion of our research is that the impact of GSM coding on the parameter representations is marginal and can effectively be ignored.
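As an illustration of the two kinds of spectral estimate being contrasted, the following sketch computes an FFT periodogram and an LPC (all-pole) envelope for a single analysis frame. It assumes librosa and SciPy are available; the frame length, LPC order, sampling rate, and any cepstral post-processing used in the paper are not reproduced here.

```python
import numpy as np
import librosa
from scipy.signal import freqz

def spectral_estimates(frame, sr=8000, lpc_order=12):
    """frame: 1-D array of speech samples for one analysis window (assumed setup)."""
    windowed = frame * np.hamming(len(frame))
    # FFT-based estimate: periodogram over the full 0-4 kHz band.
    fft_power = np.abs(np.fft.rfft(windowed)) ** 2
    # LPC-based estimate: power response of the all-pole model 1/A(z)
    # (gain term omitted, so only the envelope shape is comparable).
    a = librosa.lpc(windowed, order=lpc_order)
    _, h = freqz(1.0, a, worN=len(fft_power), fs=sr)
    lpc_power = np.abs(h) ** 2
    return fft_power, lpc_power
```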