ABSTRACT
A hybrid neural network is proposed for speaker verification (SV). The basic idea of this system is the use of vector quantization preprocessing as the feature extractor. The experiments were carried out using a neural network model (NNM) with frame labelling performed from a client codebook, known as NNM-C. Improved performance for NNM-C with more inputs and proper alignment of the speech signals supports the hypothesis that a more detailed representation of the speech patterns is helpful to the system. The flexibility of this system allows an equal error rate (EER) of 11.2% on a single isolated digit and 0.7% on a sequence of 12 isolated digits. This paper also compares the neural network speaker verification system with more conventional methods such as hidden Markov models.
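The vector quantization preprocessing described above labels each speech frame with its nearest client codebook entry before the frames reach the neural network. A minimal sketch of that labelling step, assuming Euclidean distance and generic feature vectors (the paper's exact features and codebook size are not specified here):

```python
import numpy as np

def label_frames(frames, codebook):
    """Assign each feature frame to its nearest client codebook vector.

    frames:   (T, D) array of per-frame feature vectors
    codebook: (K, D) array of client codebook centroids
    Returns the index of the nearest centroid for each frame.
    """
    # Squared Euclidean distance from every frame to every centroid
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

frames = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.1]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = label_frames(frames, codebook)  # → [0, 1, 1]
```

The resulting label sequence is what a system like NNM-C would feed to the network in place of (or alongside) the raw frames.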
ABSTRACT
Traditionally, speaker authentication has focused on two categories of techniques: speaker verification and speaker identification. In this paper, we introduce a third category called verbal information verification (VIV), in which a claimed speaker's utterances are verified against the key information in the speaker's registered profile to decide whether the claimed identity should be accepted or rejected. The proposed VIV technique can be used independently or combined with traditional speaker verification techniques to achieve flexible and improved speaker authentication. Instead of accomplishing VIV by recognizing the key information, the proposed VIV algorithm is based on the concept of sequential utterance verification. In a telephone speaker authentication experiment on 100 speakers, using three pass-utterances in response to three categories of questions, the proposed VIV system achieved a 0.00% equal-error rate, compared to a 30% false rejection rate for an automatic speech recognition approach.
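The sequential utterance verification idea can be sketched as a chain of per-utterance accept/reject decisions: each pass-utterance must individually verify against the profile, and the claim is rejected at the first failure. The score scale and threshold below are illustrative assumptions, not the paper's actual verification statistic:

```python
def verify_identity(utterance_scores, threshold=0.5):
    """Sequential utterance verification (sketch).

    Each pass-utterance score must individually exceed the
    verification threshold; reject on the first failure and
    accept only if every utterance passes.
    """
    for score in utterance_scores:
        if score < threshold:
            return False  # reject as soon as one utterance fails
    return True

accepted = verify_identity([0.9, 0.8, 0.7])  # all three pass-utterances verify
rejected = verify_identity([0.9, 0.3, 0.7])  # fails at the second utterance
```

Checking utterances sequentially lets the system stop early on impostors rather than recognizing the full answer content.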
ABSTRACT
The main goal of this work is to develop a competitive segment-based speaker verification system that is computationally efficient. To achieve our goal, we modified SUMMIT [12] to suit our needs. The speech signal was first transformed into a hierarchical segment network using frame-based measurements. Next, acoustic models for 168 speakers were developed for a set of 6 broad phoneme classes. The models represented feature statistics with diagonal Gaussians, preceded by principal component analysis. The feature vector included segment-averaged MFCCs, plus three prosodic measurements: energy, fundamental frequency (F0), and duration. The size and content of the feature vector were determined through a greedy algorithm while optimizing overall speaker verification performance. We were able to achieve a performance of 2.74% equal error rate (EER) using cohorts during testing, and 1.59% EER using all speakers during testing. We reduced computation significantly through the use of a small number of features, a small number of phonetic models per speaker, few model parameters, and few competing speakers during testing (when cohorts are used).
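With diagonal-covariance Gaussians, scoring a segment-averaged feature vector reduces to a cheap per-dimension sum, which is one source of the computational savings the abstract cites. A minimal sketch of the per-model log-likelihood (the actual feature set and model topology follow the paper, not this snippet):

```python
import math

def diag_gaussian_loglik(x, mean, var):
    """Log-likelihood of feature vector x under a diagonal-covariance
    Gaussian. Segment-based verification sums this quantity over the
    segments assigned to each broad phoneme class of a speaker model."""
    ll = 0.0
    for xi, mi, vi in zip(x, mean, var):
        # each dimension contributes independently under a diagonal covariance
        ll += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return ll
```

Because dimensions are treated independently, the cost is linear in the feature-vector size, so the greedy feature selection directly controls scoring cost.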
ABSTRACT
This paper describes a system for controlling access to web resources built using well-known speaker verification techniques. We describe the implementation of a speech verification server and an associated authentication module for the Apache web server. Speaker verification requires two inputs: a sample of the user's speech and an identity claim for the user, typically the user's name. However, a more convenient system would not require a user name to be entered. We present the results of an attempt to implement speech-only authentication using open-set speaker identification. We explore the effect of database size on performance.
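Open-set speaker identification replaces the explicit identity claim with a search over enrolled speakers plus a rejection option: pick the best-matching speaker, but reject if even that match is too weak. A sketch under assumed score conventions (speaker names and threshold are illustrative):

```python
def open_set_identify(scores_by_speaker, threshold):
    """Open-set speaker identification (sketch).

    scores_by_speaker: dict mapping enrolled speaker name -> match score
    Returns the best-scoring speaker, or None (reject) when even the
    best score falls below the acceptance threshold.
    """
    best_speaker = max(scores_by_speaker, key=scores_by_speaker.get)
    if scores_by_speaker[best_speaker] < threshold:
        return None  # no enrolled speaker matches well enough
    return best_speaker

hit = open_set_identify({"alice": 0.9, "bob": 0.4}, threshold=0.5)   # "alice"
miss = open_set_identify({"alice": 0.9, "bob": 0.4}, threshold=0.95)  # None
```

The search over all enrolled speakers is why database size matters: both accuracy and cost degrade as the enrolled population grows.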
ABSTRACT
The problem of how to prompt a client for a password in an automatic, prompted speaker verification system is addressed. Text-prompting of four-digit sequences is compared to speech-prompting of the same sequences, and speech-prompting of four digits is compared to speech-prompting of five digits. Speech recordings are analyzed by comparing speaker verification performance and by inspecting the number and type of speaking errors that subjects made. From the experiment it is clear that text-prompting gives the subjects an easier task, and fewer speaking errors are produced in that context. When enrolling clients with text-prompted speech and performing verification with an HMM-based system, the average EER was larger for speech-prompted items than for text-prompted items, but changes in individual EERs vary across the test population.
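The comparisons above all rest on the equal error rate, the operating point where the false-rejection and false-acceptance rates coincide. A simple threshold-sweep sketch of that computation (an illustrative approximation, not the paper's evaluation code):

```python
def equal_error_rate(client_scores, impostor_scores):
    """Approximate the EER by sweeping a decision threshold over the
    pooled scores and returning the FRR/FAR midpoint at the threshold
    where the two rates are closest."""
    best = None
    for t in sorted(client_scores + impostor_scores):
        frr = sum(s < t for s in client_scores) / len(client_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2)
    return best[1]

eer = equal_error_rate([0.9, 0.8, 0.7, 0.6], [0.4, 0.3, 0.2, 0.1])  # → 0.0
```

With perfectly separated client and impostor scores, as in this toy example, the sweep finds a threshold where both error rates are zero.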
ABSTRACT
A novel approach to scoring Gaussian mixture models is presented. Feature vectors are assigned to the individual Gaussians making up the model, and log-likelihoods of the separate Gaussians are computed and summed. Furthermore, the log-likelihoods of the individual Gaussians can be decomposed into sample weight, mean, and covariance log-likelihoods. Correlation likelihoods can also be computed. The results of the various systems are comparable on text-independent speaker recognition experiments despite the fact that the models and scoring are all quite different. By decomposing log-likelihoods of models into various sample statistic log-likelihoods, it is possible to diagnose which part of the model has the greatest discriminative power, whether the location of the Gaussians or their shapes.
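The assignment step described above can be sketched as hard-assigning each frame to its best-matching Gaussian and summing that component's log-likelihood, in contrast to the usual soft log-sum over all components. This is a minimal illustration of the assignment idea only; the paper's further decomposition into weight, mean, and covariance terms is not shown:

```python
import numpy as np

def per_gaussian_loglik(frames, means, variances, weights):
    """Score a GMM by assigning each frame to its most likely
    component (weight included) and summing only that component's
    log-likelihood, normalized by the number of frames."""
    total = 0.0
    for x in frames:
        # log-likelihood of x under each diagonal-covariance component
        lls = [np.log(w)
               - 0.5 * np.sum(np.log(2 * np.pi * v) + (x - m) ** 2 / v)
               for m, v, w in zip(means, variances, weights)]
        total += max(lls)  # keep only the best-matching Gaussian
    return total / len(frames)

score = per_gaussian_loglik([np.array([0.0])],
                            [np.array([0.0])],
                            [np.array([1.0])],
                            [1.0])
```

Scoring components separately is what makes the subsequent diagnostic decomposition possible: each Gaussian's contribution to the total can be inspected on its own.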