%% Tutorial on text-dependent speaker recognition
% In this tutorial, we shall explain the basics of text-dependent speaker
% identification, using MFCC as the features and DTW as the comparison
% method. The dataset is available upon request.

%% Preprocessing
% Before we start, let's add the necessary toolboxes to the search path of MATLAB:
addpath d:/users/jang/matlab/toolbox/utility
addpath d:/users/jang/matlab/toolbox/sap
addpath d:/users/jang/matlab/toolbox/machineLearning

%%
% Note that all the above toolboxes can be downloaded from the author's
% website.

%%
% For compatibility, here we list the platform and MATLAB version that we
% used to run this script:
fprintf('Platform: %s\n', computer);
fprintf('MATLAB version: %s\n', version);
scriptStartTime=tic;

%% Options for speaker identification
% First of all, all the options for our speaker identification can be
% obtained as follows:
sidOpt=sidOptSet

%%
% The contents of sidOpt can be shown next:
type sidOptSet.m

%%
% And we need to create the output folder if it does not exist yet:
if ~exist(sidOpt.outputDir, 'dir')
	fprintf('Creating %s...\n', sidOpt.outputDir);
	mkdir(sidOpt.outputDir);	% mkdir also creates any missing intermediate folders
end

%% Dataset construction
% Here are some facts about the speaker-ID corpus used in this tutorial:
%
% * All the audio clips were recorded with a single mobile phone to reduce channel distortion.
% * Each person recorded each of 10 speech passwords 3 times, leading to a total of 30 clips per person.
% * Close to 100 persons participated in the recordings.
% * There are 2 recording sessions, one week apart. We shall use the recordings of session 1 as the training data and those of session 2 as the test data.
%
% To collect all the audio files and perform feature extraction, we can
% invoke "sidFeaExtract" as follows:
tic
[speakerSet1, speakerSet2]=sidFeaExtract(sidOpt);
fprintf('Elapsed time = %g sec\n', toc);

%%
% Note that the extracted data will be stored as a MAT file for future use.
% (A sketch of this caching pattern is given in the appendix at the end of
% this page.)
%
% To evaluate the performance, we can invoke "sidPerfEval":
tic
[overallRr, speakerSet1, time]=sidPerfEval(speakerSet1, speakerSet2, sidOpt, 1);
fprintf('Elapsed time = %g sec\n', toc);
fprintf('Saving %s/speakerSet1.mat...\n', sidOpt.outputDir);
save(fullfile(sidOpt.outputDir, 'speakerSet1'), 'speakerSet1');

%% Post analysis
% We can display the scatter plots based on the features of each sentence,
% in order to visualize whether it is possible to separate "bad" utterances
% from "good" ones:
sentence=[speakerSet1.sentence];
DS.input=[[sentence.meanVolume]; [sentence.meanClarity]; [sentence.medianPitch]; [sentence.minDistance]; [sentence.frameNum]];
DS.inputName={'meanVolume', 'meanClarity', 'medianPitch', 'minDistance', 'frameNum'};
DS.input=inputNormalize(DS.input);
DS.output=2-[sentence.correct];	% Map correct (1/0) to class index (1/2)
dsProjPlot2(DS); figEnlarge;
print(fullfile(sidOpt.outputDir, 'dtwDataDistribution'), '-dpng');

%% Summary
% This is a brief tutorial on text-dependent speaker recognition based on
% DTW. (Minimal sketches of the matching and decision steps are given in
% the appendix at the end of this page.) There are several directions for
% further improvement:
%
% * Explore other features, such as magnitude spectra or filter-bank coefficients.
% * Try flexible anchor positions for DTW.
% * Use rejection to improve the overall accuracy of the system.
%

%% Appendix
% List of functions used in this script:
%
% * <../list.asp List of all files>
%

%%
% Overall elapsed time:
toc(scriptStartTime)

%%
% , created on
datetime

%%
% If you are interested in the original MATLAB code for this page, you can
% type "grabcode(URL)" under MATLAB, where URL is the web address of this
% page.
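%% Appendix: minimal sketches of the main steps
% The sketches below illustrate what the toolbox functions used above do
% internally, as referenced earlier on this page. They are illustrative
% only: they rely on stock MATLAB functions instead of the author's SAP
% toolbox routines, and every file or variable name introduced here (and
% not seen earlier in this script) is a hypothetical placeholder.
%
% First, a sketch of the caching pattern behind "sidFeaExtract": extract
% the features once, save them as a MAT file, and simply reload them on
% later runs. The cache file name "speakerSet.mat" is a hypothetical choice:
matFile=fullfile(sidOpt.outputDir, 'speakerSet.mat');
if exist(matFile, 'file')
	fprintf('Loading cached features from %s...\n', matFile);
	load(matFile, 'speakerSet1', 'speakerSet2');
else
	[speakerSet1, speakerSet2]=sidFeaExtract(sidOpt);	% Slow first run
	save(matFile, 'speakerSet1', 'speakerSet2');	% Cache for future use
end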
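%%
% Next, the core matching step: convert two utterances of the same speech
% password into MFCC streams and measure their DTW distance. This sketch
% uses the stock "mfcc" (Audio Toolbox) and "dtw" (Signal Processing
% Toolbox) functions as stand-ins for the SAP toolbox routines, and the
% two wave file names are hypothetical:
[y1, fs1]=audioread('trainClip.wav');	% Hypothetical training clip
[y2, fs2]=audioread('testClip.wav');	% Hypothetical test clip
c1=mfcc(mean(y1, 2), fs1);	% MFCC matrix of size frameNum1 x coefNum (forced to mono)
c2=mfcc(mean(y2, 2), fs2);	% MFCC matrix of size frameNum2 x coefNum
% "dtw" aligns the two streams frame by frame; it expects one column per
% frame, hence the transposes. A smaller distance indicates a better match.
minDistance=dtw(c1', c2');
fprintf('DTW distance = %g\n', minDistance);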
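%%
% Finally, identification reduces to a nearest-template search: the test
% utterance is assigned to the speaker whose training clip of the same
% password yields the smallest DTW distance. This sketch assumes a
% hypothetical cell array "trainMfcc" holding one MFCC matrix per training
% clip, and a vector "trainSpeaker" of the corresponding speaker indices:
dist=zeros(1, length(trainMfcc));
for i=1:length(trainMfcc)
	dist(i)=dtw(trainMfcc{i}', c2');	% Compare the test clip with each training clip
end
[minDist, best]=min(dist);
fprintf('Predicted speaker = %d, with DTW distance = %g\n', trainSpeaker(best), minDist);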