Tutorial on text-dependent speaker recognition

In this tutorial, we shall explain the basics of text-dependent speaker identification, using MFCC as the features and DTW as the comparison method. The dataset is available upon request.
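
Before diving into the toolbox-based script, here is a minimal, self-contained sketch of the core idea (not the toolbox implementation): each enrolled utterance is represented as a matrix of MFCC vectors (one column per frame), a test utterance is compared against every enrolled template by DTW, and the speaker of the closest template wins. The function and variable names below (demoDtwSid, simpleDtw, etc.) are made up for illustration only; the actual experiments use the toolbox function specified later by sidOpt.dtwFunc.

function demoDtwSid
% demoDtwSid: Toy illustration of DTW-based, text-dependent speaker ID.
% The "MFCC" matrices below are random placeholders; in the tutorial itself
% they come from the feature extraction step shown later.
rng(0);
template{1}=randn(13, 40);                        % enrolled utterance of speaker 1
template{2}=randn(13, 55);                        % enrolled utterance of speaker 2
test=template{2}(:, 1:50)+0.1*randn(13, 50);      % noisy, shortened copy of speaker 2
dist=cellfun(@(ref) simpleDtw(ref, test), template);
[minDist, predicted]=min(dist);
fprintf('Predicted speaker = %d (DTW distance = %g)\n', predicted, minDist);

function cost=simpleDtw(a, b)
% simpleDtw: DTW distance between two feature matrices (one column per frame),
% using horizontal/diagonal/vertical local steps (roughly the 0-45-90 style).
na=size(a, 2); nb=size(b, 2);
local=zeros(na, nb);
for i=1:na
	for j=1:nb
		local(i, j)=norm(a(:, i)-b(:, j));        % frame-to-frame Euclidean distance
	end
end
D=inf(na, nb); D(1, 1)=local(1, 1);               % accumulated-cost table
for i=1:na
	for j=1:nb
		if i==1 && j==1, continue; end
		prev=inf;
		if i>1, prev=min(prev, D(i-1, j)); end
		if j>1, prev=min(prev, D(i, j-1)); end
		if i>1 && j>1, prev=min(prev, D(i-1, j-1)); end
		D(i, j)=local(i, j)+prev;
	end
end
cost=D(na, nb);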

Preprocessing

Before we start, let's add necessary toolboxes to the search path of MATLAB:

addpath d:/users/jang/matlab/toolbox/utility
addpath d:/users/jang/matlab/toolbox/sap
addpath d:/users/jang/matlab/toolbox/machineLearning

Note that all the above toolboxes can be downloaded from the author's toolbox page.

For compatibility, here we list the platform and MATLAB version that we used to run this script:

fprintf('Platform: %s\n', computer);
fprintf('MATLAB version: %s\n', version);
scriptStartTime=tic;
Platform: PCWIN64
MATLAB version: 8.5.0.197613 (R2015a)

Options for speaker identification

First of all, all the options for our speaker identification task can be obtained as follows:

sidOpt=sidOptSet
sidOpt = 

                      method: 'dtw'
                    useGtEpd: 0
                      epdFcn: 'epdByVol'
                     auDir01: 'd:\dataSet\mir-2010-speakerId_label\session01'
                     auDir02: 'd:\dataSet\mir-2010-speakerId_label\session02'
                   outputDir: 'output/mir-2010-speakerId_label'
                errorLogFile: 'output/mir-2010-speakerId_label/error.log'
               maxSpeakerNum: Inf
       sentenceNumPerSpeaker: Inf
                     mixData: 0
                     feaType: 'mfcc'
                   useEnergy: 1
            temporalNormMode: 1
            feaTransformMode: 0
           transformedFeaDim: 10
    useWaveformNormalization: 0
                   useIntFea: 0
                     dtwFunc: 'dtw2'
    useDtwPartialComputation: 1

The contents of sidOptSet.m are listed next:

type sidOptSet.m
function sidOpt=sidOptSet
% sidOptSet: Set parameters for speaker identification/verification 

% ====== Add necessary paths
addpath d:/users/jang/matlab/toolbox/utility
addpath d:/users/jang/matlab/toolbox/machineLearning
addpath d:/users/jang/matlab/toolbox/sap
addpath d:/users/jang/matlab/toolbox/asr -end

% ====== Method
sidOpt.method='dtw';			% 'dtw' (for text-dependent speaker ID) or 'gmm' (for text-independent speaker ID)
sidOpt.useGtEpd=0;				% Use ground-truth (GT) EPD labeled by humans; only available for "mir-2010-speakerId_label"
sidOpt.epdFcn='epdByVol';		% epdByVol is better for GMM.

% ====== Corpus directories
%sidOpt.auDir01='d:/dataset/鈦映科技-2008-語者辨識/第一次錄音檔(100人)';	% Session 1
%sidOpt.auDir02='d:/dataset/鈦映科技-2008-語者辨識/第二次錄音檔(100人)';	% Session 2
sidOpt.auDir01='d:\dataSet\mir-2010-speakerId_label\session01';		% Session 1, with GT endpoints
sidOpt.auDir02='d:\dataSet\mir-2010-speakerId_label\session02';		% Session 2, with GT endpoints
% ====== Output dirs and files
[parentDir, mainName]=fileparts(fileparts(sidOpt.auDir01));
sidOpt.outputDir=['output/', mainName];		% Output directory, such as 'mir-2010-speakerId_label'
sidOpt.errorLogFile=[sidOpt.outputDir, '/error.log'];
% ====== Use partial data for fast verification
sidOpt.maxSpeakerNum=3;
sidOpt.maxSpeakerNum=inf;
sidOpt.sentenceNumPerSpeaker=3;		% Use 3 utterances (1 password) per person for fast computation
sidOpt.sentenceNumPerSpeaker=inf;
% === Evaluation mode
sidOpt.mixData=0;	% 1 for mix mode (two datasets are mixed, odd-indexed for training and even-indexed for testing)

% ====== Feature/waveform related
% === Feature type
sidOpt.feaType='mfcc';		% 'mfcc', 'spectrum', 'volume', 'pitch'
sidOpt.useEnergy=1;			% mfcc dim=13 if useEnergy = 1; mfcc dim=12 if useEnergy=0
% === Temporal normalization mode
sidOpt.temporalNormMode=1;		% 0: nothing, 1: CMS (cepstrum mean subtraction), 2: CN (cepstrum normalization)
% === Feature transformation mode
sidOpt.feaTransformMode=0;		% 0: nothing, 1: LDA, 2: PCA
sidOpt.transformedFeaDim=10;	% This is used only when the above option is 1 or 2
% === Feature/waveform options
sidOpt.useWaveformNormalization=0;	% 0 or 1
% === Use integer feature for embedded devices
sidOpt.useIntFea=0;			% Use integer feature (MFCC) for smartphone

switch(sidOpt.method)
	case 'gmm'	% ====== GMM parameters
		% === GMM model parameters
		sidOpt.exponent4gaussianNum=1:8;
		sidOpt.gaussianNums=2.^sidOpt.exponent4gaussianNum;
		sidOpt.covType=1;
		% === GMM training parameters
		sidOpt.gmmTrainParam.dispOpt=0;
		sidOpt.gmmTrainParam.useKmeans=1;
		sidOpt.gmmTrainParam.maxIteration=50;
		% === Use integer GMM for smartphone
		sidOpt.useIntGmm=0;
	case 'dtw'	% ====== DTW parameters
		% === DTW type
		sidOpt.dtwFunc='dtw2';		% Set this to 'dtw1' (27-45-63) or 'dtw2' (0-45-90)
		sidOpt.useDtwPartialComputation=1;	% Speedup via partial computation (does not speed up significantly)
	otherwise
		error('Unknown method!');
end
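
As a side note on temporalNormMode, option 1 corresponds to cepstral mean subtraction (CMS), which removes the per-coefficient mean over time to reduce channel mismatch between recording sessions. The following lines are an illustration only (with a placeholder MFCC matrix); the actual normalization is applied inside the toolbox's feature extraction. Mode 2 (CN) presumably also divides by the per-coefficient standard deviation.

mfcc=randn(13, 100);                               % placeholder MFCC matrix, one column per frame
cepMean=mean(mfcc, 2);                             % mean of each cepstral coefficient over time
mfccCms=mfcc-repmat(cepMean, 1, size(mfcc, 2));    % CMS: subtract the mean from every frame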

We also need to create the output folder if it does not exist:

if ~exist(sidOpt.outputDir, 'dir'), fprintf('Creating %s...\n', sidOpt.outputDir); mkdirs(sidOpt.outputDir); end

Dataset construction

Here are some facts about the speaker-ID corpus used in this tutorial: session 1 contains recordings from 107 speakers and session 2 from 98 speakers, with 30 recordings per speaker in each session (as shown in the feature-extraction log below).

To collect all the audio files and perform feature extraction, we can invoke "sidFeaExtract" as follows:

tic
[speakerSet1, speakerSet2]=sidFeaExtract(sidOpt);
fprintf('Elapsed time = %g sec\n', toc);
Get wave info of 107 persons from auDir01=d:\dataSet\mir-2010-speakerId_label\session01
1/107: Feature extraction from 30 recordings by Kannan#1 ===> 6.03615 sec
2/107: Feature extraction from 30 recordings by claire#0 ===> 5.87263 sec
3/107: Feature extraction from 30 recordings by esther#0 ===> 5.47457 sec
4/107: Feature extraction from 30 recordings by jerry#1 ===> 5.5619 sec
5/107: Feature extraction from 30 recordings by joe#0 ===> 5.92628 sec
6/107: Feature extraction from 30 recordings by stanley#1 ===> 6.03701 sec
7/107: Feature extraction from 30 recordings by warren#1 ===> 6.22421 sec
8/107: Feature extraction from 30 recordings by 丁凱元#1 ===> 5.92928 sec
9/107: Feature extraction from 30 recordings by 任佳王民#1 ===> 5.66416 sec
10/107: Feature extraction from 30 recordings by 劉俊宏#1 ===> 5.80716 sec
11/107: Feature extraction from 30 recordings by 劉怡芬#0 ===> 6.09532 sec
12/107: Feature extraction from 30 recordings by 卓楷斌#1 ===> 6.23129 sec
13/107: Feature extraction from 30 recordings by 吳亮辰#1 ===> 6.26985 sec
14/107: Feature extraction from 30 recordings by 吳俊慶#1 ===> 6.17143 sec
15/107: Feature extraction from 30 recordings by 吳偉廷#1 ===> 5.90378 sec
16/107: Feature extraction from 30 recordings by 吳明儒#1 ===> 6.0619 sec
17/107: Feature extraction from 30 recordings by 吳福海#1 ===> 5.7969 sec
18/107: Feature extraction from 30 recordings by 周哲民#1 ===> 5.87979 sec
19/107: Feature extraction from 30 recordings by 周哲玄#1 ===> 5.85658 sec
20/107: Feature extraction from 30 recordings by 周進財#1 ===> 5.45193 sec
21/107: Feature extraction from 30 recordings by 周進雄#1 ===> 5.48451 sec
22/107: Feature extraction from 30 recordings by 唐若華#0 ===> 5.72251 sec
23/107: Feature extraction from 30 recordings by 姜折予#1 ===> 5.69752 sec
24/107: Feature extraction from 30 recordings by 廖育志#1 ===> 6.30388 sec
25/107: Feature extraction from 30 recordings by 廖韋嵐#0 ===> 6.47847 sec
26/107: Feature extraction from 30 recordings by 張智星#1 ===> 6.10723 sec
27/107: Feature extraction from 30 recordings by 張雅雯#0 ===> 5.74913 sec
28/107: Feature extraction from 30 recordings by 彭郁雅#0 ===> 6.05525 sec
29/107: Feature extraction from 30 recordings by 徐偉智#1 ===> 6.54502 sec
30/107: Feature extraction from 30 recordings by 徐君潔#0 ===> 6.05383 sec
31/107: Feature extraction from 30 recordings by 徐培霖#1 ===> 6.08081 sec
32/107: Feature extraction from 30 recordings by 徐懿荷#0 ===> 6.10894 sec
33/107: Feature extraction from 30 recordings by 徐韻媜#0 ===> 5.40952 sec
34/107: Feature extraction from 30 recordings by 方一帆#1 ===> 5.88336 sec
35/107: Feature extraction from 30 recordings by 曾泓熹#1 ===> 6.5154 sec
36/107: Feature extraction from 30 recordings by 朋瑞雲#0 ===> 5.8184 sec
37/107: Feature extraction from 30 recordings by 李函軒#1 ===> 6.26798 sec
38/107: Feature extraction from 30 recordings by 李哲維#1 ===> 5.77028 sec
39/107: Feature extraction from 30 recordings by 李宗奇#1 ===> 5.78297 sec
40/107: Feature extraction from 30 recordings by 李怡欣#0 ===> 5.65346 sec
41/107: Feature extraction from 30 recordings by 李芝宇#0 ===> 6.09142 sec
42/107: Feature extraction from 30 recordings by 李藺芳#0 ===> 6.06205 sec
43/107: Feature extraction from 30 recordings by 杜承恩#1 ===> 6.11568 sec
44/107: Feature extraction from 30 recordings by 林佳廷#1 ===> 6.14981 sec
45/107: Feature extraction from 30 recordings by 林志翰#1 ===> 6.45175 sec
46/107: Feature extraction from 30 recordings by 林應耀#1 ===> 12.236 sec
47/107: Feature extraction from 30 recordings by 林昱豪#1 ===> 6.26051 sec
48/107: Feature extraction from 30 recordings by 林琪家#1 ===> 5.85881 sec
49/107: Feature extraction from 30 recordings by 林立緯#1 ===> 6.17326 sec
50/107: Feature extraction from 30 recordings by 林美慧#0 ===> 6.2486 sec
51/107: Feature extraction from 30 recordings by 梁啟輝#1 ===> 5.83131 sec
52/107: Feature extraction from 30 recordings by 楊子睿#1 ===> 5.67739 sec
53/107: Feature extraction from 30 recordings by 楊宗樺#1 ===> 5.8882 sec
54/107: Feature extraction from 30 recordings by 楊惠敏#0 ===> 6.10571 sec
55/107: Feature extraction from 30 recordings by 楊振緯#1 ===> 6.02235 sec
56/107: Feature extraction from 30 recordings by 江育儒#1 ===> 5.97456 sec
57/107: Feature extraction from 30 recordings by 汪世婕#0 ===> 5.54479 sec
58/107: Feature extraction from 30 recordings by 汪緒中#1 ===> 6.42422 sec
59/107: Feature extraction from 30 recordings by 游鎮洋#1 ===> 5.77039 sec
60/107: Feature extraction from 30 recordings by 王俊凱#1 ===> 5.8785 sec
61/107: Feature extraction from 30 recordings by 王小龜#1 ===> 6.03116 sec
62/107: Feature extraction from 30 recordings by 王怡萱#0 ===> 5.88747 sec
63/107: Feature extraction from 30 recordings by 王瑩#0 ===> 5.93605 sec
64/107: Feature extraction from 30 recordings by 王美玲#0 ===> 5.78493 sec
65/107: Feature extraction from 30 recordings by 白宗儒#1 ===> 6.13492 sec
66/107: Feature extraction from 30 recordings by 簡嘉宏#1 ===> 5.69288 sec
67/107: Feature extraction from 30 recordings by 簡祐祥#1 ===> 5.76921 sec
68/107: Feature extraction from 30 recordings by 羅尹聰#1 ===> 6.1957 sec
69/107: Feature extraction from 30 recordings by 胡任桓#1 ===> 6.21709 sec
70/107: Feature extraction from 30 recordings by 葉子雋#1 ===> 5.90803 sec
71/107: Feature extraction from 30 recordings by 董姵汝#0 ===> 5.85089 sec
72/107: Feature extraction from 30 recordings by 蔡佩京#0 ===> 5.41845 sec
73/107: Feature extraction from 30 recordings by 蔡耀陞#1 ===> 6.02589 sec
74/107: Feature extraction from 30 recordings by 薛光利#0 ===> 5.97864 sec
75/107: Feature extraction from 30 recordings by 蘇雅雯#0 ===> 5.46662 sec
76/107: Feature extraction from 30 recordings by 衛帝安#1 ===> 6.51813 sec
77/107: Feature extraction from 30 recordings by 許凱華#1 ===> 5.6176 sec
78/107: Feature extraction from 30 recordings by 許書豪#1 ===> 5.91759 sec
79/107: Feature extraction from 30 recordings by 謝僑威#1 ===> 5.81973 sec
80/107: Feature extraction from 30 recordings by 賴俊龍#1 ===> 6.38432 sec
81/107: Feature extraction from 30 recordings by 賴郡曄#1 ===> 5.73334 sec
82/107: Feature extraction from 30 recordings by 邱莉婷#0 ===> 6.57198 sec
83/107: Feature extraction from 30 recordings by 郭哲綸#1 ===> 6.85801 sec
84/107: Feature extraction from 30 recordings by 郭湧鈐#1 ===> 6.11993 sec
85/107: Feature extraction from 30 recordings by 鄭宇志#1 ===> 6.66321 sec
86/107: Feature extraction from 30 recordings by 鄭淑惠#0 ===> 5.46216 sec
87/107: Feature extraction from 30 recordings by 鄭鈞蔚#1 ===> 5.91633 sec
88/107: Feature extraction from 30 recordings by 陳亦敦#1 ===> 6.41686 sec
89/107: Feature extraction from 30 recordings by 陳亮宇#1 ===> 6.21827 sec
90/107: Feature extraction from 30 recordings by 陳俊達#1 ===> 6.19736 sec
91/107: Feature extraction from 30 recordings by 陳偉豪#1 ===> 5.68705 sec
92/107: Feature extraction from 30 recordings by 陳冠宇#1 ===> 6.2825 sec
93/107: Feature extraction from 30 recordings by 陳威翰#1 ===> 6.11333 sec
94/107: Feature extraction from 30 recordings by 陳宏瑞#1 ===> 6.06913 sec
95/107: Feature extraction from 30 recordings by 陳揚昇#1 ===> 6.50936 sec
96/107: Feature extraction from 30 recordings by 陳易正#1 ===> 6.36014 sec
97/107: Feature extraction from 30 recordings by 陳朝煒#1 ===> 5.99008 sec
98/107: Feature extraction from 30 recordings by 陳杰興#1 ===> 6.4591 sec
99/107: Feature extraction from 30 recordings by 陳永強#1 ===> 5.93751 sec
100/107: Feature extraction from 30 recordings by 高佳慧#0 ===> 6.16572 sec
101/107: Feature extraction from 30 recordings by 魏宇晨#0 ===> 5.87014 sec
102/107: Feature extraction from 30 recordings by 黃昌傑#1 ===> 6.28002 sec
103/107: Feature extraction from 30 recordings by 黃永漢#1 ===> 5.8356 sec
104/107: Feature extraction from 30 recordings by 黃秀惠#0 ===> 5.65923 sec
105/107: Feature extraction from 30 recordings by 黃羿銘#1 ===> 6.36822 sec
106/107: Feature extraction from 30 recordings by 黃韋中#1 ===> 6.11534 sec
107/107: Feature extraction from 30 recordings by 龍慧容#0 ===> 5.65124 sec
Get wave info of 98 persons from auDir02=d:\dataSet\mir-2010-speakerId_label\session02
1/98: Feature extraction from 30 recordings by Kannan#1 ===> 6.00379 sec
2/98: Feature extraction from 30 recordings by 丁凱元#1 ===> 5.76945 sec
3/98: Feature extraction from 30 recordings by 任佳王民#1 ===> 5.71456 sec
4/98: Feature extraction from 30 recordings by 劉俊宏#1 ===> 5.88281 sec
5/98: Feature extraction from 30 recordings by 劉怡芬#0 ===> 6.11281 sec
6/98: Feature extraction from 30 recordings by 卓楷斌#1 ===> 6.12098 sec
7/98: Feature extraction from 30 recordings by 吳亮辰#1 ===> 6.37207 sec
8/98: Feature extraction from 30 recordings by 吳俊慶#1 ===> 6.33073 sec
9/98: Feature extraction from 30 recordings by 吳偉廷#1 ===> 5.80994 sec
10/98: Feature extraction from 30 recordings by 吳明儒#1 ===> 6.16373 sec
11/98: Feature extraction from 30 recordings by 周哲民#1 ===> 5.9398 sec
12/98: Feature extraction from 30 recordings by 周哲玄#1 ===> 5.84105 sec
13/98: Feature extraction from 30 recordings by 周進財#1 ===> 5.44694 sec
14/98: Feature extraction from 30 recordings by 周進雄#1 ===> 5.71308 sec
15/98: Feature extraction from 30 recordings by 唐若華#0 ===> 5.57059 sec
16/98: Feature extraction from 30 recordings by 廖育志#1 ===> 6.29964 sec
17/98: Feature extraction from 30 recordings by 廖韋嵐#0 ===> 6.10599 sec
18/98: Feature extraction from 30 recordings by 張智星#1 ===> 6.02728 sec
19/98: Feature extraction from 30 recordings by 彭郁雅#0 ===> 5.67838 sec
20/98: Feature extraction from 30 recordings by 徐偉智#1 ===> 6.64439 sec
21/98: Feature extraction from 30 recordings by 徐君潔#0 ===> 5.98807 sec
22/98: Feature extraction from 30 recordings by 徐培霖#1 ===> 6.15709 sec
23/98: Feature extraction from 30 recordings by 徐懿荷#0 ===> 6.05672 sec
24/98: Feature extraction from 30 recordings by 徐韻媜#0 ===> 5.73458 sec
25/98: Feature extraction from 30 recordings by 方一帆#1 ===> 5.84109 sec
26/98: Feature extraction from 30 recordings by 曾泓熹#1 ===> 6.37955 sec
27/98: Feature extraction from 30 recordings by 朋瑞雲#0 ===> 5.62741 sec
28/98: Feature extraction from 30 recordings by 李函軒#1 ===> 6.04633 sec
29/98: Feature extraction from 30 recordings by 李哲維#1 ===> 5.88382 sec
30/98: Feature extraction from 30 recordings by 李宗奇#1 ===> 5.74812 sec
31/98: Feature extraction from 30 recordings by 李怡欣#0 ===> 5.60369 sec
32/98: Feature extraction from 30 recordings by 李芝宇#0 ===> 6.05522 sec
33/98: Feature extraction from 30 recordings by 李藺芳#0 ===> 6.16137 sec
34/98: Feature extraction from 30 recordings by 杜承恩#1 ===> 6.25205 sec
35/98: Feature extraction from 30 recordings by 林佳廷#1 ===> 5.98321 sec
36/98: Feature extraction from 30 recordings by 林志翰#1 ===> 6.42471 sec
37/98: Feature extraction from 30 recordings by 林應耀#1 ===> 6.31524 sec
38/98: Feature extraction from 30 recordings by 林昱豪#1 ===> 5.77862 sec
39/98: Feature extraction from 30 recordings by 林琪家#1 ===> 5.61936 sec
40/98: Feature extraction from 30 recordings by 林立緯#1 ===> 6.38166 sec
41/98: Feature extraction from 30 recordings by 林美慧#0 ===> 6.01842 sec
42/98: Feature extraction from 30 recordings by 梁啟輝#1 ===> 5.68654 sec
43/98: Feature extraction from 30 recordings by 楊子睿#1 ===> 5.63136 sec
44/98: Feature extraction from 30 recordings by 楊宗樺#1 ===> 5.70206 sec
45/98: Feature extraction from 30 recordings by 楊惠敏#0 ===> 6.01246 sec
46/98: Feature extraction from 30 recordings by 楊振緯#1 ===> 5.70081 sec
47/98: Feature extraction from 30 recordings by 江育儒#1 ===> 5.84538 sec
48/98: Feature extraction from 30 recordings by 汪世婕#0 ===> 5.56888 sec
49/98: Feature extraction from 30 recordings by 汪緒中#1 ===> 6.03056 sec
50/98: Feature extraction from 30 recordings by 游鎮洋#1 ===> 5.86925 sec
51/98: Feature extraction from 30 recordings by 王俊凱#1 ===> 5.83424 sec
52/98: Feature extraction from 30 recordings by 王小龜#1 ===> 6.13621 sec
53/98: Feature extraction from 30 recordings by 王怡萱#0 ===> 5.85586 sec
54/98: Feature extraction from 30 recordings by 王瑩#0 ===> 5.37184 sec
55/98: Feature extraction from 30 recordings by 王美玲#0 ===> 5.84703 sec
56/98: Feature extraction from 30 recordings by 白宗儒#1 ===> 6.08596 sec
57/98: Feature extraction from 30 recordings by 簡嘉宏#1 ===> 5.71412 sec
58/98: Feature extraction from 30 recordings by 簡祐祥#1 ===> 5.88589 sec
59/98: Feature extraction from 30 recordings by 羅尹聰#1 ===> 6.16978 sec
60/98: Feature extraction from 30 recordings by 胡任桓#1 ===> 6.23389 sec
61/98: Feature extraction from 30 recordings by 葉子雋#1 ===> 5.92687 sec
62/98: Feature extraction from 30 recordings by 董姵汝#0 ===> 5.73743 sec
63/98: Feature extraction from 30 recordings by 蔡佩京#0 ===> 5.43256 sec
64/98: Feature extraction from 30 recordings by 蔡耀陞#1 ===> 5.70221 sec
65/98: Feature extraction from 30 recordings by 薛光利#0 ===> 5.78549 sec
66/98: Feature extraction from 30 recordings by 蘇雅雯#0 ===> 5.41802 sec
67/98: Feature extraction from 30 recordings by 衛帝安#1 ===> 6.48666 sec
68/98: Feature extraction from 30 recordings by 許凱華#1 ===> 5.58788 sec
69/98: Feature extraction from 30 recordings by 許書豪#1 ===> 5.88817 sec
70/98: Feature extraction from 30 recordings by 謝僑威#1 ===> 5.80167 sec
71/98: Feature extraction from 30 recordings by 賴俊龍#1 ===> 6.25938 sec
72/98: Feature extraction from 30 recordings by 賴郡曄#1 ===> 5.85933 sec
73/98: Feature extraction from 30 recordings by 邱莉婷#0 ===> 6.52926 sec
74/98: Feature extraction from 30 recordings by 郭哲綸#1 ===> 6.5783 sec
75/98: Feature extraction from 30 recordings by 郭湧鈐#1 ===> 5.96002 sec
76/98: Feature extraction from 30 recordings by 鄭宇志#1 ===> 6.76243 sec
77/98: Feature extraction from 30 recordings by 鄭鈞蔚#1 ===> 5.69328 sec
78/98: Feature extraction from 30 recordings by 陳亦敦#1 ===> 6.04649 sec
79/98: Feature extraction from 30 recordings by 陳亮宇#1 ===> 6.0848 sec
80/98: Feature extraction from 30 recordings by 陳俊達#1 ===> 6.0293 sec
81/98: Feature extraction from 30 recordings by 陳偉豪#1 ===> 5.47703 sec
82/98: Feature extraction from 30 recordings by 陳冠宇#1 ===> 6.44892 sec
83/98: Feature extraction from 30 recordings by 陳奕敦#1 ===> 5.99439 sec
84/98: Feature extraction from 30 recordings by 陳威翰#1 ===> 6.04493 sec
85/98: Feature extraction from 30 recordings by 陳宏瑞#1 ===> 5.8045 sec
86/98: Feature extraction from 30 recordings by 陳揚昇#1 ===> 6.36775 sec
87/98: Feature extraction from 30 recordings by 陳易正#1 ===> 6.26893 sec
88/98: Feature extraction from 30 recordings by 陳朝煒#1 ===> 5.77438 sec
89/98: Feature extraction from 30 recordings by 陳杰興#1 ===> 6.40115 sec
90/98: Feature extraction from 30 recordings by 陳永強#1 ===> 5.76188 sec
91/98: Feature extraction from 30 recordings by 高佳慧#0 ===> 6.08165 sec
92/98: Feature extraction from 30 recordings by 魏宇晨#0 ===> 6.00005 sec
93/98: Feature extraction from 30 recordings by 黃昌傑#1 ===> 6.36657 sec
94/98: Feature extraction from 30 recordings by 黃永漢#1 ===> 5.58237 sec
95/98: Feature extraction from 30 recordings by 黃秀惠#0 ===> 5.49566 sec
96/98: Feature extraction from 30 recordings by 黃羿銘#1 ===> 6.3831 sec
97/98: Feature extraction from 30 recordings by 黃韋中#1 ===> 6.26679 sec
98/98: Feature extraction from 30 recordings by 龍慧容#0 ===> 5.69149 sec
Saving output/mir-2010-speakerId_label/speakerSet.mat...
sidFeaExtract ===> 1234.64 seconds
Elapsed time = 6.63111 sec

Note that the extracted features are saved to a MAT file (speakerSet.mat under the output directory) for future use.
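
Since the features are cached, a typical pattern is to load the MAT file when it exists and fall back to extraction otherwise. The sketch below assumes the cached file restores the same two variables returned by sidFeaExtract; adjust the names if your copy of the toolbox saves them differently.

matFile=fullfile(sidOpt.outputDir, 'speakerSet.mat');
if exist(matFile, 'file')
	load(matFile);                                 % assumed to restore speakerSet1 and speakerSet2
else
	[speakerSet1, speakerSet2]=sidFeaExtract(sidOpt);
end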

To evaluate the performance, we can invoke "sidPerfEval":

tic
[overallRr, speakerSet1, time]=sidPerfEval(speakerSet1, speakerSet2, sidOpt, 1);
fprintf('Elapsed time = %g sec\n', toc);
1/97: speaker=Kannan#1
	RR for Kannan#1 = 100.00%, ave. time = 0.10 sec
2/97: speaker=丁凱元#1
	RR for 丁凱元#1 = 100.00%, ave. time = 0.09 sec
3/97: speaker=任佳王民#1
	RR for 任佳王民#1 = 83.33%, ave. time = 0.08 sec
4/97: speaker=劉俊宏#1
	RR for 劉俊宏#1 = 100.00%, ave. time = 0.09 sec
5/97: speaker=劉怡芬#0
	RR for 劉怡芬#0 = 100.00%, ave. time = 0.12 sec
6/97: speaker=卓楷斌#1
	RR for 卓楷斌#1 = 100.00%, ave. time = 0.11 sec
7/97: speaker=吳亮辰#1
	RR for 吳亮辰#1 = 100.00%, ave. time = 0.12 sec
8/97: speaker=吳俊慶#1
	RR for 吳俊慶#1 = 100.00%, ave. time = 0.12 sec
9/97: speaker=吳偉廷#1
	RR for 吳偉廷#1 = 100.00%, ave. time = 0.10 sec
10/97: speaker=吳明儒#1
	RR for 吳明儒#1 = 93.33%, ave. time = 0.11 sec
11/97: speaker=周哲民#1
	RR for 周哲民#1 = 96.67%, ave. time = 0.10 sec
12/97: speaker=周哲玄#1
	RR for 周哲玄#1 = 76.67%, ave. time = 0.10 sec
13/97: speaker=周進財#1
	RR for 周進財#1 = 100.00%, ave. time = 0.08 sec
14/97: speaker=周進雄#1
	RR for 周進雄#1 = 100.00%, ave. time = 0.07 sec
15/97: speaker=唐若華#0
	RR for 唐若華#0 = 100.00%, ave. time = 0.10 sec
16/97: speaker=廖育志#1
	RR for 廖育志#1 = 100.00%, ave. time = 0.12 sec
17/97: speaker=廖韋嵐#0
	RR for 廖韋嵐#0 = 100.00%, ave. time = 0.12 sec
18/97: speaker=張智星#1
	RR for 張智星#1 = 100.00%, ave. time = 0.10 sec
19/97: speaker=彭郁雅#0
	RR for 彭郁雅#0 = 86.67%, ave. time = 0.11 sec
20/97: speaker=徐偉智#1
	RR for 徐偉智#1 = 100.00%, ave. time = 0.12 sec
21/97: speaker=徐君潔#0
	RR for 徐君潔#0 = 100.00%, ave. time = 0.12 sec
22/97: speaker=徐培霖#1
	RR for 徐培霖#1 = 100.00%, ave. time = 0.11 sec
23/97: speaker=徐懿荷#0
	RR for 徐懿荷#0 = 93.33%, ave. time = 0.11 sec
24/97: speaker=徐韻媜#0
	RR for 徐韻媜#0 = 96.67%, ave. time = 0.08 sec
25/97: speaker=方一帆#1
	RR for 方一帆#1 = 100.00%, ave. time = 0.09 sec
26/97: speaker=曾泓熹#1
	RR for 曾泓熹#1 = 93.33%, ave. time = 0.12 sec
27/97: speaker=朋瑞雲#0
	RR for 朋瑞雲#0 = 100.00%, ave. time = 0.10 sec
28/97: speaker=李函軒#1
	RR for 李函軒#1 = 100.00%, ave. time = 0.12 sec
29/97: speaker=李哲維#1
	RR for 李哲維#1 = 100.00%, ave. time = 0.10 sec
30/97: speaker=李宗奇#1
	RR for 李宗奇#1 = 100.00%, ave. time = 0.10 sec
31/97: speaker=李怡欣#0
	RR for 李怡欣#0 = 86.67%, ave. time = 0.10 sec
32/97: speaker=李芝宇#0
	RR for 李芝宇#0 = 100.00%, ave. time = 0.12 sec
33/97: speaker=李藺芳#0
	RR for 李藺芳#0 = 100.00%, ave. time = 0.12 sec
34/97: speaker=杜承恩#1
	RR for 杜承恩#1 = 100.00%, ave. time = 0.11 sec
35/97: speaker=林佳廷#1
	RR for 林佳廷#1 = 96.67%, ave. time = 0.11 sec
36/97: speaker=林志翰#1
	RR for 林志翰#1 = 96.67%, ave. time = 0.12 sec
37/97: speaker=林應耀#1
	RR for 林應耀#1 = 100.00%, ave. time = 0.12 sec
38/97: speaker=林昱豪#1
	RR for 林昱豪#1 = 90.00%, ave. time = 0.09 sec
39/97: speaker=林琪家#1
	RR for 林琪家#1 = 86.67%, ave. time = 0.08 sec
40/97: speaker=林立緯#1
	RR for 林立緯#1 = 83.33%, ave. time = 0.11 sec
41/97: speaker=林美慧#0
	RR for 林美慧#0 = 100.00%, ave. time = 0.12 sec
42/97: speaker=梁啟輝#1
	RR for 梁啟輝#1 = 100.00%, ave. time = 0.10 sec
43/97: speaker=楊子睿#1
	RR for 楊子睿#1 = 96.67%, ave. time = 0.08 sec
44/97: speaker=楊宗樺#1
	RR for 楊宗樺#1 = 93.33%, ave. time = 0.09 sec
45/97: speaker=楊惠敏#0
	RR for 楊惠敏#0 = 100.00%, ave. time = 0.12 sec
46/97: speaker=楊振緯#1
	RR for 楊振緯#1 = 80.00%, ave. time = 0.10 sec
47/97: speaker=江育儒#1
	RR for 江育儒#1 = 93.33%, ave. time = 0.10 sec
48/97: speaker=汪世婕#0
	RR for 汪世婕#0 = 96.67%, ave. time = 0.08 sec
49/97: speaker=汪緒中#1
	RR for 汪緒中#1 = 100.00%, ave. time = 0.12 sec
50/97: speaker=游鎮洋#1
	RR for 游鎮洋#1 = 100.00%, ave. time = 0.10 sec
51/97: speaker=王俊凱#1
	RR for 王俊凱#1 = 100.00%, ave. time = 0.10 sec
52/97: speaker=王小龜#1
	RR for 王小龜#1 = 96.67%, ave. time = 0.11 sec
53/97: speaker=王怡萱#0
	RR for 王怡萱#0 = 96.67%, ave. time = 0.11 sec
54/97: speaker=王瑩#0
	RR for 王瑩#0 = 73.33%, ave. time = 0.10 sec
55/97: speaker=王美玲#0
	RR for 王美玲#0 = 100.00%, ave. time = 0.11 sec
56/97: speaker=白宗儒#1
	RR for 白宗儒#1 = 100.00%, ave. time = 0.11 sec
57/97: speaker=簡嘉宏#1
	RR for 簡嘉宏#1 = 100.00%, ave. time = 0.10 sec
58/97: speaker=簡祐祥#1
	RR for 簡祐祥#1 = 43.33%, ave. time = 0.08 sec
59/97: speaker=羅尹聰#1
	RR for 羅尹聰#1 = 100.00%, ave. time = 0.11 sec
60/97: speaker=胡任桓#1
	RR for 胡任桓#1 = 100.00%, ave. time = 0.11 sec
61/97: speaker=葉子雋#1
	RR for 葉子雋#1 = 100.00%, ave. time = 0.10 sec
62/97: speaker=董姵汝#0
	RR for 董姵汝#0 = 100.00%, ave. time = 0.10 sec
63/97: speaker=蔡佩京#0
	RR for 蔡佩京#0 = 100.00%, ave. time = 0.08 sec
64/97: speaker=蔡耀陞#1
	RR for 蔡耀陞#1 = 100.00%, ave. time = 0.11 sec
65/97: speaker=薛光利#0
	RR for 薛光利#0 = 100.00%, ave. time = 0.11 sec
66/97: speaker=蘇雅雯#0
	RR for 蘇雅雯#0 = 90.00%, ave. time = 0.08 sec
67/97: speaker=衛帝安#1
	RR for 衛帝安#1 = 96.67%, ave. time = 0.12 sec
68/97: speaker=許凱華#1
	RR for 許凱華#1 = 100.00%, ave. time = 0.09 sec
69/97: speaker=許書豪#1
	RR for 許書豪#1 = 100.00%, ave. time = 0.11 sec
70/97: speaker=謝僑威#1
	RR for 謝僑威#1 = 96.67%, ave. time = 0.10 sec
71/97: speaker=賴俊龍#1
	RR for 賴俊龍#1 = 100.00%, ave. time = 0.12 sec
72/97: speaker=賴郡曄#1
	RR for 賴郡曄#1 = 83.33%, ave. time = 0.08 sec
73/97: speaker=邱莉婷#0
	RR for 邱莉婷#0 = 93.33%, ave. time = 0.12 sec
74/97: speaker=郭哲綸#1
	RR for 郭哲綸#1 = 100.00%, ave. time = 0.12 sec
75/97: speaker=郭湧鈐#1
	RR for 郭湧鈐#1 = 100.00%, ave. time = 0.10 sec
76/97: speaker=鄭宇志#1
	RR for 鄭宇志#1 = 100.00%, ave. time = 0.12 sec
77/97: speaker=鄭鈞蔚#1
	RR for 鄭鈞蔚#1 = 100.00%, ave. time = 0.08 sec
78/97: speaker=陳亦敦#1
	RR for 陳亦敦#1 = 100.00%, ave. time = 0.12 sec
79/97: speaker=陳亮宇#1
	RR for 陳亮宇#1 = 100.00%, ave. time = 0.11 sec
80/97: speaker=陳俊達#1
	RR for 陳俊達#1 = 100.00%, ave. time = 0.11 sec
81/97: speaker=陳偉豪#1
	RR for 陳偉豪#1 = 96.67%, ave. time = 0.08 sec
82/97: speaker=陳冠宇#1
	RR for 陳冠宇#1 = 100.00%, ave. time = 0.12 sec
83/97: speaker=陳威翰#1
	RR for 陳威翰#1 = 100.00%, ave. time = 0.11 sec
84/97: speaker=陳宏瑞#1
	RR for 陳宏瑞#1 = 96.67%, ave. time = 0.10 sec
85/97: speaker=陳揚昇#1
	RR for 陳揚昇#1 = 100.00%, ave. time = 0.12 sec
86/97: speaker=陳易正#1
	RR for 陳易正#1 = 96.67%, ave. time = 0.12 sec
87/97: speaker=陳朝煒#1
	RR for 陳朝煒#1 = 100.00%, ave. time = 0.09 sec
88/97: speaker=陳杰興#1
	RR for 陳杰興#1 = 96.67%, ave. time = 0.11 sec
89/97: speaker=陳永強#1
	RR for 陳永強#1 = 96.67%, ave. time = 0.10 sec
90/97: speaker=高佳慧#0
	RR for 高佳慧#0 = 100.00%, ave. time = 0.12 sec
91/97: speaker=魏宇晨#0
	RR for 魏宇晨#0 = 100.00%, ave. time = 0.11 sec
92/97: speaker=黃昌傑#1
	RR for 黃昌傑#1 = 100.00%, ave. time = 0.11 sec
93/97: speaker=黃永漢#1
	RR for 黃永漢#1 = 100.00%, ave. time = 0.08 sec
94/97: speaker=黃秀惠#0
	RR for 黃秀惠#0 = 86.67%, ave. time = 0.08 sec
95/97: speaker=黃羿銘#1
	RR for 黃羿銘#1 = 100.00%, ave. time = 0.12 sec
96/97: speaker=黃韋中#1
	RR for 黃韋中#1 = 83.33%, ave. time = 0.11 sec
97/97: speaker=龍慧容#0
	RR for 龍慧容#0 = 90.00%, ave. time = 0.10 sec
Overall RR = 96.22%
Elapsed time = 303.726 sec
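
The overall recognition rate can also be cross-checked from the per-sentence results, assuming sidPerfEval fills in a per-sentence "correct" flag (this flag is used in the post analysis below); the printed value should match the overall RR reported above.

sentence=[speakerSet1.sentence];                   % per-sentence results returned by sidPerfEval
fprintf('Overall RR = %.2f%%\n', 100*sum([sentence.correct])/length(sentence));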

Post analysis

We can display scatter plots of several per-sentence features to visualize whether the "bad" (misidentified) utterances can be separated from the "good" (correctly identified) ones:

sentence=[speakerSet1.sentence];
DS.input=[[sentence.meanVolume]; [sentence.meanClarity]; [sentence.medianPitch]; [sentence.minDistance]; [sentence.frameNum]];
DS.inputName={'meanVolume', 'meanClarity', 'medianPitch', 'minDistance', 'frameNum'};
DS.input=inputNormalize(DS.input);
DS.output=2-[sentence.correct];
dsProjPlot2(DS); figEnlarge;
print(fullfile(sidOpt.outputDir, 'dtwDataDistribution'), '-dpng');
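
If the machineLearning toolbox (which provides dsProjPlot2) is not available, a single projection plane of the same dataset can be drawn with plain MATLAB; the two feature indices below (minDistance vs. frameNum) are chosen arbitrarily for illustration.

goodIdx=(DS.output==1);                            % correctly identified utterances
badIdx=(DS.output==2);                             % misidentified utterances
figure;
plot(DS.input(4, goodIdx), DS.input(5, goodIdx), 'b.', DS.input(4, badIdx), DS.input(5, badIdx), 'rx');
xlabel(DS.inputName{4}); ylabel(DS.inputName{5});
legend('correct', 'incorrect'); grid on;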

Summary

This is a brief tutorial on text-dependent speaker recognition based on MFCC features and DTW matching. There are several directions for further improvement.

Appendix

List of functions used in this script

Overall elapsed time:

toc(scriptStartTime)
Elapsed time is 1542.796783 seconds.

Jyh-Shing Roger Jang, created on

datetime
ans = 

   24-May-2017 23:06:52

If you are interested in the original MATLAB code for this page, you can type "grabcode(URL)" under MATLAB, where URL is the web address of this page.