In the previous section, we demonstrated how to use DTW for text-dependent speaker identification, where the distance between two utterances is the DTW distance between their MFCC sequences. The accuracy can be improved further by combining several distance measures: instead of relying on a single distance, we let a classifier decide, based on all of these measures, whether two utterances match. This section demonstrates the approach with a conceptual sketch below, followed by the step-by-step procedure.
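The sketch below illustrates the idea for a single pair of utterances, assuming testFea and refFea are MFCC matrices with one column per frame (the fea fields produced in Example 1); the three distance functions are the same ones used in Example 2:
distVec=zeros(3, 1);
distVec(1)=dtw1(testFea, refFea, 1, 1);        % Type-1 DTW distance
distVec(2)=dtw2(testFea, refFea, 1, 1);        % Type-2 DTW distance
distVec(3)=distLinScaling(testFea, refFea);    % Distance via linear scaling
% distVec becomes one column of DS.input; the corresponding DS.output entry
% is 2 if the two utterances have the same text, and 1 otherwise.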
To read the wave files from the data folders and extract their features, try:
Example 1: speakerIdTextDependent02/goFeaExtract.m % Feature extraction
sidPrm=sidPrmSet;
% ====== Read session 1
speakerData1=speakerDataRead(sidPrm.waveDir01);
fprintf('Get wave info of %d persons from %s\n', length(speakerData1), sidPrm.waveDir01);
speakerData1=speakerDataAddFea(speakerData1, sidPrm); % Add features to speakerData1
% ====== Read session 2
speakerData2=speakerDataRead(sidPrm.waveDir02);
fprintf('Get wave info of %d persons from %s\n', length(speakerData2), sidPrm.waveDir02);
speakerData2=speakerDataAddFea(speakerData2, sidPrm); % Add features to speakerData2
fprintf('Save speakerData1 and speakerData2 to speakerData.mat\n');
save speakerData speakerData1 speakerData2
Get wave info of 3 persons from \users\jang\matlab\toolbox\dcpr\dataSet\speakerIdTextDependent\session01
1/3: Feature extraction from 30 recordings by 9761215 ===> 0.335829 sec
2/3: Feature extraction from 30 recordings by 9761217 ===> 0.274542 sec
3/3: Feature extraction from 30 recordings by 9762115 ===> 0.355193 sec
Get wave info of 3 persons from \users\jang\matlab\toolbox\dcpr\dataSet\speakerIdTextDependent\session02
1/3: Feature extraction from 30 recordings by 9761215 ===> 0.362983 sec
2/3: Feature extraction from 30 recordings by 9761217 ===> 0.360011 sec
3/3: Feature extraction from 30 recordings by 9762115 ===> 0.319486 sec
Save speakerData1 and speakerData2 to speakerData.mat
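Before moving on, it may help to verify the contents of speakerData.mat. A minimal check, assuming each sentence carries a fea field holding an MFCC matrix with one column per frame:
load speakerData.mat
fprintf('Speaker: %s\n', speakerData1(1).name);                       % Name of the first speaker in session 1
fprintf('No. of sentences: %d\n', length(speakerData1(1).sentence));
fprintf('Size of MFCC matrix of sentence 1: %s\n', mat2str(size(speakerData1(1).sentence(1).fea)));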
To create the training data and save it as DS.mat, try the next example:
Example 2: speakerIdTextDependent02/goTrainSet.m % Training set creation
load speakerData.mat
sidPrm=sidPrmSet;
sentenceNum=length([speakerData2.sentence]);
for p=1:length(speakerData1)
  for q=1:length(speakerData1(p).sentence)
    speakerData1(p).sentence(q).dsInput=[];
    speakerData1(p).sentence(q).dsOutput=[];
  end
end
DS.input=[];
DS.output=[];
% ====== Speaker ID by dtw1, dtw2, distLinScaling
for i=1:length(speakerData2)
  tInit=clock;
  name=speakerData2(i).name;
  fprintf('%d/%d: speaker=%s\n', i, length(speakerData2), name);
  for j=1:length(speakerData2(i).sentence)
    % fprintf('\tsentence=%d ==> ', j);
    % t0=clock;
    inputSentence=speakerData2(i).sentence(j);
    for p=1:length(speakerData1)
      for q=1:length(speakerData1(p).sentence)
        % === Collect DS.input
        k=size(speakerData1(p).sentence(q).dsInput, 2)+1;
        speakerData1(p).sentence(q).dsInput(1, k)=dtw1(inputSentence.fea, speakerData1(p).sentence(q).fea, 1, 1);
        speakerData1(p).sentence(q).dsInput(2, k)=dtw2(inputSentence.fea, speakerData1(p).sentence(q).fea, 1, 1);
        speakerData1(p).sentence(q).dsInput(3, k)=distLinScaling(inputSentence.fea, speakerData1(p).sentence(q).fea);
        % speakerData1(p).sentence(q).dsInput(4, k)=dtw1(inputSentence.vol, speakerData1(p).sentence(q).vol, 1, 1);
        % speakerData1(p).sentence(q).dsInput(5, k)=dtw2(inputSentence.vol, speakerData1(p).sentence(q).vol, 1, 1);
        % speakerData1(p).sentence(q).dsInput(6, k)=distLinScaling(inputSentence.vol, speakerData1(p).sentence(q).vol);
        % === Collect DS.output
        speakerData1(p).sentence(q).dsOutput(1, k)=1+strcmp(speakerData2(i).sentence(j).text, speakerData1(p).sentence(q).text);
        % fprintf('q=%d, text1=%s, text2=%s, output=%d\n', q, speakerData2(i).sentence(j).text, speakerData1(p).sentence(q).text, speakerData1(p).sentence(q).dsOutput(1, k)); pause
      end
    end
    % fprintf(' Name = %s, ave. time = %.2f sec\n', speakerData2(i).name, etime(clock, t0)/length(speakerData2(i).sentence));
  end
  % speakerData2(i).correct=[speakerData2(i).sentence.correct];
  % speakerData2(i).rr=sum(speakerData2(i).correct)/length(speakerData2(i).correct);
  % fprintf('\tAve. time = %.2f sec\n', etime(clock, tInit)/length(speakerData2(i).sentence));
end
allSentences=[speakerData1.sentence];
DS.input=cat(2, allSentences.dsInput);
DS.output=cat(2, allSentences.dsOutput);
DS.input(DS.input>2e9)=inf;
index1=find(isinf(DS.input(1,:)));
index2=find(isinf(DS.input(2,:)));
index=union(index1, index2);
DS.input(:,index)=[];
DS.output(:, index)=[];
%dsScatterPlot(DS);
%dsClassSize(DS, 1);
fprintf('Saving DS.mat...\n');
save DS DS
return
% ====== Linear classifier
trainPrm=lincTrainPrmSet('method', 'batchLearning', 'animation', 'yes', 'printInterval', 30);
[coef, recogRate]=lincTrain(DS, trainPrm);
fprintf('Recog. rate = %.2f%%\n', 100*recogRate);
% ====== GMM classifier
TS=DS;
DS.input(:, 2:2:end)=[];
DS.output(:,2:2:end)=[];
TS.input(:, 1:2:end)=[];
TS.output(:,1:2:end)=[];
[DS.input, mu, sigma]=inputNormalize(DS.input); % Input normalization for DS
TS.input=inputNormalize(TS.input, mu, sigma); % Input normalization for TS
count1=dsClassSize(DS); count2=dsClassSize(TS);
vecOfGaussianNum=1:min([count1, count2]);
covType=1;
gmmTrainPrm=gmmTrainPrmSet; gmmTrainPrm.plotOpt=1;
[gmmData, recogRate1, recogRate2]=gmmcTrainEvalWrtGaussianNum(DS, TS, vecOfGaussianNum, covType, gmmTrainPrm);
1/3: speaker=9761215
2/3: speaker=9761217
3/3: speaker=9762115
Saving DS.mat...
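Before designing a classifier, it is instructive to inspect the collected training data. The following sketch simply activates the two calls that are commented out in the script above, in order to show the class sizes and a scatter plot of DS:
load DS.mat
dsClassSize(DS, 1);   % Display the size of each class
dsScatterPlot(DS);    % Scatter plot of the collected distance features
% Since most utterance pairs have different texts, class 1 ("different text")
% is usually much larger than class 2 ("same text").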
To design a quadratic classifier for the training set, try the next example, which trains the classifier and saves its parameters to model.mat:
Example 3: speakerIdTextDependent02/goClassifierDesign.m % Classifier design
load DS.mat
prior=dsClassSize(DS); % Use the class size as the class prior probability
[qcPrm, recogRate]=qcTrain(DS, prior);
fprintf('Recognition rate = %f%%\n', recogRate*100);
model=qcPrm;
fprintf('Saving classifier''s parameters...\n');
save model model
Recognition rate = 96.480900%
Saving classifier's parameters...
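Note that the prior passed to qcTrain is just the class-size vector returned by dsClassSize, which reflects the class imbalance of DS. A quick check of the class sizes and the corresponding priors:
load DS.mat
count=dsClassSize(DS);                               % Number of examples in each class
fprintf('Class sizes = %s\n', mat2str(count));
fprintf('Class priors = %s\n', mat2str(count/sum(count), 4));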
To evaluate the identification performance using this additional classifier, try the next example:
Example 4: speakerIdTextDependent02/goPerfEval.m % Performance evaluation
load speakerData.mat
load model.mat
% ====== Speaker ID by DTW
for i=1:length(speakerData2)
  tInit=clock;
  name=speakerData2(i).name;
  fprintf('%d/%d: speaker=%s\n', i, length(speakerData2), name);
  for j=1:length(speakerData2(i).sentence)
    % fprintf('\tsentence=%d ==> ', j);
    % t0=clock;
    inputSentence=speakerData2(i).sentence(j);
    [speakerIndex, sentenceIndex, minDistance]=speakerId(inputSentence, speakerData1, model);
    computedName=speakerData1(speakerIndex).name;
    % fprintf('computedName=%s, time=%.2f sec\n', computedName, etime(clock, t0));
    speakerData2(i).sentence(j).correct=strcmp(name, computedName);
    speakerData2(i).sentence(j).computedSpeakerIndex=speakerIndex;
    speakerData2(i).sentence(j).computedSentenceIndex=sentenceIndex;
    speakerData2(i).sentence(j).computedSentencePath=speakerData1(speakerIndex).sentence(sentenceIndex).path;
  end
  speakerData2(i).correct=[speakerData2(i).sentence.correct];
  speakerData2(i).rr=sum(speakerData2(i).correct)/length(speakerData2(i).correct);
  fprintf('\tRR for %s = %.2f%%, ave. time = %.2f sec\n', name, 100*speakerData2(i).rr, etime(clock, tInit)/length(speakerData2(i).sentence));
end
correct=[speakerData2.correct];
overallRr=sum(correct)/length(correct);
fprintf('Overall RR = %.2f%%\n', 100*overallRr);
fprintf('Save speakerData1 and speakerData2 to speakerData.mat\n');
save speakerData speakerData1 speakerData2
1/3: speaker=9761215
RR for 9761215 = 90.00%, ave. time = 0.13 sec
2/3: speaker=9761217
RR for 9761217 = 90.00%, ave. time = 0.13 sec
3/3: speaker=9762115
RR for 9762115 = 100.00%, ave. time = 0.12 sec
Overall RR = 93.33%
Save speakerData1 and speakerData2 to speakerData.mat
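For a closer look at where errors occur, the fields stored by the script above can be turned into a simple confusion matrix; the following sketch uses only base MATLAB:
load speakerData.mat
speakerNum=length(speakerData1);
confMat=zeros(speakerNum);      % confMat(i,j): utterances of speaker i identified as speaker j
for i=1:length(speakerData2)
  computed=[speakerData2(i).sentence.computedSpeakerIndex];
  for j=1:speakerNum
    confMat(i,j)=sum(computed==j);
  end
end
disp(confMat);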
After obtaining the overall recognition rate, we can compute per-person statistics and list the misclassified utterances together with the utterances they were wrongly matched to:
Example 5: speakerIdTextDependent02/goPostAnalysis.m % Post analysis
sidPrm=sidPrmSet;
load speakerData.mat
correct=[speakerData2.correct];
overallRr=sum(correct)/length(correct);
% ====== Display each person's performance
[junk, index]=sort([speakerData2.rr]);
sortedSpeakerData2=speakerData2(index);
outputFile=sprintf('%s/personRr_rr=%f%%.htm', sidPrm.outputDir, 100*overallRr);
structDispInHtml(sortedSpeakerData2, sprintf('Performance of all persons (Overall RR=%.2f%%)', 100*overallRr), {'name', 'rr'}, [], [], outputFile);
% ====== Display misclassified utterances
sentenceData=[sortedSpeakerData2.sentence];
sentenceDataMisclassified=sentenceData(~[sentenceData.correct]);
outputFile=sprintf('%s/sentenceMisclassified_rr=%f%%.htm', sidPrm.outputDir, 100*overallRr);
structDispInHtml(sentenceDataMisclassified, sprintf('Misclassified Sentences (Overall RR=%.2f%%)', 100*overallRr), {'path', 'computedSentencePath'}, [], [], outputFile);
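If HTML output is not needed, the misclassified utterances can also be listed directly in the command window, again using only the fields stored by Example 4:
load speakerData.mat
sentenceData=[speakerData2.sentence];
misclassified=sentenceData(~[sentenceData.correct]);
for i=1:length(misclassified)
  fprintf('%s ==> %s\n', misclassified(i).path, misclassified(i).computedSentencePath);
end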