8-6 DTW for Speaker Identification: Further Enhancement

In the previous section, we demonstrated how to use DTW for text-dependent speaker identification, where the distance between two utterances is the DTW distance between their MFCC sequences. The accuracy can be improved further by combining several distance measures. In this section, we demonstrate how to feed these distance measures into a classifier to boost the recognition rate.
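
The basic idea is sketched below: each test/reference pair of utterances is described by a vector of distance measures, and a trained classifier converts this vector into a matching score. The sketch assumes testFea and refFea{k} hold MFCC matrices and that classifierScore is a hypothetical stand-in for the trained classifier; dtw1, dtw2, and distLinScaling are the toolbox functions used in the examples below.

    % Conceptual sketch only; not the toolbox implementation
    for k=1:length(refFea)
        x=[dtw1(testFea, refFea{k}, 1, 1); ...
           dtw2(testFea, refFea{k}, 1, 1); ...
           distLinScaling(testFea, refFea{k})];      % combined distance vector for this pair
        score(k)=classifierScore(x);                 % hypothetical classifier output
    end
    [maxScore, best]=max(score);                     % best-matching reference utterance

The procedure consists of the following steps: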

  1. To read the recordings of both sessions and extract their features, try the following example (a per-file illustration of the feature extraction follows the example):

    Example 1: speakerIdTextDependent02/goFeaExtract.m

        % Feature extraction
        sidPrm=sidPrmSet;
        % ====== Read session 1
        speakerData1=speakerDataRead(sidPrm.waveDir01);
        fprintf('Get wave info of %d persons from %s\n', length(speakerData1), sidPrm.waveDir01);
        speakerData1=speakerDataAddFea(speakerData1, sidPrm);    % Add features to speakerData1
        % ====== Read session 2
        speakerData2=speakerDataRead(sidPrm.waveDir02);
        fprintf('Get wave info of %d persons from %s\n', length(speakerData2), sidPrm.waveDir02);
        speakerData2=speakerDataAddFea(speakerData2, sidPrm);    % Add features to speakerData2
        fprintf('Save speakerData1 and speakerData2 to speakerData.mat\n');
        save speakerData speakerData1 speakerData2

    Output:

        Get wave info of 3 persons from \users\jang\matlab\toolbox\dcpr\dataSet\speakerIdTextDependent\session01
        1/3: Feature extraction from 30 recordings by 9761215 ===> 0.335829 sec
        2/3: Feature extraction from 30 recordings by 9761217 ===> 0.274542 sec
        3/3: Feature extraction from 30 recordings by 9762115 ===> 0.355193 sec
        Get wave info of 3 persons from \users\jang\matlab\toolbox\dcpr\dataSet\speakerIdTextDependent\session02
        1/3: Feature extraction from 30 recordings by 9761215 ===> 0.362983 sec
        2/3: Feature extraction from 30 recordings by 9761217 ===> 0.360011 sec
        3/3: Feature extraction from 30 recordings by 9762115 ===> 0.319486 sec
        Save speakerData1 and speakerData2 to speakerData.mat
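
    Here speakerDataAddFea attaches MFCC features to every recording. As a rough per-file illustration (using MATLAB's Audio Toolbox mfcc as a stand-in for the toolbox's own feature extractor, and a hypothetical file name):

        % Illustration only: extract MFCC for one recording
        [y, fs]=audioread('someUtterance.wav');    % hypothetical file name
        fea=mfcc(y, fs)';                          % transpose so each column is one frame's feature vector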

  2. To create the training set and save it to DS.mat, try the next example (a schematic view of the resulting DS follows its output):

    Example 2: speakerIdTextDependent02/goTrainSet.m

        % Performance evaluation
        load speakerData.mat
        sidPrm=sidPrmSet;
        sentenceNum=length([speakerData2.sentence]);
        for p=1:length(speakerData1)
            for q=1:length(speakerData1(p).sentence)
                speakerData1(p).sentence(q).dsInput=[];
                speakerData1(p).sentence(q).dsOutput=[];
            end
        end
        DS.input=[];
        DS.output=[];
        % ====== Speaker ID by dtw1, dtw2, distLinScaling
        for i=1:length(speakerData2)
            tInit=clock;
            name=speakerData2(i).name;
            fprintf('%d/%d: speaker=%s\n', i, length(speakerData2), name);
            for j=1:length(speakerData2(i).sentence)
            %   fprintf('\tsentence=%d ==> ', j);
            %   t0=clock;
                inputSentence=speakerData2(i).sentence(j);
                for p=1:length(speakerData1)
                    for q=1:length(speakerData1(p).sentence)
                        % === Collect DS.input
                        k=size(speakerData1(p).sentence(q).dsInput, 2)+1;
                        speakerData1(p).sentence(q).dsInput(1, k)=dtw1(inputSentence.fea, speakerData1(p).sentence(q).fea, 1, 1);
                        speakerData1(p).sentence(q).dsInput(2, k)=dtw2(inputSentence.fea, speakerData1(p).sentence(q).fea, 1, 1);
                        speakerData1(p).sentence(q).dsInput(3, k)=distLinScaling(inputSentence.fea, speakerData1(p).sentence(q).fea);
                    %   speakerData1(p).sentence(q).dsInput(4, k)=dtw1(inputSentence.vol, speakerData1(p).sentence(q).vol, 1, 1);
                    %   speakerData1(p).sentence(q).dsInput(5, k)=dtw2(inputSentence.vol, speakerData1(p).sentence(q).vol, 1, 1);
                    %   speakerData1(p).sentence(q).dsInput(6, k)=distLinScaling(inputSentence.vol, speakerData1(p).sentence(q).vol);
                        % === Collect DS.output
                        speakerData1(p).sentence(q).dsOutput(1, k)=1+strcmp(speakerData2(i).sentence(j).text, speakerData1(p).sentence(q).text);
                    %   fprintf('q=%d, text1=%s, text2=%s, output=%d\n', q, speakerData2(i).sentence(j).text, speakerData1(p).sentence(q).text, speakerData1(p).sentence(q).dsOutput(1, k)); pause
                    end
                end
            %   fprintf(' Name = %s, ave. time = %.2f sec\n', speakerData2(i).name, etime(clock, t0)/length(speakerData2(i).sentence));
            end
        %   speakerData2(i).correct=[speakerData2(i).sentence.correct];
        %   speakerData2(i).rr=sum(speakerData2(i).correct)/length(speakerData2(i).correct);
        %   fprintf('\tAve. time = %.2f sec\n', etime(clock, tInit)/length(speakerData2(i).sentence));
        end
        allSentences=[speakerData1.sentence];
        DS.input=cat(2, allSentences.dsInput);
        DS.output=cat(2, allSentences.dsOutput);
        DS.input(DS.input>2e9)=inf;
        index1=find(isinf(DS.input(1,:)));
        index2=find(isinf(DS.input(2,:)));
        index=union(index1, index2);
        DS.input(:,index)=[];
        DS.output(:, index)=[];
        %dsScatterPlot(DS);
        %dsClassSize(DS, 1);
        fprintf('Saving DS.mat...\n');
        save DS DS
        return

        % ====== Linear classifier
        trainPrm=lincTrainPrmSet('method', 'batchLearning', 'animation', 'yes', 'printInterval', 30);
        [coef, recogRate]=lincTrain(DS, trainPrm);
        fprintf('Recog. rate = %.2f%%\n', 100*recogRate);
        % ====== GMM classifier
        TS=DS;
        DS.input(:, 2:2:end)=[]; DS.output(:,2:2:end)=[];
        TS.input(:, 1:2:end)=[]; TS.output(:,1:2:end)=[];
        [DS.input, mu, sigma]=inputNormalize(DS.input);      % Input normalization for DS
        TS.input=inputNormalize(TS.input, mu, sigma);        % Input normalization for TS
        count1=dsClassSize(DS);
        count2=dsClassSize(TS);
        vecOfGaussianNum=1:min([count1, count2]);
        covType=1;
        gmmTrainPrm=gmmTrainPrmSet;
        gmmTrainPrm.plotOpt=1;
        [gmmData, recogRate1, recogRate2]=gmmcTrainEvalWrtGaussianNum(DS, TS, vecOfGaussianNum, covType, gmmTrainPrm);

    Output:

        1/3: speaker=9761215
        2/3: speaker=9761217
        3/3: speaker=9762115
        Saving DS.mat...
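
    Each column of DS describes one test/reference utterance pair, and columns containing overflowed (infinite) distances are removed before saving. Schematically (values hypothetical):

        % DS.input(:,k)  = [dtw1 distance; dtw2 distance; linear-scaling distance]
        % DS.output(1,k) = 2 if both utterances carry the same text, 1 otherwise
        %                  (computed as 1+strcmp(text1, text2) in goTrainSet.m above)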

  3. To design a quadratic classifier for the training set, try the next example, which saves the classifier's parameters to model.mat (a self-contained sketch of the underlying Gaussian model follows the example):

    Example 3: speakerIdTextDependent02/goClassifierDesign.m

        % Performance evaluation
        load DS.mat
        prior=dsClassSize(DS);    % Use the class size as the class prior probability
        [qcPrm, recogRate]=qcTrain(DS, prior);
        fprintf('Recognition rate = %f%%\n', recogRate*100);
        model=qcPrm;
        fprintf('Saving classifier''s parameters...\n');
        save model model

    Output:

        Recognition rate = 96.480900%
        Saving classifier's parameters...
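
    The quadratic classifier trained by qcTrain is essentially a Gaussian classifier with one full-covariance Gaussian per class. A minimal, self-contained sketch of this idea on synthetic 3-D "distance vectors" (an illustration only, not the toolbox's implementation):

        % Two synthetic classes of 3-D feature vectors
        data{1}=randn(3,100)+2; data{2}=randn(3,100)-2;
        prior=[0.5 0.5];
        for c=1:2
            mu{c}=mean(data{c}, 2);                  % class mean (3x1)
            sigma{c}=cov(data{c}');                  % full covariance matrix (3x3)
        end
        x=[1; 0.5; 0.8];                             % a new (hypothetical) feature vector
        for c=1:2
            d=x-mu{c};
            logPost(c)=-0.5*d'*(sigma{c}\d)-0.5*log(det(sigma{c}))+log(prior(c));
        end
        [maxLogPost, predictedClass]=max(logPost)    % pick the class with the highest log posterior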

    To evaluate speaker-identification performance with this extra classifier, try the next example; a quick sanity check of the overall rate follows its output:

    Example 4: speakerIdTextDependent02/goPerfEval.m

        % Performance evaluation
        load speakerData.mat
        load model.mat
        % ====== Speaker ID by DTW
        for i=1:length(speakerData2)
            tInit=clock;
            name=speakerData2(i).name;
            fprintf('%d/%d: speaker=%s\n', i, length(speakerData2), name);
            for j=1:length(speakerData2(i).sentence)
            %   fprintf('\tsentence=%d ==> ', j);
            %   t0=clock;
                inputSentence=speakerData2(i).sentence(j);
                [speakerIndex, sentenceIndex, minDistance]=speakerId(inputSentence, speakerData1, model);
                computedName=speakerData1(speakerIndex).name;
            %   fprintf('computedName=%s, time=%.2f sec\n', computedName, etime(clock, t0));
                speakerData2(i).sentence(j).correct=strcmp(name, computedName);
                speakerData2(i).sentence(j).computedSpeakerIndex=speakerIndex;
                speakerData2(i).sentence(j).computedSentenceIndex=sentenceIndex;
                speakerData2(i).sentence(j).computedSentencePath=speakerData1(speakerIndex).sentence(sentenceIndex).path;
            end
            speakerData2(i).correct=[speakerData2(i).sentence.correct];
            speakerData2(i).rr=sum(speakerData2(i).correct)/length(speakerData2(i).correct);
            fprintf('\tRR for %s = %.2f%%, ave. time = %.2f sec\n', name, 100*speakerData2(i).rr, etime(clock, tInit)/length(speakerData2(i).sentence));
        end
        correct=[speakerData2.correct];
        overallRr=sum(correct)/length(correct);
        fprintf('Overall RR = %.2f%%\n', 100*overallRr);
        fprintf('Save speakerData1 and speakerData2 to speakerData.mat\n');
        save speakerData speakerData1 speakerData2

    Output:

        1/3: speaker=9761215
            RR for 9761215 = 90.00%, ave. time = 0.13 sec
        2/3: speaker=9761217
            RR for 9761217 = 90.00%, ave. time = 0.13 sec
        3/3: speaker=9762115
            RR for 9762115 = 100.00%, ave. time = 0.12 sec
        Overall RR = 93.33%
        Save speakerData1 and speakerData2 to speakerData.mat
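
    Since each of the three speakers contributes 30 test utterances from session 2, the overall rate can be verified directly:

        % 27 + 27 + 30 correctly identified utterances out of 90
        overallRr=(0.90*30+0.90*30+1.00*30)/90    % ===> 0.9333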

    After obtaining the overall recognition rate, we can compute per-person statistics and list the misclassified utterances together with the reference utterances they were incorrectly matched to. The next example writes these results to HTML files (a command-window alternative is sketched after the example):

    Example 5: speakerIdTextDependent02/goPostAnalysis.m

        sidPrm=sidPrmSet;
        load speakerData.mat
        correct=[speakerData2.correct];
        overallRr=sum(correct)/length(correct);
        % ====== Display each person's performance
        [junk, index]=sort([speakerData2.rr]);
        sortedSpeakerData2=speakerData2(index);
        outputFile=sprintf('%s/personRr_rr=%f%%.htm', sidPrm.outputDir, 100*overallRr);
        structDispInHtml(sortedSpeakerData2, sprintf('Performance of all persons (Overall RR=%.2f%%)', 100*overallRr), {'name', 'rr'}, [], [], outputFile);
        % ====== Display misclassified utterances
        sentenceData=[sortedSpeakerData2.sentence];
        sentenceDataMisclassified=sentenceData(~[sentenceData.correct]);
        outputFile=sprintf('%s/sentenceMisclassified_rr=%f%%.htm', sidPrm.outputDir, 100*overallRr);
        structDispInHtml(sentenceDataMisclassified, sprintf('Misclassified Sentences (Overall RR=%.2f%%)', 100*overallRr), {'path', 'computedSentencePath'}, [], [], outputFile);
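
    If the HTML reports are not needed, the same per-person statistics can be printed directly to the command window; a minimal sketch using the fields already stored in speakerData2:

        % Print each person's recognition rate in ascending order
        % (same information as the personRr_*.htm report generated above)
        [sortedRr, index]=sort([speakerData2.rr]);
        for i=1:length(index)
            fprintf('%s: RR = %.2f%%\n', speakerData2(index(i)).name, 100*sortedRr(i));
        end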

