english version
(請注意:中文版本並未隨英文版本同步更新! )
如果我們將每一筆資料的各個特徵(feature)視為它的座標,那麼我們就可以將全部的資料視為一群分佈在高維度空間中的點,而每筆資料的特徵數目便可視為該筆資料的維度。舉例來說,倘若我們現在擁有三百筆均帶有三個特徵的資料,我們便可將這些資料視為300個在三度空間中的點:
Example 1: inputExtraction01.m DS=prData('random3');
subplot(1,2,1); dsScatterPlot3(DS);
view(155, 46);
subplot(1,2,2); dsScatterPlot3(DS);
view(-44, 22);
在上圖中,左圖和右圖有相同的300筆資料,每筆資料均含有三個特徵。左圖的資料看似雜亂無章,這是因為我們將 3D 的圖形投影到 2D 的平面上的緣故。只要我們稍稍轉動我們的視角,便可清楚地看出這 300 筆資料分屬於三個不同的群組,如右圖。而如何找到一個適當視角,來將資料投影到低維度的空間,以使得不同類別的資料能夠散得越開越好,就是特徵粹取的目標。
Hint 在上述範例中,讀者可在 MATLAB 圖軸進行滑鼠的拖放,即可看到視角的改變,很有趣,請試試看!
我們也可以使用數學的方式來說明特徵選取。假設我們有一組特徵 V = {v1 , v2 ,... , vd },我們希望透過某種分類器來得到辨識率 J(˙),此辨識率是所選取特徵的函數;而特徵粹取的目標,就是要找出一組鑑別能力最佳的的集合 S,使得 J(S)≧J(T),其中 S 與 T 的元素均由 V 經由線性組合而成。
首先,我們將前一個範例的 3D 資料點,投影到 2D 平面上,就可以看出 LDA 的效果,如下:
Example 2: ldaRandom301.m DS=prData('random3');
DS2=lda(DS); DS2.input=DS2.input(1:2, :);
DS3=lda(DS); DS3.input=DS3.input(end-1:end, :);
subplot(2,1,1); dsScatterPlot(DS2); axis image
xlabel('Input 1'); ylabel('Input 2');
title('Projected to the first two dim of LDA');
subplot(2,1,2); dsScatterPlot(DS3); axis image
xlabel('Input 3'); ylabel('Input 4');
title('Projected to the last two dim of LDA');
當我們投影到 LDA 的前兩個維度,類別之間的區分很明顯,這就是 LDA 的效果。
在下面範例中,是我們針對 IRIS 資料進行線性識別分析:
Example 3: ldaIris2d01.m DS = prData('iris');
dataNum = size(DS.input, 2);
DS2 = lda(DS);
% ====== Projection to the first two eigenvectors
DS3=DS2; DS3.input=DS2.input(1:2, :);
subplot(2,1,1);
[recogRate, computed] = knncLoo(DS3, [], 1);
title(sprintf('LDA projection of %s data onto the first 2 discriminant vectors', DS.dataName));
xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate));
% ====== Projection to the last two eigenvectors
DS3=DS2; DS3.input=DS2.input(end-1:end, :);
subplot(2,1,2);
[recogRate, computed] = knncLoo(DS3, [], 1);
title(sprintf('LDA projection of %s data onto the last 2 discriminant vectors', DS.dataName));
xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate));
在上述範例中,我們採用 1-nearest-neighbor classifier 以及leave-one-out 來進行辨識率的測試。在第一個圖中,我們將 150 筆 IRIS 資料投影於第一組和第二組識別向量組成的平面上,可得到 98.00% 的辨識率,相當於只有 3 個資料點的分類錯誤。若投影於第三組和第四組識別向量組成的平面上,可得到第二個圖,可見不同類別的資料點有相當大的重疊部分,因此辨識率也較低,只有 87.33%,相當於 19 個分類錯誤的資料點。(在上圖中,所有分類錯誤的資料點都是以黑色叉叉來表示。)
若使用 WINE 資料來進行 LDA 投影,得到的效果如下:
Example 4: ldaWine2d01.m DS = prData('wine');
dataNum = size(DS.input, 2);
DS2 = lda(DS);
% ====== Projection to the first two eigenvectors
DS3=DS2; DS3.input=DS2.input(1:2, :);
subplot(2,1,1);
[recogRate, computed] = knncLoo(DS3, [], 1);
title(sprintf('LDA projection of %s data onto the first 2 discriminant vectors', DS.dataName));
xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate));
% ====== Projection to the last two eigenvectors
DS3=DS2; DS3.input=DS2.input(end-1:end, :);
subplot(2,1,2);
[recogRate, computed] = knncLoo(DS3, [], 1);
title(sprintf('LDA projection of %s data onto the last 2 discriminant vectors', DS.dataName));
xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate));
同樣我們可以看出,投影到前二維的資料都會有比較好的類別分開程度。
若以 KNNC 及 leave-one-out 來測試 LDA 投影的維度對辨識率的影響,可使用下列範例程式來測試 IRIS 資料:
Example 5: ldaIrisDim01.m DS=prData('iris');
[featureNum, dataNum] = size(DS.input);
[recogRate, computed] = knncLoo(DS);
fprintf('All data ===> LOO recog. rate = %d/%d = %g%%\n', sum(DS.output==computed), dataNum, 100*recogRate);
DS2 = lda(DS);
recogRate=[];
for i = 1:featureNum
DS3=DS2; DS3.input=DS3.input(1:i, :);
[recogRate(i), computed] = knncLoo(DS3);
fprintf('LDA dim = %d ===> LOO recog. rate = %d/%d = %g%%\n', i, sum(DS3.output==computed), dataNum, 100*recogRate(i));
end
plot(1:featureNum, 100*recogRate, 'o-'); grid on
xlabel('No. of projected features based on LDA');
ylabel('LOO recognition rates using KNNC (%)'); All data ===> LOO recog. rate = 144/150 = 96%
LDA dim = 1 ===> LOO recog. rate = 143/150 = 95.3333%
LDA dim = 2 ===> LOO recog. rate = 147/150 = 98%
LDA dim = 3 ===> LOO recog. rate = 142/150 = 94.6667%
LDA dim = 4 ===> LOO recog. rate = 144/150 = 96%
有此也可以看出,並非特徵個數越多,辨識率越好。以上例而言,最好的辨識率是 98.00%,發生在投影到二度空間時,對應的混淆矩陣如下:
Example 6: ldaIrisConf01.m DS=prData('iris');
DS2 = lda(DS);
DS3=DS2; DS3.input=DS3.input(1:2, :);
[recogRate, computedOutput] = knncLoo(DS3);
confMat=confMatGet(DS3.output, computedOutput);
opt=confMatPlot('defaultOpt');
opt.className=DS.outputName;
confMatPlot(confMat, opt);
在上述範例中,若左圖為 A 矩陣,右圖為 B 矩陣,則 A(i,j) 代表第 i 類被分到第 j 類的個數,而 B(i, j) 則代表第 i 類被分到第 j 類的機率,滿足 B(i, 1) + B(i, 2) + B(i, 3) = 100% for all i。
若以相同的方式來測試 wine 資料,結果如下:
Example 7: ldaWineDim01.m DS=prData('wine');
[featureNum, dataNum] = size(DS.input);
[recogRate, computed] = knncLoo(DS);
fprintf('All data ===> LOO recog. rate = %d/%d = %g%%\n', sum(DS.output==computed), dataNum, 100*recogRate);
DS2 = lda(DS);
recogRate=[];
for i = 1:featureNum
DS3=DS2; DS3.input=DS3.input(1:i, :);
[recogRate(i), computed] = knncLoo(DS3);
fprintf('LDA dim = %d ===> LOO recog. rate = %d/%d = %g%%\n', i, sum(DS3.output==computed), dataNum, 100*recogRate(i));
end
plot(1:featureNum, 100*recogRate, 'o-'); grid on
xlabel('No. of projected features based on LDA');
ylabel('LOO recognition rates using KNNC (%)'); All data ===> LOO recog. rate = 137/178 = 76.9663%
LDA dim = 1 ===> LOO recog. rate = 168/178 = 94.382%
LDA dim = 2 ===> LOO recog. rate = 168/178 = 94.382%
LDA dim = 3 ===> LOO recog. rate = 168/178 = 94.382%
LDA dim = 4 ===> LOO recog. rate = 173/178 = 97.191%
LDA dim = 5 ===> LOO recog. rate = 174/178 = 97.7528%
LDA dim = 6 ===> LOO recog. rate = 175/178 = 98.3146%
LDA dim = 7 ===> LOO recog. rate = 172/178 = 96.6292%
LDA dim = 8 ===> LOO recog. rate = 173/178 = 97.191%
LDA dim = 9 ===> LOO recog. rate = 170/178 = 95.5056%
LDA dim = 10 ===> LOO recog. rate = 168/178 = 94.382%
LDA dim = 11 ===> LOO recog. rate = 159/178 = 89.3258%
LDA dim = 12 ===> LOO recog. rate = 143/178 = 80.3371%
LDA dim = 13 ===> LOO recog. rate = 137/178 = 76.9663%
以上例而言,最好的辨識率是 98.31%,發生在投影到六度空間時,對應的混淆矩陣如下:
Example 8: ldaWineConf01.m DS=prData('wine');
DS2 = lda(DS);
DS3=DS2; DS3.input=DS3.input(1:6, :);
[recogRate, computedOutput] = knncLoo(DS3);
confMat=confMatGet(DS3.output, computedOutput);
confMatPlot(confMat);
若我們將上述範例寫成一個函數 ldaPerfViaKnncLoo.m,則可用此函數來測試「資料正規化」對於辨識率的影響,請見下列範例:
Example 9: ldaWineDim02.m DS=prData('wine');
recogRate1=ldaPerfViaKnncLoo(DS);
DS2=DS; DS2.input=inputNormalize(DS2.input); % input normalization
recogRate2=ldaPerfViaKnncLoo(DS2);
[featureNum, dataNum] = size(DS.input);
plot(1:featureNum, 100*recogRate1, 'o-', 1:featureNum, 100*recogRate2, '^-'); grid on
legend({'Raw data', 'Normalized data'}, 'location', 'northOutside', 'orientation', 'horizontal');
xlabel('No. of projected features based on LDA');
ylabel('LOO recognition rates using KNNC (%)');
有上述範例可以看出,對於這個應用而言,使用了資料正規化,使得辨識率提升很多,尤其是在資料維度是 6 時,辨識率可達 100%。
Hint 資料正規化對於辨識率的影響,隨不同的應用與不同的分類器而變,並非一定都可以提升辨識率。
Hint 其實,本章所得到的辨識率,會有一點點流於樂觀。因為我們是用所有的資料來進行 LDA,然後再做 KNNC 的 LOO 辨識率測試。換句話說,LDA 已經「偷看」了所有的資料,所以這個測試所得到的辨識率會偏高一點點。
如果資料維度大於資料筆數,則 LDA 通常會出現錯誤的結果,因為運算過程中,數個矩陣可能接近 Singular,導致計算出來的 eigenvalues 有複數。一般解決的方案,事先用 PCA 投影到低維空間,以便盡量維持資料的空間距離特性,然後再使用 LDA 求取對於分類最佳的投影方式。
References:
J. Duchene and S. Leclercq, "An optimal transformation for discriminant and principal component analysis", IEEE Trans. PAMI, vol. 10, pp.978-983, 1988
More info:
Appendix:
How to prove $T=W+B$ in the crisp case:
$$
\begin{array}{rcl}
T & = & \sum_{i=1}^n (a_i-\mu)(a_i-\mu)^T\\
& = & \sum_{i=1}^n (a_i a_i^T - \mu\mu^T)\\
& = & \sum_{i=1}^n a_i a_i^T - n \mu \mu^T\\
& = & AA^T-n\mu\mu^T\\
\end{array}
$$
$$
\begin{array}{rcl}
W & = & \sum_{k=1}^c w_k\\
& = & \sum_{k=1}^c (A_k A_k^T - n_k \mu_k \mu_k^T)\\
& = & \sum_{k=1}^c A_k A_k^T - \sum_{k=1}^c n_k \mu_k \mu_k^T\\
& = & AA^T- \sum_{k=1}^c n_k \mu_k \mu_k^T\\
\end{array}
$$
$$
\begin{array}{rcl}
B & = & \sum_{k=1}^c n_k (\mu_k-\mu)(\mu_k-\mu)^T\\
& = & \sum_{k=1}^c n_k \mu_k \mu_k^T - n\mu\mu^T\\
\end{array}
$$
Thus we have
$$
T=W+B.
$$
How to prove $T=W+B$ in the fuzzy case:
Define
$u_{k,i}$ = degree of point $i$ in class $k$.
$\alpha_k = \sum_{i=1}^n u_{k,i}$ = size of class k.
Then
$$
\mu_k = \frac{\sum_{i=1}^n u_{k,i} a_i}{\sum_{i=1}^n u_{k,i}} = \frac{\sum_{i=1}^n u_{k,i} a_i}{\alpha_k} \rightarrow \alpha_k \mu_k = \sum_{i=1}^n u_{k,i}a_i.
$$
Moreover, we have the following identities:
$$
\left\{
\begin{array}{l}
\sum_{k=1}^c u_{k,i} = 1.\\
\sum_{k=1}^c \alpha_k = \sum_{i=1}^n \sum_{k=1}^c u_{k,i} = n.\\
\sum_{k=1}^c \alpha_k \mu_k = \sum_{i=1}^n (\sum_{k=1}^c u_{k,i}) a_i = \sum_{i=1}^n a_i = n \mu.\\
\end{array}
\right.
$$
Using the above identities, we have
$$
\begin{array}{rcl}
T & = & \sum_{i=1}^n (a_i-\mu)(a_i-\mu)^T\\
& = & \sum_{i=1}^n (a_i a_i^T - \mu\mu^T)\\
& = & \sum_{i=1}^n a_i a_i^T - n \mu \mu^T\\
& = & AA^T-n\mu\mu^T\\
\end{array}
$$
$$
\begin{array}{rcl}
W_k & = & \sum_{i=1}^n u_{k,i} (a_i-\mu_k)(a_i-\mu_k)^T\\
& = & \sum_{i=1}^n u_{k,i} a_i a_i^T - (\sum_{i=1}^n u_{k,i} a_i) \mu_k^T - \mu_k ({\sum_{i=1}^n u_{k,i}} a_i^T) + \sum_{i=1}^n u_{k,i}\mu_k \mu_k^T\\
& = & \sum_{i=1}^n u_{k,i} a_i a_i^T - (\alpha_k \mu_k)\mu_k^T - \mu_k (\alpha_k \mu_k^T) + \sum_{i=1}^n u_{k,i}\mu_k \mu_k^T\\
& = & \sum_{i=1}^n u_{k,i} a_i a_i^T - \alpha_k \mu_k \mu_k^T\\
\end{array}
$$
$$
\begin{array}{rcl}
W & = & \sum_{k=1}^c w_k\\
& = & \sum_{i=1}^n \sum_{k=1}^c u_{k,i} a_i a_i^T - \sum_{k=1}^c \alpha_k \mu_k \mu_k^T\\
& = & \sum_{i=1}^n a_i a_i^T - \sum_{k=1}^c \alpha_k \mu_k \mu_k^T\\
\end{array}
$$
$$
\begin{array}{rcl}
B & = & \sum_{k=1}^c \alpha_k (\mu_k-\mu)(\mu_k-\mu)^T\\
& = & \sum_{k=1}^c \alpha_k \mu \mu^T - (\sum_{k=1}^c \alpha_k \mu_k) \mu^T - \mu (\sum_{k=1}^c \alpha_k \mu_k)^T + (\sum_{k=1}^c \alpha_k) \mu \mu^T\\
& = & \sum_{k=1}^c \alpha_k \mu \mu^T - (n \mu) \mu^T - \mu (n \mu)^T + (n) \mu \mu^T\\
& = & \sum_{k=1}^c \alpha_k \mu_k \mu_k^T - n\mu\mu^T\\
\end{array}
$$
Thus we have $T=W+B$ in the fuzzy case where each data point belong to a class to a certain degree.
Data Clustering and Pattern Recognition (資料分群與樣式辨認)