11-3 LDA (u悝OR)

english version (請注意：中文版本並未隨英文版本同步更新！)

如果我們將每一筆資料的各個特徵（feature）視為它的座標，那麼我們就可以將全部的資料視為一群分佈在高維度空間中的點，而每筆資料的特徵數目便可視為該筆資料的維度。舉例來說，倘若我們現在擁有三百筆均帶有三個特徵的資料，我們便可將這些資料視為300個在三度空間中的點：

Example 1: inputExtraction01.mDS=prData('random3'); subplot(1,2,1); dsScatterPlot3(DS); view(155, 46); subplot(1,2,2); dsScatterPlot3(DS); view(-44, 22);

在上圖中，左圖和右圖有相同的300筆資料，每筆資料均含有三個特徵。左圖的資料看似雜亂無章，這是因為我們將 3D 的圖形投影到 2D 的平面上的緣故。只要我們稍稍轉動我們的視角，便可清楚地看出這 300 筆資料分屬於三個不同的群組，如右圖。而如何找到一個適當視角，來將資料投影到低維度的空間，以使得不同類別的資料能夠散得越開越好，就是特徵粹取的目標。

Hint

在上述範例中，讀者可在 MATLAB 圖軸進行滑鼠的拖放，即可看到視角的改變，很有趣，請試試看！

我們也可以使用數學的方式來說明特徵選取。假設我們有一組特徵 V = {v₁ , v₂ ,... , v_d}，我們希望透過某種分類器來得到辨識率 J(˙)，此辨識率是所選取特徵的函數；而特徵粹取的目標，就是要找出一組鑑別能力最佳的的集合 S，使得 J(S)≧J(T)，其中 S 與 T 的元素均由 V 經由線性組合而成。

首先，我們將前一個範例的 3D 資料點，投影到 2D 平面上，就可以看出 LDA 的效果，如下：

Example 2: ldaRandom301.mDS=prData('random3'); DS2=lda(DS); DS2.input=DS2.input(1:2, :); DS3=lda(DS); DS3.input=DS3.input(end-1:end, :); subplot(2,1,1); dsScatterPlot(DS2); axis image xlabel('Input 1'); ylabel('Input 2'); title('Projected to the first two dim of LDA'); subplot(2,1,2); dsScatterPlot(DS3); axis image xlabel('Input 3'); ylabel('Input 4'); title('Projected to the last two dim of LDA');

當我們投影到 LDA 的前兩個維度，類別之間的區分很明顯，這就是 LDA 的效果。

在下面範例中，是我們針對 IRIS 資料進行線性識別分析：

Example 3: ldaIris2d01.mDS = prData('iris'); dataNum = size(DS.input, 2); DS2 = lda(DS); % ====== Projection to the first two eigenvectors DS3=DS2; DS3.input=DS2.input(1:2, :); subplot(2,1,1); [recogRate, computed] = knncLoo(DS3, [], 1); title(sprintf('LDA projection of %s data onto the first 2 discriminant vectors', DS.dataName)); xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate)); % ====== Projection to the last two eigenvectors DS3=DS2; DS3.input=DS2.input(end-1:end, :); subplot(2,1,2); [recogRate, computed] = knncLoo(DS3, [], 1); title(sprintf('LDA projection of %s data onto the last 2 discriminant vectors', DS.dataName)); xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate));

在上述範例中，我們採用 1-nearest-neighbor classifier 以及leave-one-out 來進行辨識率的測試。在第一個圖中，我們將 150 筆 IRIS 資料投影於第一組和第二組識別向量組成的平面上，可得到 98.00% 的辨識率，相當於只有 3 個資料點的分類錯誤。若投影於第三組和第四組識別向量組成的平面上，可得到第二個圖，可見不同類別的資料點有相當大的重疊部分，因此辨識率也較低，只有 87.33%，相當於 19 個分類錯誤的資料點。（在上圖中，所有分類錯誤的資料點都是以黑色叉叉來表示。）

若使用 WINE 資料來進行 LDA 投影，得到的效果如下：

Example 4: ldaWine2d01.mDS = prData('wine'); dataNum = size(DS.input, 2); DS2 = lda(DS); % ====== Projection to the first two eigenvectors DS3=DS2; DS3.input=DS2.input(1:2, :); subplot(2,1,1); [recogRate, computed] = knncLoo(DS3, [], 1); title(sprintf('LDA projection of %s data onto the first 2 discriminant vectors', DS.dataName)); xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate)); % ====== Projection to the last two eigenvectors DS3=DS2; DS3.input=DS2.input(end-1:end, :); subplot(2,1,2); [recogRate, computed] = knncLoo(DS3, [], 1); title(sprintf('LDA projection of %s data onto the last 2 discriminant vectors', DS.dataName)); xlabel(sprintf('KNNC''s leave-one-out recog. rate = %d/%d = %g%%', sum(DS3.output==computed), dataNum, 100*recogRate));

同樣我們可以看出，投影到前二維的資料都會有比較好的類別分開程度。

若以 KNNC 及 leave-one-out 來測試 LDA 投影的維度對辨識率的影響，可使用下列範例程式來測試 IRIS 資料：

Example 5: ldaIrisDim01.mDS=prData('iris'); [featureNum, dataNum] = size(DS.input); [recogRate, computed] = knncLoo(DS); fprintf('All data ===> LOO recog. rate = %d/%d = %g%%\n', sum(DS.output==computed), dataNum, 100*recogRate); DS2 = lda(DS); recogRate=[]; for i = 1:featureNum DS3=DS2; DS3.input=DS3.input(1:i, :); [recogRate(i), computed] = knncLoo(DS3); fprintf('LDA dim = %d ===> LOO recog. rate = %d/%d = %g%%\n', i, sum(DS3.output==computed), dataNum, 100*recogRate(i)); end plot(1:featureNum, 100*recogRate, 'o-'); grid on xlabel('No. of projected features based on LDA'); ylabel('LOO recognition rates using KNNC (%)');All data ===> LOO recog. rate = 144/150 = 96% LDA dim = 1 ===> LOO recog. rate = 143/150 = 95.3333% LDA dim = 2 ===> LOO recog. rate = 147/150 = 98% LDA dim = 3 ===> LOO recog. rate = 142/150 = 94.6667% LDA dim = 4 ===> LOO recog. rate = 144/150 = 96%

有此也可以看出，並非特徵個數越多，辨識率越好。以上例而言，最好的辨識率是 98.00%，發生在投影到二度空間時，對應的混淆矩陣如下：

Example 6: ldaIrisConf01.mDS=prData('iris'); DS2 = lda(DS); DS3=DS2; DS3.input=DS3.input(1:2, :); [recogRate, computedOutput] = knncLoo(DS3); confMat=confMatGet(DS3.output, computedOutput); opt=confMatPlot('defaultOpt'); opt.className=DS.outputName; confMatPlot(confMat, opt);

在上述範例中，若左圖為 A 矩陣，右圖為 B 矩陣，則 A(i,j) 代表第 i 類被分到第 j 類的個數，而 B(i, j) 則代表第 i 類被分到第 j 類的機率，滿足 B(i, 1) + B(i, 2) + B(i, 3) = 100% for all i。

若以相同的方式來測試 wine 資料，結果如下：

Example 7: ldaWineDim01.mDS=prData('wine'); [featureNum, dataNum] = size(DS.input); [recogRate, computed] = knncLoo(DS); fprintf('All data ===> LOO recog. rate = %d/%d = %g%%\n', sum(DS.output==computed), dataNum, 100*recogRate); DS2 = lda(DS); recogRate=[]; for i = 1:featureNum DS3=DS2; DS3.input=DS3.input(1:i, :); [recogRate(i), computed] = knncLoo(DS3); fprintf('LDA dim = %d ===> LOO recog. rate = %d/%d = %g%%\n', i, sum(DS3.output==computed), dataNum, 100*recogRate(i)); end plot(1:featureNum, 100*recogRate, 'o-'); grid on xlabel('No. of projected features based on LDA'); ylabel('LOO recognition rates using KNNC (%)');All data ===> LOO recog. rate = 137/178 = 76.9663% LDA dim = 1 ===> LOO recog. rate = 168/178 = 94.382% LDA dim = 2 ===> LOO recog. rate = 168/178 = 94.382% LDA dim = 3 ===> LOO recog. rate = 168/178 = 94.382% LDA dim = 4 ===> LOO recog. rate = 173/178 = 97.191% LDA dim = 5 ===> LOO recog. rate = 174/178 = 97.7528% LDA dim = 6 ===> LOO recog. rate = 175/178 = 98.3146% LDA dim = 7 ===> LOO recog. rate = 172/178 = 96.6292% LDA dim = 8 ===> LOO recog. rate = 173/178 = 97.191% LDA dim = 9 ===> LOO recog. rate = 170/178 = 95.5056% LDA dim = 10 ===> LOO recog. rate = 168/178 = 94.382% LDA dim = 11 ===> LOO recog. rate = 159/178 = 89.3258% LDA dim = 12 ===> LOO recog. rate = 143/178 = 80.3371% LDA dim = 13 ===> LOO recog. rate = 137/178 = 76.9663%

以上例而言，最好的辨識率是 98.31%，發生在投影到六度空間時，對應的混淆矩陣如下：

Example 8: ldaWineConf01.mDS=prData('wine'); DS2 = lda(DS); DS3=DS2; DS3.input=DS3.input(1:6, :); [recogRate, computedOutput] = knncLoo(DS3); confMat=confMatGet(DS3.output, computedOutput); confMatPlot(confMat);

若我們將上述範例寫成一個函數 ldaPerfViaKnncLoo.m，則可用此函數來測試「資料正規化」對於辨識率的影響，請見下列範例：

Example 9: ldaWineDim02.mDS=prData('wine'); recogRate1=ldaPerfViaKnncLoo(DS); DS2=DS; DS2.input=inputNormalize(DS2.input); % input normalization recogRate2=ldaPerfViaKnncLoo(DS2); [featureNum, dataNum] = size(DS.input); plot(1:featureNum, 100*recogRate1, 'o-', 1:featureNum, 100*recogRate2, '^-'); grid on legend({'Raw data', 'Normalized data'}, 'location', 'northOutside', 'orientation', 'horizontal'); xlabel('No. of projected features based on LDA'); ylabel('LOO recognition rates using KNNC (%)');

有上述範例可以看出，對於這個應用而言，使用了資料正規化，使得辨識率提升很多，尤其是在資料維度是 6 時，辨識率可達 100%。

Hint

資料正規化對於辨識率的影響，隨不同的應用與不同的分類器而變，並非一定都可以提升辨識率。

Hint

其實，本章所得到的辨識率，會有一點點流於樂觀。因為我們是用所有的資料來進行 LDA，然後再做 KNNC 的 LOO 辨識率測試。換句話說，LDA 已經「偷看」了所有的資料，所以這個測試所得到的辨識率會偏高一點點。

如果資料維度大於資料筆數，則 LDA 通常會出現錯誤的結果，因為運算過程中，數個矩陣可能接近 Singular，導致計算出來的 eigenvalues 有複數。一般解決的方案，事先用 PCA 投影到低維空間，以便盡量維持資料的空間距離特性，然後再使用 LDA 求取對於分類最佳的投影方式。

References:

J. Duchene and S. Leclercq, "An optimal transformation for discriminant and principal component analysis", IEEE Trans. PAMI, vol. 10, pp.978-983, 1988

More info:

Mathematical treatment of LDA.

Appendix:

How to prove $T=W+B$ in the crisp case: $$ \begin{array}{rcl} T & = & \sum_{i=1}^n (a_i-\mu)(a_i-\mu)^T\\ & = & \sum_{i=1}^n (a_i a_i^T - \mu\mu^T)\\ & = & \sum_{i=1}^n a_i a_i^T - n \mu \mu^T\\ & = & AA^T-n\mu\mu^T\\ \end{array} $$ $$ \begin{array}{rcl} W & = & \sum_{k=1}^c w_k\\ & = & \sum_{k=1}^c (A_k A_k^T - n_k \mu_k \mu_k^T)\\ & = & \sum_{k=1}^c A_k A_k^T - \sum_{k=1}^c n_k \mu_k \mu_k^T\\ & = & AA^T- \sum_{k=1}^c n_k \mu_k \mu_k^T\\ \end{array} $$ $$ \begin{array}{rcl} B & = & \sum_{k=1}^c n_k (\mu_k-\mu)(\mu_k-\mu)^T\\ & = & \sum_{k=1}^c n_k \mu_k \mu_k^T - n\mu\mu^T\\ \end{array} $$ Thus we have $$ T=W+B. $$
How to prove $T=W+B$ in the fuzzy case:
Define
- $u_{k,i}$ = degree of point $i$ in class $k$.
- $\alpha_k = \sum_{i=1}^n u_{k,i}$ = size of class k.
Then $$ \mu_k = \frac{\sum_{i=1}^n u_{k,i} a_i}{\sum_{i=1}^n u_{k,i}} = \frac{\sum_{i=1}^n u_{k,i} a_i}{\alpha_k} \rightarrow \alpha_k \mu_k = \sum_{i=1}^n u_{k,i}a_i. $$ Moreover, we have the following identities: $$ \left\{ \begin{array}{l} \sum_{k=1}^c u_{k,i} = 1.\\ \sum_{k=1}^c \alpha_k = \sum_{i=1}^n \sum_{k=1}^c u_{k,i} = n.\\ \sum_{k=1}^c \alpha_k \mu_k = \sum_{i=1}^n (\sum_{k=1}^c u_{k,i}) a_i = \sum_{i=1}^n a_i = n \mu.\\ \end{array} \right. $$ Using the above identities, we have $$ \begin{array}{rcl} T & = & \sum_{i=1}^n (a_i-\mu)(a_i-\mu)^T\\ & = & \sum_{i=1}^n (a_i a_i^T - \mu\mu^T)\\ & = & \sum_{i=1}^n a_i a_i^T - n \mu \mu^T\\ & = & AA^T-n\mu\mu^T\\ \end{array} $$ $$ \begin{array}{rcl} W_k & = & \sum_{i=1}^n u_{k,i} (a_i-\mu_k)(a_i-\mu_k)^T\\ & = & \sum_{i=1}^n u_{k,i} a_i a_i^T - (\sum_{i=1}^n u_{k,i} a_i) \mu_k^T - \mu_k ({\sum_{i=1}^n u_{k,i}} a_i^T) + \sum_{i=1}^n u_{k,i}\mu_k \mu_k^T\\ & = & \sum_{i=1}^n u_{k,i} a_i a_i^T - (\alpha_k \mu_k)\mu_k^T - \mu_k (\alpha_k \mu_k^T) + \sum_{i=1}^n u_{k,i}\mu_k \mu_k^T\\ & = & \sum_{i=1}^n u_{k,i} a_i a_i^T - \alpha_k \mu_k \mu_k^T\\ \end{array} $$ $$ \begin{array}{rcl} W & = & \sum_{k=1}^c w_k\\ & = & \sum_{i=1}^n \sum_{k=1}^c u_{k,i} a_i a_i^T - \sum_{k=1}^c \alpha_k \mu_k \mu_k^T\\ & = & \sum_{i=1}^n a_i a_i^T - \sum_{k=1}^c \alpha_k \mu_k \mu_k^T\\ \end{array} $$ $$ \begin{array}{rcl} B & = & \sum_{k=1}^c \alpha_k (\mu_k-\mu)(\mu_k-\mu)^T\\ & = & \sum_{k=1}^c \alpha_k \mu \mu^T - (\sum_{k=1}^c \alpha_k \mu_k) \mu^T - \mu (\sum_{k=1}^c \alpha_k \mu_k)^T + (\sum_{k=1}^c \alpha_k) \mu \mu^T\\ & = & \sum_{k=1}^c \alpha_k \mu \mu^T - (n \mu) \mu^T - \mu (n \mu)^T + (n) \mu \mu^T\\ & = & \sum_{k=1}^c \alpha_k \mu_k \mu_k^T - n\mu\mu^T\\ \end{array} $$ Thus we have $T=W+B$ in the fuzzy case where each data point belong to a class to a certain degree.

Data Clustering and Pattern Recognition (資料分群與樣式辨認)