11-2 PCA

English version (Please note: the Chinese version is not updated in sync with the English version!)

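First, let us check whether the principal axes produced by PCA agree with the distribution of the data, as shown in the following example: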

Example 1: pca01.m

clear j
dataNum = 1000;
data = randn(1,dataNum)+j*randn(1,dataNum)/3;
data = data*exp(j*pi/6);		% Rotate by 30 degrees
data = data-mean(data);			% Mean subtraction
plot(real(data), imag(data), '.'); axis image;
DS.input=[real(data); imag(data)];
[DS2, v, eigValue] = pca(DS);
v1 = v(:, 1);
v2 = v(:, 2);
arrow = [-1 0 nan -0.1 0 -0.1]+1+j*[0 0 nan 0.1 0 -0.1];
arrow1 = 2*arrow*(v1(1)+j*v1(2))*eigValue(1)/dataNum;
arrow2 = 2*arrow*(v2(1)+j*v2(2))*eigValue(2)/dataNum;
line(real(arrow1), imag(arrow1), 'color', 'r', 'linewidth', 4);
line(real(arrow2), imag(arrow2), 'color', 'k', 'linewidth', 4);
title('Axes for PCA');

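Clearly, the first principal axis of PCA (the red arrow) lies along the direction in which the data are most spread out, and the second principal axis (the black arrow) is perpendicular to it.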
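The same check can be sketched without the toolbox, using only built-in functions. The block below is a minimal illustration, not the toolbox's pca implementation: it assumes the principal axes are taken as the eigenvectors of the sample covariance matrix, sorted by decreasing eigenvalue, and all variable names (R, C, V, angleDeg) are chosen here for illustration. No random seed is set, so the printed numbers vary slightly between runs.

```matlab
% Minimal sketch: principal axes from the eigen-decomposition of the covariance matrix
dataNum = 1000;
theta = pi/6;                                   % same 30-degree rotation as in Example 1
X = [randn(1,dataNum); randn(1,dataNum)/3];     % std 1 along x, std 1/3 along y
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];
X = R*X;                                        % rotate the point cloud
X = X - repmat(mean(X,2), 1, dataNum);          % mean subtraction
C = X*X'/(dataNum-1);                           % sample covariance matrix (2x2)
C = (C+C')/2;                                   % enforce exact symmetry before eig
[V, D] = eig(C);
[eigValue, idx] = sort(diag(D), 'descend');     % largest variance first
V = V(:, idx);
v1 = V(:,1);                                    % first principal axis (sign is arbitrary)
angleDeg = mod(atan2(v1(2), v1(1))*180/pi, 180);    % fold the sign ambiguity into [0,180)
fprintf('First axis angle = %.2f degrees\n', angleDeg);
fprintf('Variances along the two axes = %.3f, %.3f\n', eigValue(1), eigValue(2));
```

The recovered angle should be close to the 30-degree rotation, and the two variances should be close to 1 and 1/9, the squared spreads used to generate the data.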

In the following example, we apply PCA to the 150 entries of the IRIS dataset:

Example 2: pcaIris01.m

DS=prData('iris');
DS2=pca(DS);
DS3=DS2; DS3.input=DS3.input(1:2, :);			% Keep the first two dimensions
subplot(2,1,1); dsScatterPlot(DS3); axis image
xlabel('Input 1'); ylabel('Input 2');
title('IRIS projected onto the first two dim of PCA');
DS3=DS2; DS3.input=DS3.input(end-1:end, :);		% Keep the last two dimensions
subplot(2,1,2); dsScatterPlot(DS3); axis image
xlabel('Input 3'); ylabel('Input 4');
title('IRIS projected onto the last two dim of PCA');

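In the example above, the first plot projects the IRIS data onto the first and second PCA vectors, and the second plot projects it onto the third and fourth PCA vectors. Clearly, the data points are widely spread in the first plot and much less so in the second. (Note that the axis ranges of the second plot are much smaller than those of the first.) We can carry out a similar computation for the WINE dataset, as shown next: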

Example 3: pcaWine01.m

DS=prData('wine');
DS2=pca(DS);
DS3=DS2; DS3.input=DS3.input(1:2, :);			% Keep the first two dimensions
subplot(2,1,1); dsScatterPlot(DS3); axis image
xlabel('Input 1'); ylabel('Input 2');
title('WINE projected onto the first two dim of PCA');
DS3=DS2; DS3.input=DS3.input(end-1:end, :);		% Keep the last two dimensions
subplot(2,1,2); dsScatterPlot(DS3); axis image
xlabel('Input 12'); ylabel('Input 13');
title('WINE projected onto the last two dim of PCA');

Clearly, the data in the first plot are much more spread out than those in the second.

Hint

Although both the IRIS and WINE datasets contain class labels, we did not use this class information when computing PCA in the examples above.

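The idea behind PCA is to "spread the data out" (that is, to project the data onto the subspace with the largest variance) without considering the class of each data point. Strictly speaking, therefore, it is not entirely suitable for pattern recognition problems. However, since "spreading the data out" has something in common with "spreading data of different classes apart", PCA is often used in pattern recognition problems with very high-dimensional data (face recognition, for example) to reduce both the data dimension and the amount of computation.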
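To make the point about ignored class information concrete, here is a small sketch on synthetic data (hypothetical, not taken from IRIS or WINE). The two classes share a large spread along the first coordinate and differ only in their means along the second, low-variance coordinate. The sketch assumes PCA is computed as the eigen-decomposition of the sample covariance matrix and uses built-in functions only.

```matlab
% Sketch: the direction of maximum variance need not separate the classes
n = 200;                                        % points per class (illustrative)
A = [3*randn(1,n); 0.3*randn(1,n) + 1];         % class 1: mean +1 along y
B = [3*randn(1,n); 0.3*randn(1,n) - 1];         % class 2: mean -1 along y
X = [A, B];
label = [ones(1,n), 2*ones(1,n)];               % used only for evaluation, never by PCA
X = X - repmat(mean(X,2), 1, 2*n);              % mean subtraction
C = X*X'/(2*n-1);
C = (C+C')/2;                                   % enforce exact symmetry before eig
[V, D] = eig(C);
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);
Y = V'*X;                                       % row k = k-th principal component
gap = zeros(1,2);
for k = 1:2
    gap(k) = abs(mean(Y(k, label==1)) - mean(Y(k, label==2)));
    fprintf('PC %d: variance = %.3f, class-mean gap = %.3f\n', k, var(Y(k,:)), gap(k));
end
```

In this construction the first principal component carries most of the variance but almost none of the class separation, while the second component separates the classes well. Keeping only the first component would therefore discard the useful feature.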

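To see how the number of PCA-projected dimensions affects the recognition rate, measured with KNNC and leave-one-out (LOO), we can run the following example on the IRIS dataset: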

Example 4: pcaIrisDim01.m

DS=prData('iris');
[featureNum, dataNum] = size(DS.input);
[recogRate, computed] = knncLoo(DS);
fprintf('All data ===> LOO recog. rate = %d/%d = %g%%\n', sum(DS.output==computed), dataNum, 100*recogRate);
DS2 = pca(DS);
recogRate=[];
for i = 1:featureNum
	DS3=DS2; DS3.input=DS3.input(1:i, :);
	[recogRate(i), computed] = knncLoo(DS3);
	fprintf('PCA dim = %d ===> LOO recog. rate = %d/%d = %g%%\n', i, sum(DS3.output==computed), dataNum, 100*recogRate(i));
end
plot(1:featureNum, 100*recogRate, 'o-'); grid on
xlabel('No. of projected features based on PCA');
ylabel('LOO recognition rates using KNNC (%)');

Output:

All data ===> LOO recog. rate = 144/150 = 96%
PCA dim = 1 ===> LOO recog. rate = 133/150 = 88.6667%
PCA dim = 2 ===> LOO recog. rate = 144/150 = 96%
PCA dim = 3 ===> LOO recog. rate = 144/150 = 96%
PCA dim = 4 ===> LOO recog. rate = 144/150 = 96%

Testing the WINE dataset in the same way gives the following result:

Example 5: pcaWineDim01.m

DS=prData('wine');
[featureNum, dataNum] = size(DS.input);
[recogRate, computed] = knncLoo(DS);
fprintf('All data ===> LOO recog. rate = %d/%d = %g%%\n', sum(DS.output==computed), dataNum, 100*recogRate);
DS2 = pca(DS);
recogRate=[];
for i = 1:featureNum
	DS3=DS2; DS3.input=DS3.input(1:i, :);
	[recogRate(i), computed] = knncLoo(DS3);
	fprintf('PCA dim = %d ===> LOO recog. rate = %d/%d = %g%%\n', i, sum(DS3.output==computed), dataNum, 100*recogRate(i));
end
plot(1:featureNum, 100*recogRate, 'o-'); grid on
xlabel('No. of projected features based on PCA');
ylabel('LOO recognition rates using KNNC (%)');

Output:

All data ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 1 ===> LOO recog. rate = 126/178 = 70.7865%
PCA dim = 2 ===> LOO recog. rate = 128/178 = 71.9101%
PCA dim = 3 ===> LOO recog. rate = 130/178 = 73.0337%
PCA dim = 4 ===> LOO recog. rate = 135/178 = 75.8427%
PCA dim = 5 ===> LOO recog. rate = 136/178 = 76.4045%
PCA dim = 6 ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 7 ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 8 ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 9 ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 10 ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 11 ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 12 ===> LOO recog. rate = 137/178 = 76.9663%
PCA dim = 13 ===> LOO recog. rate = 137/178 = 76.9663%

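Judging from these two examples, it seems that the more features we keep, the better the result: the recognition rate rises with the number of projected features and then levels off at the rate obtained with all the original features, never exceeding it. In other words, for classification, PCA does not seem able to pick out a small set of effective features that maximizes the recognition rate. This is reasonable, since PCA does not take class information into account when choosing the projection directions. (In the next section, you can compare these results with those obtained by applying LDA to the same datasets.)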

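If we wrap the example above into a function, pcaPerfViaKnncLoo.m, we can use it to test the effect of data normalization on the recognition rate, as shown in the following example: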

Example 6: pcaWineDim02.m

DS=prData('wine');
recogRate1=pcaPerfViaKnncLoo(DS);
DS2=DS; DS2.input=inputNormalize(DS2.input);	% Data normalization
recogRate2=pcaPerfViaKnncLoo(DS2);
[featureNum, dataNum] = size(DS.input);
plot(1:featureNum, 100*recogRate1, 'o-', 1:featureNum, 100*recogRate2, '^-'); grid on
legend('Raw data', 'Normalized data');
xlabel('No. of projected features based on PCA');
ylabel('LOO recognition rates using KNNC (%)');

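As the example above shows, data normalization raises the recognition rate considerably for this application. However, the overall recognition rate is still lower than that of LDA; see the next section for an introduction to LDA and related examples.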

Hint

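The effect of data normalization on the recognition rate varies with the application and the classifier; it does not always improve the recognition rate.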
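As a rough illustration of what data normalization does, the sketch below rescales each feature (row) to zero mean and unit variance using built-in functions. This is an assumption about the kind of normalization meant here; it is not the source of inputNormalize, whose exact behavior should be checked in the toolbox. The matrix X is made-up toy data.

```matlab
% Sketch: per-feature z-score normalization (assumed form of data normalization)
X = [1 2 3 4 5; 100 300 200 500 400];           % 2 features x 5 samples, very different scales
dataNum = size(X, 2);
mu = mean(X, 2);                                % per-feature mean
sigma = std(X, 0, 2);                           % per-feature standard deviation
Z = (X - repmat(mu, 1, dataNum)) ./ repmat(sigma, 1, dataNum);
fprintf('Variances before normalization: %g, %g\n', var(X, 0, 2));
fprintf('Variances after normalization:  %g, %g\n', var(Z, 0, 2));
```

Without such rescaling, a feature measured in large units dominates the covariance matrix, so the leading PCA directions mostly follow that one feature. After rescaling, every feature contributes on a comparable scale.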

Hint

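In fact, the recognition rates obtained in this chapter are slightly optimistic, because we run PCA on all of the data first and only then perform the LOO test with KNNC. In other words, PCA has already "peeked" at all the data, including the entry held out in each LOO round, so the recognition rates from this test are biased slightly upward.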
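One way to avoid this peeking is to recompute PCA inside every leave-one-out round, using only the remaining training data, and then project the held-out entry with that round's axes. The sketch below shows the idea with built-in functions, a 1-nearest-neighbor rule, and synthetic three-class data. It does not use knncLoo or the toolbox's pca, and the data, the variable names, and the choice keepDim = 2 are all hypothetical.

```matlab
% Sketch: leave-one-out test with PCA refitted on the training part of each round
nPerClass = 30; d = 4; keepDim = 2;             % illustrative sizes
centers = [0 0 0 0; 3 3 0 0; 0 3 3 0]';         % d x 3 class centers (made up)
X = []; y = [];
for c = 1:3
    X = [X, repmat(centers(:,c), 1, nPerClass) + randn(d, nPerClass)];
    y = [y, c*ones(1, nPerClass)];
end
n = size(X, 2);
correct = 0;
for i = 1:n
    trainIdx = [1:i-1, i+1:n];                  % hold out entry i
    Xtr = X(:, trainIdx); ytr = y(trainIdx);
    mu = mean(Xtr, 2);                          % mean from training data only
    Xc = Xtr - repmat(mu, 1, n-1);
    C = Xc*Xc'/(n-2);
    C = (C+C')/2;                               % enforce exact symmetry before eig
    [V, D] = eig(C);
    [~, idx] = sort(diag(D), 'descend');
    W = V(:, idx(1:keepDim));                   % leading principal axes of this round
    Ztr = W'*Xc;                                % projected training data
    zte = W'*(X(:,i) - mu);                     % held-out entry, same mean and axes
    dist2 = sum((Ztr - repmat(zte, 1, n-1)).^2, 1);
    [~, nn] = min(dist2);                       % 1-nearest neighbor
    correct = correct + (ytr(nn) == y(i));
end
recogRate = correct/n;
fprintf('LOO recog. rate with per-round PCA = %d/%d = %g%%\n', correct, n, 100*recogRate);
```

This costs one eigen-decomposition per held-out entry, which is cheap for small feature dimensions but may become noticeable for high-dimensional data.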

More info:

Data Clustering and Pattern Recognition (資料分群與樣式辨認)