First of all, to verify the functionality of PCA, we can display the PCA-generated basis for an elliptically distributed dataset, as shown next.
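As a rough illustration of the idea (a minimal sketch, not necessarily the exact code used to produce the plot), the following MATLAB snippet generates an elliptically distributed 2D dataset, computes the PCA basis from the eigenvectors of the covariance matrix, and overlays the two principal directions on the data:

rng(0);                                              % for reproducibility
data = randn(1000, 2) * [3 0; 0 1];                  % stretch along the x-axis
theta = pi/6;                                        % then rotate by 30 degrees
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];
data = data * R';
mu = mean(data);                                     % center of the dataset
[V, D] = eig(cov(data));                             % eigenvectors/eigenvalues of the covariance matrix
[~, idx] = sort(diag(D), 'descend');                 % sort by decreasing eigenvalue
V = V(:, idx);
plot(data(:,1), data(:,2), '.'); hold on; axis equal;
quiver(mu(1), mu(2), V(1,1), V(2,1), 3, 'r', 'LineWidth', 2);   % 1st principal direction
quiver(mu(1), mu(2), V(1,2), V(2,2), 3, 'g', 'LineWidth', 2);   % 2nd principal direction
hold off;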
It is obvious that the first principal component (the first direction of the projection basis) lies along the direction in which the dispersion of the projected data is maximized.
In the next example, we perform PCA on the 150 entries of the IRIS dataset:
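A minimal sketch of such a computation is given below, assuming the IRIS data is available as the 150-by-4 matrix meas (e.g., via "load fisheriris" from the Statistics and Machine Learning Toolbox); this is not necessarily the exact command sequence used to produce the plots:

load fisheriris;                          % meas: 150x4 features, species: class labels
X = meas - mean(meas);                    % zero-mean the data
[V, D] = eig(cov(X));                     % PCA basis from the covariance matrix
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);                            % columns = principal components
Y = X * V;                                % project the data onto the PCA basis
subplot(1,2,1); gscatter(Y(:,1), Y(:,2), species); title('PC 1 vs. PC 2');
subplot(1,2,2); gscatter(Y(:,3), Y(:,4), species); title('PC 3 vs. PC 4');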
The first plot shows the projection of the dataset along the first and second principal components, while the second plot shows the same dataset projected along the third and fourth principal components. Again, it is obvious that the first plot has a wider dispersion than the second. (Note that the axes of the second plot cover a much smaller range than those of the first, indicating that the variance after projection is also much smaller.)
For the WINE dataset, we can perform a similar computation, as follows:
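The same steps can be sketched for the WINE data; here we assume the 178-by-13 feature matrix and the class labels are read from the UCI "wine.data" file (the file name and column layout are assumptions, not part of the original example):

raw = csvread('wine.data');               % 1st column = class label, remaining columns = features
y = raw(:,1); X = raw(:,2:end);
Xc = X - mean(X);                         % zero-mean the features
[V, D] = eig(cov(Xc));
[~, idx] = sort(diag(D), 'descend');
Y = Xc * V(:, idx);                       % projection onto the PCA basis
subplot(1,2,1); gscatter(Y(:,1), Y(:,2), y); title('PC 1 vs. PC 2');
subplot(1,2,2); gscatter(Y(:,3), Y(:,4), y); title('PC 3 vs. PC 4');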
Again, the variance in the first plot is much larger than that in the second plot.
The goal of PCA is to maximize the variance after projection; the class labels, if they exist, are not taken into account in deriving the projection. As a result, PCA is not really optimized for datasets of classification problems. However, since "maximum variance after projection" and "maximum separation between classes after projection" share some characteristics, PCA is sometimes used for classification problems as well. For instance, in face recognition, the dimensionality of each face image is so large that we need to apply PCA for dimension reduction, which in turn leads to better accuracy.
In the next example, we test the effect of PCA dimension reduction on the classification accuracy of the IRIS dataset via KNNC (k-nearest-neighbor classification) and leave-one-out cross-validation:
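The experiment can be sketched as follows (again a rough sketch rather than the toolbox code): project the IRIS data onto the first d principal components and evaluate a 1-nearest-neighbor classifier with leave-one-out cross-validation for each d:

load fisheriris;                          % meas: 150x4 features, species: class labels
X = meas - mean(meas);
[V, D] = eig(cov(X));
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);
labels = grp2idx(species);                % convert class labels to integers
n = size(X, 1);
for d = 1:size(X, 2)
    Y = X * V(:, 1:d);                    % keep the first d principal components
    correct = 0;
    for i = 1:n                           % leave-one-out loop
        dist = sum((Y - Y(i,:)).^2, 2);   % squared distances from sample i to all samples
        dist(i) = inf;                    % exclude the sample itself
        [~, nn] = min(dist);              % index of the nearest neighbor
        correct = correct + (labels(nn) == labels(i));
    end
    fprintf('d = %d: LOO accuracy = %.2f%%\n', d, 100*correct/n);
end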
If we apply the same procedure to the WINE dataset, the result is shown next:
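In the sketch above, the only change needed for the WINE dataset is the data loading (again assuming the UCI "wine.data" file):

raw = csvread('wine.data');
labels = raw(:,1);                        % class labels
X = raw(:,2:end) - mean(raw(:,2:end));    % zero-mean features
% ... then rerun the PCA-and-LOO loop above with these X and labels.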
From the above two examples, it seems that the more features (after PCA projection) we keep, the better the accuracy we obtain. In other words, a straightforward application of PCA cannot select the most effective features for classification. This is reasonable since the class labels are not used in determining the PCA projection basis. (Please compare these results with those obtained from LDA in the next section.)
We can convert the above two examples into a function pcaKnncLoo.m for testing the effects of PCA dimension reduction. Using this function, we can apply data normalization to see how it improves the accuracy, as follows:
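A minimal sketch of what such a function might look like is given below; the function name pcaKnncLooSketch and the normalize flag are illustrative stand-ins, not the actual interface of pcaKnncLoo.m:

function acc = pcaKnncLooSketch(X, labels, normalize)
% Returns the leave-one-out 1-NN accuracy for each reduced dimension after PCA.
if normalize
    X = (X - mean(X)) ./ std(X);          % z-score normalization of each feature
end
X = X - mean(X);                          % zero-mean (a no-op if already normalized)
[V, D] = eig(cov(X));
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);
n = size(X, 1);
acc = zeros(1, size(X, 2));               % LOO accuracy for each reduced dimension
for d = 1:size(X, 2)
    Y = X * V(:, 1:d);
    correct = 0;
    for i = 1:n
        dist = sum((Y - Y(i,:)).^2, 2);
        dist(i) = inf;
        [~, nn] = min(dist);
        correct = correct + (labels(nn) == labels(i));
    end
    acc(d) = correct / n;
end
end

It could then be called as follows, assuming X and labels hold the WINE features and class labels:

accRaw  = pcaKnncLooSketch(X, labels, false);   % without normalization
accNorm = pcaKnncLooSketch(X, labels, true);    % with z-score normalization
plot(1:numel(accRaw), 100*accRaw, 'o-', 1:numel(accNorm), 100*accNorm, 's-');
xlabel('Number of principal components'); ylabel('LOO accuracy (%)');
legend('Raw features', 'Normalized features');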
From the example, it is obvious that data normalization does improve the accuracy significantly for this dataset. (But the overall accuracy is still not as good as that of LDA, as explained in the next section.)