If we treat each data entry as a point in a high-dimensional space, then we can visualize the dataset directly whenever its dimensionality is less than or equal to 3. For instance, the following example plots a dataset (3 dimensions, 300 entries) with 3 class labels from different viewing angles:
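The original plotting code is not reproduced here; the following is a minimal MATLAB sketch (using a synthetic, hypothetical 3-class dataset of 300 entries, not the dataset of the original example) of how such a plot can be generated and shown from two viewing angles:

```matlab
% Synthetic 3-D dataset with 3 classes, 100 entries each (illustrative only).
rng(0);                                      % for reproducibility
n = 100;
X = [randn(n,3); randn(n,3)+repmat([4 0 1],n,1); randn(n,3)+repmat([2 4 2],n,1)];
y = [ones(n,1); 2*ones(n,1); 3*ones(n,1)];
colors = 'rgb'; markers = 'o^s';
figure;
for p = 1:2
    subplot(1, 2, p); hold on;
    for k = 1:3
        idx = (y == k);
        plot3(X(idx,1), X(idx,2), X(idx,3), [colors(k) markers(k)]);
    end
    grid on; box on; axis equal;
    if p == 1, view(90, 0);   title('Viewing angle 1');
    else,      view(-40, 20); title('Viewing angle 2');
    end
end
```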
The left and right plots show the same 300 entries from different viewing angles. The left plot looks rather random because, from that angle, the classes overlap once projected onto the 2D plane. On the other hand, the right plot gives much better separation between classes after the 2D projection. As you can imagine, the goal of LDA is to find the best projection (or, equivalently, the best viewing angle) such that the separation between classes is maximized.
The ultimate goal of feature extraction can be stated mathematically. Suppose we have a dataset with a feature set V = {v1, v2, ..., vd}, and a classifier whose recognition rate is denoted by J(·), a function of the transformed features. The objective of feature extraction is then to find the best set of transformed features S such that J(S) ≥ J(T), where both S and T are sets of linearly transformed features obtained from V.
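As a concrete instance of such a criterion: LDA does not maximize the recognition rate $J(\cdot)$ directly. A standard surrogate (stated here for reference, using the within-class scatter $W$ and between-class scatter $B$ defined in the appendix) is the Fisher ratio along a projection direction $p$:
$$ J(p) = \frac{p^T B\, p}{p^T W\, p}, $$
whose stationary points are the generalized eigenvectors of $B p = \lambda W p$; LDA keeps the directions with the largest eigenvalues.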
Let us use LDA to project the dataset used in the previous example to 2D space:
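The toolbox commands used in the original example are not shown here. As an illustrative sketch, the LDA directions can be computed directly from the scatter matrices; the hypothetical helper `ldaVec.m` below (a minimal implementation, not the toolbox function) is reused by the later sketches in this section:

```matlab
function V = ldaVec(X, y)
% ldaVec - minimal LDA sketch (illustrative; not the toolbox implementation).
% X: n-by-d data matrix (one entry per row); y: n-by-1 integer class labels.
% Returns V: d-by-d matrix whose columns are LDA directions, sorted by
% decreasing discriminability.
d = size(X, 2);
mu = mean(X, 1);                                  % overall mean
classes = unique(y);
Sw = zeros(d); Sb = zeros(d);
for k = 1:numel(classes)
    Xk = X(y == classes(k), :);                   % entries of class k
    nk = size(Xk, 1);
    muk = mean(Xk, 1);
    Sw = Sw + (Xk - muk)' * (Xk - muk);           % within-class scatter W
    Sb = Sb + nk * (muk - mu)' * (muk - mu);      % between-class scatter B
end
[V, D] = eig(Sb, Sw);                             % generalized eigenvectors of Sb*v = lambda*Sw*v
[~, idx] = sort(diag(D), 'descend');
V = V(:, idx);
end
```

With X and y from the previous sketch, the 2D projection is then simply:

```matlab
V = ldaVec(X, y);                        % LDA basis
X2 = X * V(:, 1:2);                      % keep the two most discriminative directions
figure; gscatter(X2(:,1), X2(:,2), y);   % gscatter: Statistics and Machine Learning Toolbox
xlabel('LDA dimension 1'); ylabel('LDA dimension 2');
```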
When we perform the LDA projection onto the first two dimensions, it is obvious that the separation between different classes in plot 1 is much larger than that in plot 2. The next example applies LDA to the IRIS dataset:
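A sketch of this experiment, using MATLAB's built-in Fisher iris data rather than the toolbox's IRIS loader, plus the `ldaVec.m` helper above and a second hypothetical helper `knn1LooAcc.m` defined here:

```matlab
function [acc, predicted] = knn1LooAcc(Z, y)
% knn1LooAcc - leave-one-out accuracy of a 1-nearest-neighbor classifier
% (illustrative sketch). Z: n-by-m features; y: n-by-1 class labels.
n = size(Z, 1);
predicted = zeros(n, 1);
for i = 1:n
    others = [1:i-1, i+1:n];                     % hide entry i
    dist = sum((Z(others,:) - Z(i,:)).^2, 2);    % squared distances to the rest
    [~, j] = min(dist);
    predicted(i) = y(others(j));                 % label of the nearest remaining entry
end
acc = mean(predicted == y);
end
```

```matlab
load fisheriris                                  % meas (150x4), species (labels)
X = meas;  y = grp2idx(species);                 % grp2idx: Statistics and Machine Learning Toolbox
V = ldaVec(X, y);
Zfirst = X * V(:, 1:2);                          % first 2 LDA dimensions
Zlast  = X * V(:, end-1:end);                    % last 2 LDA dimensions
figure;
subplot(1,2,1); gscatter(Zfirst(:,1), Zfirst(:,2), species); title('First 2 LDA dims');
subplot(1,2,2); gscatter(Zlast(:,1),  Zlast(:,2),  species); title('Last 2 LDA dims');
accFirst = knn1LooAcc(Zfirst, y)                 % LOO accuracy, first 2 dims
accLast  = knn1LooAcc(Zlast,  y)                 % LOO accuracy, last 2 dims
```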
In the above example, we adopt a 1-nearest-neighbor classifier and the leave-one-out criterion for performance evaluation. In plot 1, the dataset is projected onto the first 2 dimensions of LDA and the accuracy is 98.00%, corresponding to 3 misclassified cases. If the dataset is projected onto the last 2 dimensions of LDA, as shown in plot 2 where the degree of class overlap is larger, the accuracy is 87.33%, corresponding to 19 misclassified cases. (In the plots, the misclassified cases are denoted by black crosses.) We can apply the same procedure to analyze the WINE dataset using LDA:
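A sketch of the same procedure for WINE. The loading step is an assumption: X is taken to be the 178-by-13 wine feature matrix and y the 178-by-1 class labels, loaded beforehand from whatever source is convenient (e.g., the UCI wine data); the toolbox's own WINE loader is not reproduced.

```matlab
% Assumption: X (178x13 wine features) and y (178x1 labels, values 1..3)
% are already in the workspace.
V = ldaVec(X, y);
accFirst = knn1LooAcc(X * V(:, 1:2), y)          % first 2 LDA dimensions
accLast  = knn1LooAcc(X * V(:, end-1:end), y)    % last 2 LDA dimensions
figure; gscatter(X*V(:,1), X*V(:,2), y);
xlabel('LDA dimension 1'); ylabel('LDA dimension 2');
```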
Again, the projection along the first 2 dimensions gives better class separation. The following example demonstrates the effect of LDA dimensionality reduction on the classification accuracy (obtained via KNNC and the leave-one-out criterion) of the IRIS dataset:
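A sketch of this sweep for IRIS, reusing `ldaVec.m` and `knn1LooAcc.m` from the earlier sketches (note that, as in the original example, the LDA basis is computed from the whole dataset, i.e., the 'approximate' mode discussed later):

```matlab
load fisheriris
X = meas;  y = grp2idx(species);
V = ldaVec(X, y);
d = size(X, 2);
acc = zeros(1, d);
for m = 1:d
    acc(m) = knn1LooAcc(X * V(:, 1:m), y);       % keep the first m LDA dimensions
end
figure; plot(1:d, 100*acc, 'o-');
xlabel('Number of LDA dimensions'); ylabel('LOO accuracy (%)');
```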
From the plot, we can see that the accuracy reaches its maximum of 98.00% when the dimensionality is 2. The corresponding confusion matrix is obtained as follows:
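A sketch of how the confusion matrix (both as counts and as row percentages, matching the two plots described below) can be obtained from the leave-one-out predictions:

```matlab
load fisheriris
X = meas;  y = grp2idx(species);
V = ldaVec(X, y);
[~, predicted] = knn1LooAcc(X * V(:, 1:2), y);   % predictions at dimensionality 2
countMat = confusionmat(y, predicted);           % countMat(i,j): class i classified as class j
percentMat = 100 * countMat ./ sum(countMat, 2); % each row sums to 100%
disp(countMat); disp(percentMat);
```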
If we denote the matrices in plots 1 and 2 as A and B, respectively, then A(i,j) is the count of class i being classified as class j, while B(i,j) is the probability of class i being classified as class j, satisfying B(i,1) + B(i,2) + B(i,3) = 100% for all i. If we apply the same procedure to the WINE dataset, the results are shown next:
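The corresponding sweep for WINE, under the same assumption as before that X (178-by-13) and y hold the wine features and labels:

```matlab
V = ldaVec(X, y);
acc = zeros(1, size(X, 2));
for m = 1:size(X, 2)
    acc(m) = knn1LooAcc(X * V(:, 1:m), y);       % keep the first m LDA dimensions
end
figure; plot(1:size(X, 2), 100*acc, 'o-');
xlabel('Number of LDA dimensions'); ylabel('LOO accuracy (%)');
[bestAcc, bestDim] = max(acc)                    % best accuracy and its dimensionality
```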
The accuracy achieves its maximum of 98.31% when the dimensionality is 6. The corresponding confusion matrix is:
If we convert the above example into a function ldaPerfViaKnncLoo.m, then we can test the effect of input (or feature) normalization:
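The toolbox function ldaPerfViaKnncLoo.m itself is not reproduced here; the sketch below only illustrates the normalization step (z-score: zero mean and unit variance per feature) applied before the same LDA + 1-NN LOO sweep, again with X and y assumed to hold the WINE features and labels:

```matlab
Xn = (X - mean(X, 1)) ./ std(X, 0, 1);           % z-score normalization of each feature
V = ldaVec(Xn, y);
acc = zeros(1, size(Xn, 2));
for m = 1:size(Xn, 2)
    acc(m) = knn1LooAcc(Xn * V(:, 1:m), y);
end
[bestAcc, bestDim] = max(acc)
```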
For this dataset, data normalization does improve the accuracy. In particular, when the dimensionality is 6, the accuracy reaches 100%. However, be aware that the above recognition rates are somewhat over-optimistic, since the whole dataset was used to compute the LDA projection. A more objective leave-one-out test should proceed like this:
- Hide one entry and use all the other entries to compute the LDA projection.
- Use that LDA basis to evaluate the hidden entry.
- Repeat the above process until every entry has served as the hidden entry.

To invoke such an objective evaluation, you need to set the mode to 'exact', as follows:
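The toolbox call that sets this mode is not reproduced here; the following is a minimal sketch of what the 'exact' mode computes (with X, y as in the earlier sketches, and the LDA basis recomputed from scratch in every leave-one-out iteration):

```matlab
n = size(X, 1);
m = 2;                                           % number of LDA dimensions to keep
predicted = zeros(n, 1);
for i = 1:n
    others = [1:i-1, i+1:n];                     % hide entry i
    V = ldaVec(X(others,:), y(others));          % LDA basis computed without entry i
    Ztrain = X(others,:) * V(:, 1:m);
    ztest  = X(i,:) * V(:, 1:m);                 % evaluate the hidden entry with that basis
    dist = sum((Ztrain - ztest).^2, 2);
    [~, j] = min(dist);
    predicted(i) = y(others(j));
end
accExact = mean(predicted == y)
```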
Apparently, the 'exact' mode takes longer than the 'approximate' mode. If the data dimensionality is larger than the data count (the number of entries in the dataset), then direct use of LDA is error-prone, since some of the matrices involved are singular and the matrix inverses computed during LDA are unreliable. To avoid such a situation, one possible solution is to apply PCA first for dimensionality reduction, and then use LDA to find the best projection for classification.
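A sketch of this PCA-then-LDA strategy (using the pca function from the Statistics and Machine Learning Toolbox; the number of retained principal components, 10 here, is an arbitrary illustrative choice, and X, y are as in the earlier sketches):

```matlab
[~, score] = pca(X);                             % principal component scores of X
m = min(10, size(score, 2));                     % illustrative choice of PCA dimensionality
Xp = score(:, 1:m);                              % PCA-reduced data (reduces the risk of singular scatter matrices)
V = ldaVec(Xp, y);                               % LDA in the PCA subspace
Z = Xp * V(:, 1:2);                              % final 2-D discriminative projection
figure; gscatter(Z(:,1), Z(:,2), y);
```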
References:
- J. Duchene and S. Leclercq, "An optimal transformation for discriminant and principal component analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, pp. 978-983, 1988.
Appendix:
- How to prove $T=W+B$ in the crisp case: $$ \begin{array}{rcl} T & = & \sum_{i=1}^n (a_i-\mu)(a_i-\mu)^T\\ & = & \sum_{i=1}^n (a_i a_i^T - \mu\mu^T)\\ & = & \sum_{i=1}^n a_i a_i^T - n \mu \mu^T\\ & = & AA^T-n\mu\mu^T\\ \end{array} $$ $$ \begin{array}{rcl} W & = & \sum_{k=1}^c w_k\\ & = & \sum_{k=1}^c (A_k A_k^T - n_k \mu_k \mu_k^T)\\ & = & \sum_{k=1}^c A_k A_k^T - \sum_{k=1}^c n_k \mu_k \mu_k^T\\ & = & AA^T- \sum_{k=1}^c n_k \mu_k \mu_k^T\\ \end{array} $$ $$ \begin{array}{rcl} B & = & \sum_{k=1}^c n_k (\mu_k-\mu)(\mu_k-\mu)^T\\ & = & \sum_{k=1}^c n_k \mu_k \mu_k^T - n\mu\mu^T\\ \end{array} $$ Thus we have $$ T=W+B. $$
- How to prove $T=W+B$ in the fuzzy case:
Define
  - $u_{k,i}$ = degree of point $i$ in class $k$.
  - $\alpha_k = \sum_{i=1}^n u_{k,i}$ = size of class $k$.
Then $$ \mu_k = \frac{\sum_{i=1}^n u_{k,i} a_i}{\sum_{i=1}^n u_{k,i}} = \frac{\sum_{i=1}^n u_{k,i} a_i}{\alpha_k} \;\Rightarrow\; \alpha_k \mu_k = \sum_{i=1}^n u_{k,i}a_i. $$ Moreover, we have the following identities: $$ \left\{ \begin{array}{l} \sum_{k=1}^c u_{k,i} = 1.\\ \sum_{k=1}^c \alpha_k = \sum_{i=1}^n \sum_{k=1}^c u_{k,i} = n.\\ \sum_{k=1}^c \alpha_k \mu_k = \sum_{i=1}^n (\sum_{k=1}^c u_{k,i}) a_i = \sum_{i=1}^n a_i = n \mu.\\ \end{array} \right. $$ Using the above identities, we have $$ \begin{array}{rcl} T & = & \sum_{i=1}^n (a_i-\mu)(a_i-\mu)^T\\ & = & \sum_{i=1}^n (a_i a_i^T - \mu\mu^T)\\ & = & \sum_{i=1}^n a_i a_i^T - n \mu \mu^T\\ & = & AA^T-n\mu\mu^T\\ \end{array} $$ $$ \begin{array}{rcl} W_k & = & \sum_{i=1}^n u_{k,i} (a_i-\mu_k)(a_i-\mu_k)^T\\ & = & \sum_{i=1}^n u_{k,i} a_i a_i^T - (\sum_{i=1}^n u_{k,i} a_i) \mu_k^T - \mu_k (\sum_{i=1}^n u_{k,i} a_i^T) + \sum_{i=1}^n u_{k,i}\mu_k \mu_k^T\\ & = & \sum_{i=1}^n u_{k,i} a_i a_i^T - (\alpha_k \mu_k)\mu_k^T - \mu_k (\alpha_k \mu_k^T) + \alpha_k \mu_k \mu_k^T\\ & = & \sum_{i=1}^n u_{k,i} a_i a_i^T - \alpha_k \mu_k \mu_k^T\\ \end{array} $$ $$ \begin{array}{rcl} W & = & \sum_{k=1}^c W_k\\ & = & \sum_{i=1}^n \sum_{k=1}^c u_{k,i} a_i a_i^T - \sum_{k=1}^c \alpha_k \mu_k \mu_k^T\\ & = & \sum_{i=1}^n a_i a_i^T - \sum_{k=1}^c \alpha_k \mu_k \mu_k^T\\ \end{array} $$ $$ \begin{array}{rcl} B & = & \sum_{k=1}^c \alpha_k (\mu_k-\mu)(\mu_k-\mu)^T\\ & = & \sum_{k=1}^c \alpha_k \mu_k \mu_k^T - (\sum_{k=1}^c \alpha_k \mu_k) \mu^T - \mu (\sum_{k=1}^c \alpha_k \mu_k)^T + (\sum_{k=1}^c \alpha_k) \mu \mu^T\\ & = & \sum_{k=1}^c \alpha_k \mu_k \mu_k^T - (n \mu) \mu^T - \mu (n \mu)^T + n \mu \mu^T\\ & = & \sum_{k=1}^c \alpha_k \mu_k \mu_k^T - n\mu\mu^T\\ \end{array} $$ Thus we have $T=W+B$ in the fuzzy case, where each data point belongs to each class to a certain degree. (A quick numerical check of this identity is sketched after this list.)
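A quick numerical check of $T=W+B$ (an illustrative sketch using random data and random fuzzy memberships; the crisp case corresponds to 0/1 memberships):

```matlab
rng(1);
d = 4;  n = 50;  c = 3;
A = randn(d, n);                                 % data matrix, one column per entry a_i
U = rand(c, n);  U = U ./ sum(U, 1);             % memberships u_{k,i}; each column sums to 1
mu = mean(A, 2);                                 % overall mean
T = (A - mu) * (A - mu)';                        % total scatter
W = zeros(d);  B = zeros(d);
for k = 1:c
    alphak = sum(U(k, :));                       % alpha_k = sum_i u_{k,i}
    muk = (A * U(k, :)') / alphak;               % fuzzy class mean mu_k
    Ak = (A - muk) .* sqrt(U(k, :));             % column i scaled by sqrt(u_{k,i})
    W = W + Ak * Ak';                            % W_k = sum_i u_{k,i}(a_i - mu_k)(a_i - mu_k)^T
    B = B + alphak * (muk - mu) * (muk - mu)';   % alpha_k (mu_k - mu)(mu_k - mu)^T
end
relErr = norm(T - (W + B), 'fro') / norm(T, 'fro')   % should be at machine-precision level
```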