10-2 Feature Selection Methods (嚙磅嚙綞嚙踝蕭嚙踝蕭嚙糊)

[chinese][all]
The goal of feature selection or input selection in pattern recognition is to select the most influential features (inputs) from the original feature set for constructing a classifier that gives better performance. Through the process of feature selection, we can potentially accomplish the following tasks:

The recognition rate is improved.
Less computing power is needed for constructing the classifier.
The selected features can help us understand the causal relationship between features and classes.

The process of feature selection can be formulated mathematically as follows. Given a feature set $V=\{v_1, v_2, \dots, v_d\}$, we can construct a classifier with a recognition rate J(˙) as a function of the selected features. The goal of feature selection is to select a set S in V such that $J(S) \geq J(T)$, where T is all possible subset of V.
In general, there is no efficient way of direct optimum feature selection. Hence we usually rely on some heuristics to overcome the complexity of exhaustive search. Sequential forward selection (SFS for short) proposed by Whitney in 1971 is one of the commonly used heuristic methods for feature selection. It involves the following steps:

Use KNNC as the classifier, and the leave-one-out test for recognition rate estimate.
Select the first feature that has the highest LOO recognition rate among all features.
Select the feature, among all unselected features, together with the selected features that gives the highest recognition rate.
Repeat the previous process until you have selected enough number of features, or until the recognition rate is good enough.

Let us take the Iris dataset for example. There are 150 entries in the Iris dataset with 4 features and 3 categories. The result of SFS on the Iris dataset using the function inputSelectSequential.m is shown in the next example:
Example 1: feaSelBySfs4iris01.m

From the above plot, the selected features are {1}, {2}, {3}, {4}, {3,1}, {3,2}, {3,4}, {3,4,1}, {3,4,2}, {3,4,1,2} during the SFS process. Note that one a feature is put into the selected set, it stays in the set throughout the selection process.
On the other hand, the following example uses exhaustive search for feature selection on the Iris dataset:
Example 2: feaSelByEs4iris01.m

In this example, the selected features are {1}, {2}, {3}, {4}, {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}, {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3, 4}. In general, the exhaustive search needs to perform 2^dim-1 leave-one-out tests in order to find the best features.
In the above two examples, the optimum recognition rate occurs when all features are selected. However, this is not the general case since sometimes we only need to use a subset of features to outperform the use of all features. The following example demonstrates that we only need 3 features to achieve the maximum recognition rate when the number of allowable features is 6.
Example 3: feaSelBySfs4wine01.m

Moreover, we can observe that whenever feature 13 is selected, the recognition rate will drop. This observation reminds us that the range of feature 13 might be very different from those of other features. Indeed, this is already confirmed by the example shown in the previous section. As a result, we need to normalize the features before conducting feature selection, as shown in the next example.
Example 4: feaSelBySfs4wine02.m

The only difference between this example and the previous one is that the features are normalized, which enhances the recognition rate from 93.8% to 97.8%. This demonstrates the importance of input (feature) normalization.
Hint
In fact, by using exhaustive search on all features, we can achieve a recognition rate of 99.4% based on the selected features {1, 3, 4, 9, 10, 11, 12, 13}, after conducting 8191 LOO tests.

However, it should be noted that input (feature) normalization does not guarantee improvement. As a result, it is always a good practice to try both the original and the normalized datasets for a given choice of classifier.
Some of the commonly used methods for feature selection, including sequential forward selection and exhaustive search, are listed here according to their computational complexity:

One-pass ranking
Sequential forward selection
Generalized sequential forward selection
Sequential backward selection
Generalized sequential backward selection
"Add m, remove n" selection
Generalized "add m, remove n" selection
Exhaustive search
Reference:

A. Whitney, "A direct method of nonparametric measurement selection", IEEE Transactions on Computers, vol. 20, pp.1100-1103, 1971.

Data Clustering and Pattern Recognition (資料分群與樣式辨認)