10-2 Feature Selection Methods (?孵噩?詨??寞?)

[chinese][english]

(請注意:中文版本並未隨英文版本同步更新!)

The goal of feature selection or input selection in pattern recognition is to select the most influential features (inputs) from the original feature set for constructing a classifier that gives better performance. Through the process of feature selection, we can potentially accomplish the following tasks:

「特徵選取」(feature selection)的目標是要從原有的特徵中挑選出最佳的部分特徵,使其辨識率能夠達到最高值。這些鑑別能力較好的特徵,不但能夠簡化分類器的計算,而且也可以幫助我們瞭解此分類問題的因果關係。

The process of feature selection can be formulated mathematically as follows. Given a feature set $V=\{v_1, v_2, \dots, v_d\}$, we can construct a classifier with a recognition rate J(˙) as a function of the selected features. The goal of feature selection is to select a set S in V such that $J(S) \geq J(T)$, where T is all possible subset of V.

我們也可以使用數學的方式來說明特徵選取。假設我們有一組特徵 $V=\{v_1, v_2, \dots, v_d\}$,我們希望透過某種分類器來得到辨識率 J(˙),此辨識率是所選取特徵的函數;而特徵選取的目標,就是要從 V 中找出一組鑑別能力最佳的的子集合 S,使得 $J(S) \geq J(T)$,其中 T 是任何由 W 所形成的子集合。

In general, there is no efficient way of direct optimum feature selection. Hence we usually rely on some heuristics to overcome the complexity of exhaustive search. Sequential forward selection (SFS for short) proposed by Whitney in 1971 is one of the commonly used heuristic methods for feature selection. It involves the following steps:

  1. Use KNNC as the classifier, and the leave-one-out test for recognition rate estimate.
  2. Select the first feature that has the highest LOO recognition rate among all features.
  3. Select the feature, among all unselected features, together with the selected features that gives the highest recognition rate.
  4. Repeat the previous process until you have selected enough number of features, or until the recognition rate is good enough.

一般來說,要選出一組鑑別能力最好的子集合,並沒有什麼好方法,除了簡單的 heuristics 外,就是暴力法,但是暴力法的計算量很大,除非資料量不大,否則我們很難使用暴力法。因此最常用到的是近似最佳法(或稱為 heuristic methods),其中 1971 年由 Whitney 所提出的 Sequential forward selection簡稱 SFS)是最常被用到的近似最佳法,步驟如下:

  1. 使用最近鄰居法則(KNNC)和「一次挑一個」(leave-one-out, LOO)辨識率預估法。
  2. 第一個挑選的特徵必定是辨識率最高的特徵。
  3. 下一個挑選的特徵必定是和原本已選取的特徵合併後,辨識率最高的一個。
  4. 重複步驟 3,直至挑選出全部的特徵。

Let us take the Iris dataset for example. There are 150 entries in the Iris dataset with 4 features and 3 categories. The result of SFS on the Iris dataset using the function inputSelectSequential.m is shown in the next example:

我們可以使用上述方法來對 IRIS 資料進行特徵選取。IRIS 資料共有 150 筆,共分成三類,每一筆資料具有四個特徵,使用 inputSelectSequential.m 來進行 SFS,結果請見下列範例:

Example 1: feaSelBySfs4iris01.mDS=prData('iris'); inputNum=size(DS.input, 1); DS.inputName=cellstr(int2str((1:inputNum)')); inputSelectSequential(DS); Construct 10 knnc models, each with up to 4 inputs selected from 4 candidates... Selecting input 1: Model 1/10: selected={1} => Recog. rate = 58.7% Model 2/10: selected={2} => Recog. rate = 48.0% Model 3/10: selected={3} => Recog. rate = 88.0% Model 4/10: selected={4} => Recog. rate = 88.0% Currently selected inputs: 3 Selecting input 2: Model 5/10: selected={3, 1} => Recog. rate = 90.7% Model 6/10: selected={3, 2} => Recog. rate = 90.7% Model 7/10: selected={3, 4} => Recog. rate = 95.3% Currently selected inputs: 3, 4 Selecting input 3: Model 8/10: selected={3, 4, 1} => Recog. rate = 95.3% Model 9/10: selected={3, 4, 2} => Recog. rate = 95.3% Currently selected inputs: 3, 4, 1 Selecting input 4: Model 10/10: selected={3, 4, 1, 2} => Recog. rate = 96.0% Currently selected inputs: 3, 4, 1, 2 Overall maximal recognition rate = 96.0%. Selected 4 inputs (out of 4): 3, 4, 1, 2

From the above plot, the selected features are {1}, {2}, {3}, {4}, {3,1}, {3,2}, {3,4}, {3,4,1}, {3,4,2}, {3,4,1,2} during the SFS process. Note that one a feature is put into the selected set, it stays in the set throughout the selection process.

從上圖可以看出,SFS 選的特徵分別是 {1}, {2}, {3}, {4}, {3,1}, {3,2}, {3,4}, {3,4,1}, {3,4,2}, {3,4,1,2},但最後卻發覺只要使用 {3},辨識率就可以達到 94.7%,然而這並分是最佳辨識率,因為 SFS 本來就是一個 heuristic 的方法,計算速度很快,但是並不保證可以找出最好的特徵。

On the other hand, the following example uses exhaustive search for feature selection on the Iris dataset:

如果我們改用暴力法,將所有可能的特徵組合全部算一遍,在速度上會比較慢,但由於 IRIS 資料的個數並不大,特徵也只有四個,因此我們可以直接使用 inputSelectExhaustive.m 來進行窮舉式的特徵選取,如下:

Example 2: feaSelByEs4iris01.mDS=prData('iris'); inputNum=size(DS.input, 1); DS.inputName=cellstr(int2str((1:inputNum)')); inputSelectExhaustive(DS); Construct 15 KNN models, each with up to 4 inputs selected from 4 candidates... modelIndex 1/15: selected={1} => Recog. rate = 58.666667% modelIndex 2/15: selected={2} => Recog. rate = 48.000000% modelIndex 3/15: selected={3} => Recog. rate = 88.000000% modelIndex 4/15: selected={4} => Recog. rate = 88.000000% modelIndex 5/15: selected={1, 2} => Recog. rate = 70.666667% modelIndex 6/15: selected={1, 3} => Recog. rate = 90.666667% modelIndex 7/15: selected={1, 4} => Recog. rate = 92.666667% modelIndex 8/15: selected={2, 3} => Recog. rate = 90.666667% modelIndex 9/15: selected={2, 4} => Recog. rate = 92.666667% modelIndex 10/15: selected={3, 4} => Recog. rate = 95.333333% modelIndex 11/15: selected={1, 2, 3} => Recog. rate = 93.333333% modelIndex 12/15: selected={1, 2, 4} => Recog. rate = 94.666667% modelIndex 13/15: selected={1, 3, 4} => Recog. rate = 95.333333% modelIndex 14/15: selected={2, 3, 4} => Recog. rate = 95.333333% modelIndex 15/15: selected={1, 2, 3, 4} => Recog. rate = 96.000000% Overall max recognition rate = 96.0%. Selected 4 inputs (out of 4): 1, 2, 3, 4

In this example, the selected features are {1}, {2}, {3}, {4}, {1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}, {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4}, {1, 2, 3, 4}. In general, the exhaustive search needs to perform 2dim-1 leave-one-out tests in order to find the best features.

由於 IRIS 資料量及特徵維度都不大,因此暴力法只需要計算 15(= 24 - 1)個分類器,所得到的最佳辨識率是 96%,所選到的特徵是 {2,4}。

In the above two examples, the optimum recognition rate occurs when all features are selected. However, this is not the general case since sometimes we only need to use a subset of features to outperform the use of all features. The following example demonstrates that we only need 3 features to achieve the maximum recognition rate when the number of allowable features is 6.

Example 3: feaSelBySfs4wine01.mDS=prData('wine'); inputNum=size(DS.input, 1); DS.inputName=cellstr(int2str((1:inputNum)')); inputSelectSequential(DS, 6); Construct 63 knnc models, each with up to 6 inputs selected from 13 candidates... Selecting input 1: Model 1/63: selected={ 1} => Recog. rate = 59.6% Model 2/63: selected={ 2} => Recog. rate = 55.1% Model 3/63: selected={ 3} => Recog. rate = 37.6% Model 4/63: selected={ 4} => Recog. rate = 40.4% Model 5/63: selected={ 5} => Recog. rate = 50.0% Model 6/63: selected={ 6} => Recog. rate = 60.1% Model 7/63: selected={ 7} => Recog. rate = 70.8% Model 8/63: selected={ 8} => Recog. rate = 35.4% Model 9/63: selected={ 9} => Recog. rate = 49.4% Model 10/63: selected={10} => Recog. rate = 64.0% Model 11/63: selected={11} => Recog. rate = 56.7% Model 12/63: selected={12} => Recog. rate = 58.4% Model 13/63: selected={13} => Recog. rate = 66.9% Currently selected inputs: 7 Selecting input 2: Model 14/63: selected={ 7, 1} => Recog. rate = 89.3% Model 15/63: selected={ 7, 2} => Recog. rate = 74.7% Model 16/63: selected={ 7, 3} => Recog. rate = 71.9% Model 17/63: selected={ 7, 4} => Recog. rate = 76.4% Model 18/63: selected={ 7, 5} => Recog. rate = 79.2% Model 19/63: selected={ 7, 6} => Recog. rate = 70.8% Model 20/63: selected={ 7, 8} => Recog. rate = 75.8% Model 21/63: selected={ 7, 9} => Recog. rate = 75.3% Model 22/63: selected={ 7, 10} => Recog. rate = 93.3% Model 23/63: selected={ 7, 11} => Recog. rate = 88.8% Model 24/63: selected={ 7, 12} => Recog. rate = 75.3% Model 25/63: selected={ 7, 13} => Recog. rate = 71.3% Currently selected inputs: 7, 10 Selecting input 3: Model 26/63: selected={ 7, 10, 1} => Recog. rate = 93.8% Model 27/63: selected={ 7, 10, 2} => Recog. rate = 92.7% Model 28/63: selected={ 7, 10, 3} => Recog. rate = 91.6% Model 29/63: selected={ 7, 10, 4} => Recog. rate = 84.8% Model 30/63: selected={ 7, 10, 5} => Recog. rate = 88.2% Model 31/63: selected={ 7, 10, 6} => Recog. rate = 93.8% Model 32/63: selected={ 7, 10, 8} => Recog. rate = 93.8% Model 33/63: selected={ 7, 10, 9} => Recog. rate = 93.8% Model 34/63: selected={ 7, 10, 11} => Recog. rate = 93.8% Model 35/63: selected={ 7, 10, 12} => Recog. rate = 93.3% Model 36/63: selected={ 7, 10, 13} => Recog. rate = 76.4% Currently selected inputs: 7, 10, 1 Selecting input 4: Model 37/63: selected={ 7, 10, 1, 2} => Recog. rate = 92.7% Model 38/63: selected={ 7, 10, 1, 3} => Recog. rate = 92.7% Model 39/63: selected={ 7, 10, 1, 4} => Recog. rate = 90.4% Model 40/63: selected={ 7, 10, 1, 5} => Recog. rate = 92.7% Model 41/63: selected={ 7, 10, 1, 6} => Recog. rate = 92.7% Model 42/63: selected={ 7, 10, 1, 8} => Recog. rate = 93.8% Model 43/63: selected={ 7, 10, 1, 9} => Recog. rate = 93.8% Model 44/63: selected={ 7, 10, 1, 11} => Recog. rate = 93.3% Model 45/63: selected={ 7, 10, 1, 12} => Recog. rate = 92.7% Model 46/63: selected={ 7, 10, 1, 13} => Recog. rate = 77.0% Currently selected inputs: 7, 10, 1, 8 Selecting input 5: Model 47/63: selected={ 7, 10, 1, 8, 2} => Recog. rate = 93.3% Model 48/63: selected={ 7, 10, 1, 8, 3} => Recog. rate = 91.6% Model 49/63: selected={ 7, 10, 1, 8, 4} => Recog. rate = 90.4% Model 50/63: selected={ 7, 10, 1, 8, 5} => Recog. rate = 92.7% Model 51/63: selected={ 7, 10, 1, 8, 6} => Recog. rate = 93.3% Model 52/63: selected={ 7, 10, 1, 8, 9} => Recog. rate = 93.3% Model 53/63: selected={ 7, 10, 1, 8, 11} => Recog. rate = 93.8% Model 54/63: selected={ 7, 10, 1, 8, 12} => Recog. rate = 92.7% Model 55/63: selected={ 7, 10, 1, 8, 13} => Recog. rate = 77.0% Currently selected inputs: 7, 10, 1, 8, 11 Selecting input 6: Model 56/63: selected={ 7, 10, 1, 8, 11, 2} => Recog. rate = 93.3% Model 57/63: selected={ 7, 10, 1, 8, 11, 3} => Recog. rate = 93.8% Model 58/63: selected={ 7, 10, 1, 8, 11, 4} => Recog. rate = 89.9% Model 59/63: selected={ 7, 10, 1, 8, 11, 5} => Recog. rate = 92.7% Model 60/63: selected={ 7, 10, 1, 8, 11, 6} => Recog. rate = 92.7% Model 61/63: selected={ 7, 10, 1, 8, 11, 9} => Recog. rate = 93.3% Model 62/63: selected={ 7, 10, 1, 8, 11, 12} => Recog. rate = 92.7% Model 63/63: selected={ 7, 10, 1, 8, 11, 13} => Recog. rate = 77.0% Currently selected inputs: 7, 10, 1, 8, 11, 3 Overall maximal recognition rate = 93.8%. Selected 3 inputs (out of 13): 7, 10, 1

Moreover, we can observe that whenever feature 13 is selected, the recognition rate will drop. This observation reminds us that the range of feature 13 might be very different from those of other features. Indeed, this is already confirmed by the example shown in the previous section. As a result, we need to normalize the features before conducting feature selection, as shown in the next example.

Example 4: feaSelBySfs4wine02.mDS=prData('wine'); inputNum=size(DS.input, 1); DS.inputName=cellstr(int2str((1:inputNum)')); DS.input=inputNormalize(DS.input); % Feature normalization inputSelectSequential(DS, 6); Construct 63 knnc models, each with up to 6 inputs selected from 13 candidates... Selecting input 1: Model 1/63: selected={ 1} => Recog. rate = 57.9% Model 2/63: selected={ 2} => Recog. rate = 56.2% Model 3/63: selected={ 3} => Recog. rate = 37.6% Model 4/63: selected={ 4} => Recog. rate = 39.3% Model 5/63: selected={ 5} => Recog. rate = 50.0% Model 6/63: selected={ 6} => Recog. rate = 60.1% Model 7/63: selected={ 7} => Recog. rate = 69.7% Model 8/63: selected={ 8} => Recog. rate = 35.4% Model 9/63: selected={ 9} => Recog. rate = 48.9% Model 10/63: selected={10} => Recog. rate = 64.0% Model 11/63: selected={11} => Recog. rate = 56.2% Model 12/63: selected={12} => Recog. rate = 58.4% Model 13/63: selected={13} => Recog. rate = 67.4% Currently selected inputs: 7 Selecting input 2: Model 14/63: selected={ 7, 1} => Recog. rate = 89.3% Model 15/63: selected={ 7, 2} => Recog. rate = 74.2% Model 16/63: selected={ 7, 3} => Recog. rate = 73.6% Model 17/63: selected={ 7, 4} => Recog. rate = 77.5% Model 18/63: selected={ 7, 5} => Recog. rate = 87.6% Model 19/63: selected={ 7, 6} => Recog. rate = 71.3% Model 20/63: selected={ 7, 8} => Recog. rate = 73.0% Model 21/63: selected={ 7, 9} => Recog. rate = 75.8% Model 22/63: selected={ 7, 10} => Recog. rate = 92.7% Model 23/63: selected={ 7, 11} => Recog. rate = 86.5% Model 24/63: selected={ 7, 12} => Recog. rate = 77.5% Model 25/63: selected={ 7, 13} => Recog. rate = 84.8% Currently selected inputs: 7, 10 Selecting input 3: Model 26/63: selected={ 7, 10, 1} => Recog. rate = 94.4% Model 27/63: selected={ 7, 10, 2} => Recog. rate = 93.3% Model 28/63: selected={ 7, 10, 3} => Recog. rate = 93.8% Model 29/63: selected={ 7, 10, 4} => Recog. rate = 89.3% Model 30/63: selected={ 7, 10, 5} => Recog. rate = 93.3% Model 31/63: selected={ 7, 10, 6} => Recog. rate = 92.7% Model 32/63: selected={ 7, 10, 8} => Recog. rate = 92.7% Model 33/63: selected={ 7, 10, 9} => Recog. rate = 91.6% Model 34/63: selected={ 7, 10, 11} => Recog. rate = 92.7% Model 35/63: selected={ 7, 10, 12} => Recog. rate = 92.7% Model 36/63: selected={ 7, 10, 13} => Recog. rate = 96.6% Currently selected inputs: 7, 10, 13 Selecting input 4: Model 37/63: selected={ 7, 10, 13, 1} => Recog. rate = 94.9% Model 38/63: selected={ 7, 10, 13, 2} => Recog. rate = 94.9% Model 39/63: selected={ 7, 10, 13, 3} => Recog. rate = 94.4% Model 40/63: selected={ 7, 10, 13, 4} => Recog. rate = 93.3% Model 41/63: selected={ 7, 10, 13, 5} => Recog. rate = 95.5% Model 42/63: selected={ 7, 10, 13, 6} => Recog. rate = 94.9% Model 43/63: selected={ 7, 10, 13, 8} => Recog. rate = 94.9% Model 44/63: selected={ 7, 10, 13, 9} => Recog. rate = 93.8% Model 45/63: selected={ 7, 10, 13, 11} => Recog. rate = 96.6% Model 46/63: selected={ 7, 10, 13, 12} => Recog. rate = 95.5% Currently selected inputs: 7, 10, 13, 11 Selecting input 5: Model 47/63: selected={ 7, 10, 13, 11, 1} => Recog. rate = 97.2% Model 48/63: selected={ 7, 10, 13, 11, 2} => Recog. rate = 96.6% Model 49/63: selected={ 7, 10, 13, 11, 3} => Recog. rate = 94.4% Model 50/63: selected={ 7, 10, 13, 11, 4} => Recog. rate = 95.5% Model 51/63: selected={ 7, 10, 13, 11, 5} => Recog. rate = 94.9% Model 52/63: selected={ 7, 10, 13, 11, 6} => Recog. rate = 93.8% Model 53/63: selected={ 7, 10, 13, 11, 8} => Recog. rate = 93.3% Model 54/63: selected={ 7, 10, 13, 11, 9} => Recog. rate = 94.9% Model 55/63: selected={ 7, 10, 13, 11, 12} => Recog. rate = 94.9% Currently selected inputs: 7, 10, 13, 11, 1 Selecting input 6: Model 56/63: selected={ 7, 10, 13, 11, 1, 2} => Recog. rate = 97.2% Model 57/63: selected={ 7, 10, 13, 11, 1, 3} => Recog. rate = 97.2% Model 58/63: selected={ 7, 10, 13, 11, 1, 4} => Recog. rate = 96.6% Model 59/63: selected={ 7, 10, 13, 11, 1, 5} => Recog. rate = 97.8% Model 60/63: selected={ 7, 10, 13, 11, 1, 6} => Recog. rate = 96.6% Model 61/63: selected={ 7, 10, 13, 11, 1, 8} => Recog. rate = 97.2% Model 62/63: selected={ 7, 10, 13, 11, 1, 9} => Recog. rate = 96.6% Model 63/63: selected={ 7, 10, 13, 11, 1, 12} => Recog. rate = 96.6% Currently selected inputs: 7, 10, 13, 11, 1, 5 Overall maximal recognition rate = 97.8%. Selected 6 inputs (out of 13): 7, 10, 13, 11, 1, 5

The only difference between this example and the previous one is that the features are normalized, which enhances the recognition rate from 93.8% to 97.8%. This demonstrates the importance of input (feature) normalization.

Hint
In fact, by using exhaustive search on all features, we can achieve a recognition rate of 99.4% based on the selected features {1, 3, 4, 9, 10, 11, 12, 13}, after conducting 8191 LOO tests.

However, it should be noted that input (feature) normalization does not guarantee improvement. As a result, it is always a good practice to try both the original and the normalized datasets for a given choice of classifier.

Some of the commonly used methods for feature selection, including sequential forward selection and exhaustive search, are listed here according to their computational complexity:

  1. One-pass ranking
  2. Sequential forward selection
  3. Generalized sequential forward selection
  4. Sequential backward selection
  5. Generalized sequential backward selection
  6. "Add m, remove n" selection
  7. Generalized "add m, remove n" selection
  8. Exhaustive search
Reference:
  1. A. Whitney, "A direct method of nonparametric measurement selection", IEEE Transactions on Computers, vol. 20, pp.1100-1103, 1971.

Data Clustering and Pattern Recognition (資料分群與樣式辨認)