5-2 K-nearest-neighbor Classifiers



The k-nearest-neighbor classifier (KNNC for short) is one of the most basic classifiers for pattern recognition and data classification; the special case with k = 1 is also known as the nearest-neighbor classifier (NNC). The method is based on the intuitive idea that "birds of a feather flock together": data instances of the same class should lie close to one another in the feature space. Consequently, for a data point x of unknown class, we can simply compute the distances between x and all points in the training data, and assign to x the class determined by its k nearest neighbors. Due to its simplicity, KNNC is often used as a baseline for comparison with more sophisticated approaches in pattern recognition.

We shall now define KNNC in a more rigorous manner. Suppose that we are given a training dataset of n points together with their desired classes, as shown below:

$$\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_n, y_n)\}$$

where (xi, yi) denotes data pair i, with xi as the feature vector and yi as the corresponding target class. For a new data point x, the most likely class determined by KNNC with k = 1 (that is, NNC) is

$$nnc(\mathbf{x}, 1) = y_p, \quad p = \arg \min_i \| \mathbf{x} - \mathbf{x}_i \|^2.$$

The preceding equation uses only the single nearest neighbor to determine the class. Alternatively, we can let the k nearest neighbors determine the class by voting. Since this extension is straightforward, we shall not formulate it separately.
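To make the formula concrete, the following is a minimal sketch of KNNC voting in plain MATLAB. It is only an illustration and does not use the toolbox functions of the later examples; the variable names (trainX, trainY, x, k) are hypothetical and assume that feature vectors are stored column-wise, as elsewhere in this chapter.

% Minimal sketch of KNNC voting in plain MATLAB (illustrative, not the toolbox code).
% trainX : d-by-n matrix whose columns are the training feature vectors x_i
% trainY : 1-by-n vector of the corresponding class labels y_i
% x      : d-by-1 feature vector of the test point
% k      : number of nearest neighbors to use
dist = sum((trainX - repmat(x, 1, size(trainX,2))).^2, 1);  % squared distances ||x - x_i||^2
[~, nearestIdx] = sort(dist, 'ascend');                     % indices sorted by distance
votes = trainY(nearestIdx(1:k));                            % labels of the k nearest neighbors
predictedClass = mode(votes);                               % majority vote (k = 1 reduces to NNC)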

In the next example, we apply KNNC with k = 1 to the IRIS dataset, where the odd-indexed and even-indexed entries are used as the training set (also called the design set) and the test set, respectively:

Example 1: knnc01.m

% Get the data set
[trainSet, testSet]=prData('iris');
trainNum=size(trainSet.input, 2);
testNum =size(testSet.input, 2);
fprintf('Use of KNNC for Iris data:\n');
fprintf('\tSize of design set (odd-indexed data)= %d\n', trainNum);
fprintf('\tSize of test set (even-indexed data) = %d\n', testNum);
trainSet.k=1;
computed=knncEval(testSet, trainSet);
correctCount=sum(testSet.output==computed);
recogRate=correctCount/testNum;
fprintf('Recognition rate = %g%%\n', recogRate*100);

Output:
Use of KNNC for Iris data:
	Size of design set (odd-indexed data)= 75
	Size of test set (even-indexed data) = 75
Recognition rate = 96%

From the above example, the final recognition rate of KNNC for the IRIS dataset is 96%.

In a two-dimensional feature space, the decision boundaries formed by KNNC with k = 1 can be derived from the Voronoi diagram, as follows:

  1. Create the Delaunay triangulation of the training data points.
  2. Find the perpendicular bisector of each triangle edge. These perpendicular bisectors partition the space into small cells, forming the Voronoi diagram.
  3. Determine the class of each cell according to the class of the data point it contains.
  4. The boundaries between cells of different classes form the decision boundaries of 1NNC.
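Although the toolbox handles this construction internally, the first two steps can be visualized directly with MATLAB's built-in delaunay and voronoi functions. The following sketch uses randomly generated 2-D points purely for illustration:

% Illustrative sketch: Delaunay triangulation and Voronoi diagram of random 2-D points.
x = rand(20, 1); y = rand(20, 1);        % 20 random points (stand-ins for training data)
tri = delaunay(x, y);                    % step 1: Delaunay triangles
subplot(1, 2, 1); triplot(tri, x, y); axis image; title('Delaunay triangulation');
subplot(1, 2, 2); voronoi(x, y);      axis image; title('Voronoi diagram');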
The following example demonstrates the decision boundary formed by 1NNC:

Example 2: knncPlot02.m

DS=prData('3classes');
knncPlot(DS, [], 'decBoundary');

We can also compute the distance to each class over a given input domain, and use these distances to create a posterior-like surface for each class:

Example 3: knncPlot01.m

DS=prData('3classes');
knncPrm=knncTrain(DS);
knncPrm.k=1;
knncPlot(DS, knncPrm, '2dPosterior');
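Conceptually, such a surface can be obtained by evaluating, over a grid of the input domain, the distance from each grid point to the nearest training point of a given class, and then converting the distance into a score. The following sketch illustrates this idea for a single class; the conversion exp(-d^2) is only an example and is not necessarily the formula used inside knncPlot:

% Illustrative sketch of a posterior-like surface for one class (2-D features assumed).
DS = prData('3classes');                                   % same dataset as above
x1 = linspace(min(DS.input(1,:)), max(DS.input(1,:)), 100);
x2 = linspace(min(DS.input(2,:)), max(DS.input(2,:)), 100);
[xx, yy] = meshgrid(x1, x2);
gridPts = [xx(:)'; yy(:)'];
classData = DS.input(:, DS.output==DS.output(1));          % training points of one class
d2 = zeros(1, size(gridPts, 2));
for i = 1:size(gridPts, 2)
  diffs = classData - repmat(gridPts(:, i), 1, size(classData, 2));
  d2(i) = min(sum(diffs.^2, 1));                           % squared distance to the nearest point of this class
end
mesh(xx, yy, reshape(exp(-d2), size(xx)));                 % distance converted to a posterior-like score
xlabel('Input 1'); ylabel('Input 2');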

Hint
If you want to try an interactive demo of the Voronoi diagram, change directory to the 'cleve' directory under the example directory of this book and type "vshow" in MATLAB. The demo was written by the renowned mathematician Cleve Moler, to whom the author expresses his gratitude.

In the first example, we computed an overall recognition rate of 96% without a breakdown of the performance of each class. To examine the per-class performance, we can display the result as a confusion matrix, as shown in the next example:

Example 4: knnc02.m

% Get the data set
[trainSet, testSet]=prData('iris');
trainNum=size(trainSet.input, 2);
testNum =size(testSet.input, 2);
trainSet.k=1;
computed=knncEval(testSet, trainSet);
confMat=confMatGet(testSet.output, computed);
opt=confMatPlot('defaultOpt');
opt.className=trainSet.outputName;
confMatPlot(confMat, opt);

In the above example, the left plot is the confusion matrix C, where C(i, j) is the number of data points of class i that are classified as class j; the right plot shows the corresponding recognition rate of each class. From the confusion matrix, the recognition rates for classes 1, 2, and 3 are 100%, 96%, and 92%, respectively, with an overall recognition rate of 96%. Moreover, the confusion matrix shows that classes 2 and 3 are the ones most easily confused: all three misclassifications occur between them, with one case of 2 => 3 and two cases of 3 => 2. (Here "i => j" denotes a data point of class i being classified as class j.)
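For reference, the confusion matrix itself is straightforward to compute without confMatGet. A minimal sketch, assuming (as in the IRIS example above) that the true labels in testSet.output and the predicted labels in computed are integers 1, 2, ..., classNum:

% Illustrative sketch: building the confusion matrix by hand.
classNum = length(unique(testSet.output));
C = zeros(classNum, classNum);
for i = 1:length(computed)
  C(testSet.output(i), computed(i)) = C(testSet.output(i), computed(i)) + 1;   % class i classified as class j
end
recogRatePerClass = diag(C)' ./ sum(C, 2)';   % per-class recognition rates (diagonal over row sums)
overallRecogRate  = sum(diag(C)) / sum(C(:)); % overall recognition rate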

If the training data of different classes is well separated, KNNC can usually achieve a satisfactory recognition rate. However, the performance of KNNC, in terms of both computing time and recognition rate, degrades in the following two situations:

  1. The training dataset is large, so computing the distances from a test point to all training points becomes time-consuming.
  2. The training data is noisy, so noisy points or outliers near the class boundaries can mislead the nearest-neighbor decision.

A simple way to alleviate the above problems is to employ k-means clustering (or a similar method) to select representative points of each class for KNNC. This can effectively reduce the computation of KNNC, with the side benefit of suppressing the effect of noisy training data. Using k-means to assist classification involves two steps (see the sketch after the list):

  1. For each class, use k-means (or a similar clustering method) to select a number of representative points; this number should be much smaller than the number of original data points in the class.
  2. Use only these representative points as the training data for KNNC.
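As a standalone sketch of these two steps, the following code uses MATLAB's built-in kmeans function (from the Statistics and Machine Learning Toolbox) in place of the toolkit's kMeansClustering; the variables trainX, trainY, x, and protoPerClass are hypothetical placeholders, with features stored column-wise as before.

% Illustrative sketch of KNNC over k-means prototypes (assumes the built-in kmeans).
% trainX : d-by-n training features, trainY : 1-by-n labels, protoPerClass : e.g. 4
classLabels = unique(trainY);
protoX = []; protoY = [];
for c = classLabels(:)'                                    % step 1: prototypes per class
  thisClass = trainX(:, trainY==c);
  [~, centers] = kmeans(thisClass', protoPerClass);        % kmeans expects rows as observations
  protoX = [protoX, centers'];
  protoY = [protoY, c*ones(1, protoPerClass)];
end
% Step 2: classify a test point x by its nearest prototype (1-NN over the prototypes)
dist = sum((protoX - repmat(x, 1, size(protoX, 2))).^2, 1);
[~, p] = min(dist);
predictedClass = protoY(p);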

In the following example, we apply this procedure to the IRIS dataset: k-means reduces the training set of the previous example from 25 points to 4 representative points per class (from 75 to 12 in total), after which the recognition rate is computed, as shown next:

Example 5: knncViaKmeans01.m

% ====== Get the data set
[trainSet, testSet]=prData('iris');	% Get the data set
fprintf('Use of KNNC for Iris data:\n');
fprintf('\tSize of design set (odd-indexed data)= %d\n', size(trainSet.input, 2));
fprintf('\tSize of test set (even-indexed data) = %d\n', size(testSet.input, 2));
uniqueClass=unique(trainSet.output);
centerNum4eachClass=4;
% ====== Perform k-means
trainSet2=trainSet;	% trainSet2 is the reduced design set after k-means
trainSet2.input=[];
trainSet2.output=[];
for i=1:length(uniqueClass)	% Compute trainSet2 using k-means
	index=find(trainSet.output==uniqueClass(i));
	thisData=trainSet.input(:, index);
	center = kMeansClustering(thisData, centerNum4eachClass);
	trainSet2.input=[trainSet2.input, center];
	trainSet2.output=[trainSet2.output, uniqueClass(i)*ones(1, centerNum4eachClass)];
end
fprintf('After k-means clustering, size of design set = %d\n', size(trainSet2.input, 2));
% ====== Compute the recognition rate
trainSet2.k=1;
computed=knncEval(testSet, trainSet2);
correctCount=sum(testSet.output==computed);
recogRate=correctCount/length(testSet.output);
fprintf('Recognition rate = %g%%\n', recogRate*100);
% ====== Plot the confusion matrix
confMat=confMatGet(testSet.output, computed);
opt=confMatPlot('defaultOpt');
opt.className=trainSet.outputName;
confMatPlot(confMat, opt);

Output:
Use of KNNC for Iris data:
	Size of design set (odd-indexed data)= 75
	Size of test set (even-indexed data) = 75
After k-means clustering, size of design set = 12
Recognition rate = 97.3333%

Although the training data was reduced from 25 points to 4 points per class, the overall recognition rate increased from 96% to 97.33% (one fewer misclassification than before). This demonstrates that data reduction via k-means not only reduces the data size and computation, but can also increase the recognition rate. In general, however, we do not know the best number of representative points per class in advance; one way to identify it is exhaustive search, which is left as an exercise.

Since using k-means clustering to generate prototypes for each class is a common practice in KNNC, this mechanism has been put into the function knncTrain.m, as shown in the following example:

Example 6: knncPlot04.m

[trainSet, testSet]=prData('3classes');
knncTrainPrm.method='kMeans';
knncTrainPrm.centerNum4eachClass=4;
knncPrm=knncTrain(trainSet, knncTrainPrm);
knncPrm.k=1;
computed=knncEval(testSet, knncPrm);
testSet.hitIndex=find(computed==testSet.output);
knncPlot(testSet, knncPrm, 'decBoundary');

It should be noted that the k-means-based data reduction is performed on each class separately, so it takes only the intra-class data distribution into consideration. If we also want to take the inter-class data distribution into account for better performance, we can apply LVQ (learning vector quantization) after k-means. LVQ is covered in the next section.

If there is considerable overlap between classes in the dataset, it is usually more robust to use KNNC with k greater than 1, where the final decision is based on a vote among the k nearest neighbors; since several points in the neighborhood join the decision process, the result is less sensitive to noise. This is the k-nearest-neighbor rule that gives KNNC its name. In the following example, we apply KNNC to the IRIS dataset and let k vary from 1 to 15 to observe the corresponding recognition rates:

Example 7: knncRecogVsK01.m

% Get the data set
[trainSet, testSet]=prData('iris');
designNum=size(trainSet.input, 2);
testNum =size(testSet.input, 2);
fprintf('Use of KNNC for Iris data:\n');
fprintf('\tSize of design set (odd-indexed data)= %d\n', designNum);
fprintf('\tSize of test set (even-indexed data) = %d\n', testNum);
fprintf('\tRecognition rates as K varies:\n');
kMax=15;
for k=1:kMax
	trainSet.k=k;
	computed=knncEval(testSet, trainSet);
	correctCount=sum(testSet.output==computed);
	recog(k)=correctCount/testNum;
	fprintf('\t%d-NNC ===> RR = 1-%d/%d = %.2f%%.\n', k, testNum-correctCount, testNum, recog(k)*100);
end
plot(1:kMax, recog*100, 'b-o'); grid on;
title('Recognition rates of Iris data using K-NNR');
xlabel('K'); ylabel('Recognition rates (%)');

Output:
Use of KNNC for Iris data:
	Size of design set (odd-indexed data)= 75
	Size of test set (even-indexed data) = 75
	Recognition rates as K varies:
	1-NNC ===> RR = 1-3/75 = 96.00%.
	2-NNC ===> RR = 1-4/75 = 94.67%.
	3-NNC ===> RR = 1-3/75 = 96.00%.
	4-NNC ===> RR = 1-3/75 = 96.00%.
	5-NNC ===> RR = 1-2/75 = 97.33%.
	6-NNC ===> RR = 1-2/75 = 97.33%.
	7-NNC ===> RR = 1-2/75 = 97.33%.
	8-NNC ===> RR = 1-1/75 = 98.67%.
	9-NNC ===> RR = 1-4/75 = 94.67%.
	10-NNC ===> RR = 1-2/75 = 97.33%.
	11-NNC ===> RR = 1-2/75 = 97.33%.
	12-NNC ===> RR = 1-3/75 = 96.00%.
	13-NNC ===> RR = 1-3/75 = 96.00%.
	14-NNC ===> RR = 1-2/75 = 97.33%.
	15-NNC ===> RR = 1-2/75 = 97.33%.

It can be observed that a value of k that is either too small or too large does not yield the best performance. In this case, the best value is k = 8, which achieves an overall recognition rate of 98.67%. The best value of k, however, depends entirely on the application and the data; usually we rely on exhaustive search, as in this example, to determine it.

If the voting among the k nearest neighbors results in a tie, a tie-breaking strategy is needed; two approaches are commonly used.
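One commonly used rule, shown here only as an illustrative sketch and not necessarily one of the two strategies referred to above, is to shrink k until the vote has a unique winner. The sketch assumes a hypothetical vector nearestLabels holding the labels of the k nearest neighbors, ordered from nearest to farthest:

% Illustrative tie-breaking sketch: drop the farthest neighbor until the vote is tie-free.
kk = length(nearestLabels);
while kk > 1
  labels = unique(nearestLabels(1:kk));
  counts = zeros(size(labels));
  for j = 1:length(labels)
    counts(j) = sum(nearestLabels(1:kk)==labels(j));        % votes received by each class
  end
  if sum(counts==max(counts)) == 1                          % a unique winning class exists
    break;
  end
  kk = kk - 1;                                              % otherwise drop the farthest neighbor
end
predictedClass = mode(nearestLabels(1:kk));                 % tie-free vote (kk = 1 falls back to NNC)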

