In this homework, you are going to develop several MATLAB programs for implementing the k nearest neighbor rule (KNNR). You can start with the MATLAB script go.m that I wrote for this homework:
    close all                          % Close all figure windows
    clear all                          % Clear all variables in memory
    fprintf('Loading "abalone.dat"...\n');
    load abalone.dat                   % Load the data set
    feature_n = size(abalone, 2)-1;    % no. of features
    instance_n = size(abalone, 1);     % no. of instances
    feature = abalone(:, 1:feature_n); % feature matrix
    output = abalone(:, feature_n+1);  % output matrix
    [a, b] = countele(output);
    class_n = length(a);               % No. of classes
    fprintf('%g features\n', feature_n);
    fprintf('%g instances\n', instance_n);
    fprintf('%g classes\n', class_n);

    % Plot age distribution
    bar(a, b);
    xlabel('Age'); ylabel('Counts');
    title('Age Distribution for the Abalone Data Set');
    fprintf('Class 1: %g instances are younger than 10 years\n',...
        length(find(output<10)));
    fprintf('Class 2: %g instances are equal to or older than 10 years\n',...
        length(find(output>=10)));

    % Modify the data set such that instances younger than 10 years fall
    % into class 1; all the others into class 2
    index1 = find(output<10);
    output(index1) = 1*ones(size(index1));
    index2 = find(output>=10);
    output(index2) = 2*ones(size(index2));

    % Data normalization to have zero mean and unity variance r.v.
    new_feature = normal(feature);
    data = [new_feature output];

    % Partition the data sets for hold-out tests
    index1 = 1:2:instance_n;
    index2 = 2:2:instance_n;
    data1 = data(index1, :);
    data2 = data(index2, :);

    k = 3;      % for 3 nearest neighbor
    %tic
    %label = knnr(data1, data2, k);
    %toc

    % hold-out test 1
    desired_label = data2(:, feature_n+1);
    label = zeros(size(desired_label));
    tic
    for i = 1:size(data2, 1),
        if rem(i, 100)==0, fprintf('%g/%g\n', i, size(data2,1)); end
        label(i) = knnr(data1, data2(i, :), k);
    end
    toc
    right_count = sum(label==desired_label);
    recog_rate = right_count/length(desired_label);
    fprintf('Recognition rate = %g/%g = %g\n', ...
        right_count, size(data2, 1), recog_rate);

    % Swap data sets
    temp = data1; data1 = data2; data2 = temp;

    % hold-out test 2
    desired_label = data2(:, feature_n+1);
    label = zeros(size(desired_label));
    tic
    for i = 1:size(data2, 1),
        if rem(i, 100)==0, fprintf('%g/%g\n', i, size(data2,1)); end
        label(i) = knnr(data1, data2(i, :), k);
    end
    toc
    right_count = sum(label==desired_label);
    recog_rate = right_count/length(desired_label);
    fprintf('Recognition rate = %g/%g = %g\n', ...
        right_count, size(data2, 1), recog_rate);
The preceding script uses the k nearest neighbor rule (KNNR) to classify the abalone data set into two categories:

    Class 1: abalones younger than 10 years
    Class 2: abalones equal to or older than 10 years
As you can see, I have added many comments (those starting with %) to the script to make it self-explanatory. Before you proceed further, you should make sure that you understand every statement in the script. (If you have any questions, feel free to ask me directly.)
To run the script, you first need to download the following files:

    go.m (the main script shown above)
    abalone.dat (the abalone data set)
    countele.m (a utility that returns the distinct elements of a vector together with their counts)
You should understand what these files do before going further. Next, you need to develop the following M-files to make go.m execute correctly:

    knnr.m (the k nearest neighbor classifier)
    normal.m (data normalization to zero mean and unit variance)
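To fix ideas, here is a minimal sketch of what knnr.m might look like. The file name and calling convention come from go.m; the variable names, the squared-Euclidean distance, the reuse of countele for the majority vote, and the tie-breaking behavior (ties go to the first label returned by countele) are my own assumptions, so treat this only as a starting point, not as the required implementation:

    function class = knnr(train, input, k)
    % KNNR  k nearest neighbor classifier (sketch).
    %   train: training data, one instance per row, last column = class label
    %   input: a single test instance (its label column, if present, is ignored)
    %   k:     number of neighbors to consult
    feature_n = size(train, 2) - 1;
    % Squared Euclidean distance from the test instance to every training instance
    diff = train(:, 1:feature_n) - ones(size(train, 1), 1)*input(1:feature_n);
    dist = sum((diff.^2)')';           % row-wise sum of squares
    [junk, index] = sort(dist);        % ascending distances
    neighbor_label = train(index(1:k), feature_n+1);
    % Majority vote among the k nearest neighbors
    % (assumes countele returns distinct labels and their counts)
    [label, count] = countele(neighbor_label);
    [junk, winner] = max(count);
    class = label(winner);

Note that the sketch computes all distances with matrix operations rather than looping over training instances; this is what keeps the per-query cost acceptable on a data set of this size.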
Other commands that you might find useful in your program are size, mean, cov, diag, sort, reshape, max, zeros, and ones. To learn the usage of a function xxx, just type "help xxx" within MATLAB to view the on-line help. Also, to reduce execution time, you should use vectorized operations whenever possible.
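As an illustration of vectorization, the normalization step can be written entirely with matrix operations, with no loop over instances or features. The file name normal.m comes from go.m; the body below is only a sketch under the assumption that "zero mean and unity variance" is meant column by column:

    function new_x = normal(x)
    % NORMAL  Normalize each column of x to zero mean and unit variance (sketch).
    n = size(x, 1);
    mu = mean(x);                      % row vector of column means
    sigma = std(x);                    % row vector of column standard deviations
    % Replicate mu and sigma down the rows, then shift and scale in one shot
    new_x = (x - ones(n, 1)*mu) ./ (ones(n, 1)*sigma);

The same replication trick (ones(n,1)*row_vector) appears in the knnr computation above; it is the standard way to avoid explicit loops in older versions of MATLAB.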
If you have implemented these two functions correctly, you should be able to execute the script go.m by typing "go" under MATLAB. After about 10 minutes (on my Pentium-200 with 64 MB of RAM), you should see recognition rates of 0.750479 and 0.754907 for the two hold-out tests, respectively.
Please turn in a floppy disk containing these two files; our TA will run go.m (under MATLAB version 4.2) to verify that your work is correct.