In this homework, you are going to develop several MATLAB programs for implementing the k nearest neighbor rule (KNNR). You can start with the MATLAB script go.m that I wrote for this homework:
    close all                          % Close all figure windows
    clear all                          % Clear all variables in memory
    fprintf('Loading "abalone.dat"...\n');
    load abalone.dat                   % Load the data set
    feature_n = size(abalone, 2)-1;    % no. of features
    instance_n = size(abalone, 1);     % no. of instances
    feature = abalone(:, 1:feature_n); % feature matrix
    output = abalone(:, feature_n+1);  % output matrix
    [a, b] = countele(output);
    class_n = length(a);               % No. of classes
    fprintf('%g features\n', feature_n);
    fprintf('%g instances\n', instance_n);
    fprintf('%g classes\n', class_n);

    % Plot age distribution
    bar(a, b);
    xlabel('Age'); ylabel('Counts');
    title('Age Distribution for the Abalone Data Set');
    fprintf('Class 1: %g instances are younger than 10 years\n',...
        length(find(output<10)));
    fprintf('Class 2: %g instances are equal to or older than 10 years\n',...
        length(find(output>=10)));

    % Modify the data set such that instances younger than 10 years fall
    % into class 1; all the others into class 2
    index1 = find(output<10);
    output(index1) = 1*ones(size(index1));
    index2 = find(output>=10);
    output(index2) = 2*ones(size(index2));

    % Data normalization to have zero mean and unity variance r.v.
    new_feature = normal(feature);
    data = [new_feature output];

    % Partition the data sets for hold-out tests
    index1 = 1:2:instance_n;
    index2 = 2:2:instance_n;
    data1 = data(index1, :);
    data2 = data(index2, :);

    k = 3;      % for 3 nearest neighbor
    %tic
    %label = knnr(data1, data2, k);
    %toc

    % hold-out test 1
    desired_label = data2(:, feature_n+1);
    label = zeros(size(desired_label));
    tic
    for i = 1:size(data2, 1),
        if rem(i, 100)==0, fprintf('%g/%g\n', i, size(data2,1)); end
        label(i) = knnr(data1, data2(i, :), k);
    end
    toc
    right_count = sum(label==desired_label);
    recog_rate = right_count/length(desired_label);
    fprintf('Recognition rate = %g/%g = %g\n', ...
        right_count, size(data2, 1), recog_rate);

    % Swap data sets
    temp = data1; data1 = data2; data2 = temp;

    % hold-out test 2
    desired_label = data2(:, feature_n+1);
    label = zeros(size(desired_label));
    tic
    for i = 1:size(data2, 1),
        if rem(i, 100)==0, fprintf('%g/%g\n', i, size(data2,1)); end
        label(i) = knnr(data1, data2(i, :), k);
    end
    toc
    right_count = sum(label==desired_label);
    recog_rate = right_count/length(desired_label);
    fprintf('Recognition rate = %g/%g = %g\n', ...
        right_count, size(data2, 1), recog_rate);
The preceding script uses the k nearest neighbor rule (KNNR) to classify the abalone data set into two categories:

    Class 1: abalones younger than 10 years
    Class 2: abalones equal to or older than 10 years
As you can see, I have added many comments (those starting with %) to the script to make it self-explanatory. Before you proceed further, you should make sure that you understand every statement in the script. (If you have any questions, feel free to ask me directly.)
To run the script, you first need to download the following files:

    go.m (the main script shown above)
    abalone.dat (the abalone data set)
    countele.m (a utility that returns the distinct elements of a vector together with their counts)
You should understand what these files do before going further. Next, you need to develop the following M-files to make go.m execute correctly:

    knnr.m (the k nearest neighbor classifier)
    normal.m (data normalization to zero mean and unit variance)
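To fix ideas, here is a minimal sketch of what knnr.m might look like. The file name and calling convention come from go.m; the variable names, the squared-Euclidean distance, the reuse of countele for the majority vote, and the tie-breaking behavior (ties go to the first label returned by countele) are my own assumptions, so treat this only as a starting point, not as the required implementation:

    function class = knnr(train, input, k)
    % KNNR  k nearest neighbor classifier (sketch).
    %   train: training data, one instance per row, last column = class label
    %   input: a single test instance (its label column, if present, is ignored)
    %   k:     number of neighbors to consult
    feature_n = size(train, 2) - 1;
    % Squared Euclidean distance from the test instance to every training instance
    diff = train(:, 1:feature_n) - ones(size(train, 1), 1)*input(1:feature_n);
    dist = sum((diff.^2)')';           % row-wise sum of squares
    [junk, index] = sort(dist);        % ascending distances
    neighbor_label = train(index(1:k), feature_n+1);
    % Majority vote among the k nearest neighbors
    % (assumes countele returns distinct labels and their counts)
    [label, count] = countele(neighbor_label);
    [junk, winner] = max(count);
    class = label(winner);

Note that the sketch computes all distances with matrix operations rather than looping over training instances; this is what keeps the per-query cost acceptable on a data set of this size.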
Other commands that you might find useful in your program are size, mean, cov, diag, sort, reshape, max, zeros, and ones. To learn the usage of a function xxx, just type "help xxx" within MATLAB to view the on-line help. Also, to reduce execution time, you should use vectorized operations whenever possible.
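As an illustration of vectorization, the normalization step can be written entirely with matrix operations, with no loop over instances or features. The file name normal.m comes from go.m; the body below is only a sketch under the assumption that "zero mean and unity variance" is meant column by column:

    function new_x = normal(x)
    % NORMAL  Normalize each column of x to zero mean and unit variance (sketch).
    n = size(x, 1);
    mu = mean(x);                      % row vector of column means
    sigma = std(x);                    % row vector of column standard deviations
    % Replicate mu and sigma down the rows, then shift and scale in one shot
    new_x = (x - ones(n, 1)*mu) ./ (ones(n, 1)*sigma);

The same replication trick (ones(n,1)*row_vector) appears in the knnr computation above; it is the standard way to avoid explicit loops in older versions of MATLAB.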
If you have implemented these two functions correctly, you should be able to execute the script go.m by typing "go" under MATLAB. After about 10 minutes (on my Pentium-200 with 64 MB of RAM), you should see recognition rates of 0.750479 and 0.754907 for the two hold-out tests, respectively.
Please turn in a floppy disk containing these two files; our TA will run go.m (under MATLAB version 4.2) to verify that your work is correct.