%% Tutorial on leaf recognition
% This tutorial covers the basics of leaf recognition based on shape and color statistics.
% The dataset is available at .
%% Preprocessing
% Before we start, let's add the necessary toolboxes to the MATLAB search path:
addpath d:/users/jang/matlab/toolbox/utility
addpath d:/users/jang/matlab/toolbox/machineLearning
%%
% All the above toolboxes can be downloaded from the author's .
% Make sure you are using the latest toolboxes to work with this script.
%%
% For compatibility, here we list the platform and MATLAB version used to run this script:
fprintf('Platform: %s\n', computer);
fprintf('MATLAB version: %s\n', version);
fprintf('Date & time: %s\n', char(datetime));
scriptStartTime=tic;
%% Dataset construction
% First of all, we collect all the image data from the image directory. Note that:
%
% * The images have been reorganized for easy parsing (with a subfolder for each class), and can be downloaded from <../leafSorted.rar here>.
% * For simplicity, we use only 5 classes instead of the original 32 classes.
% * During data collection, we also plot the leaves of each class.
imDir='D:\users\jang\books\dcpr\appNote\leafId\leafSorted';
opt=mmDataCollect('defaultOpt');
opt.extName='jpg';
opt.maxClassNum=5;
imageData=mmDataCollect(imDir, opt, 1);
%% Feature extraction
% For each image, we need to extract the corresponding features for classification.
% We shall use the function "leafFeaExtract" for feature extraction.
% We also need to put the whole dataset into a format that is easier for further
% processing, including classifier construction and evaluation.
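%%
% Before running the actual pipeline, it may help to see what such a feature
% extractor does conceptually. The following is only a rough sketch of a
% shape-plus-color extractor in the same spirit (Otsu thresholding, keeping the
% largest region, then region properties and color statistics); the function
% name "myLeafFea" and the exact feature set are assumptions for illustration,
% not the toolbox's "leafFeaExtract":

```matlab
function fea = myLeafFea(im)
% myLeafFea: hypothetical sketch of shape+color feature extraction.
% NOT the toolbox's leafFeaExtract; names and features are assumptions.
gray = rgb2gray(im);
bw = im2bw(gray, graythresh(gray));   % Binarize via Otsu's method
bw = imcomplement(bw);                % Assume the leaf is darker than the background
lbl = bwlabel(bw);                    % Label connected regions
stats = regionprops(lbl, 'Area', 'Eccentricity', 'Solidity', 'Extent');
[~, idx] = max([stats.Area]);         % Keep only the region with the maximum area
mask = (lbl == idx);
s = stats(idx);
r = double(im(:,:,1)); g = double(im(:,:,2)); b = double(im(:,:,3));
fea = [s.Eccentricity; s.Solidity; s.Extent; ...   % Shape statistics
       mean(r(mask)); std(r(mask)); ...            % Color statistics over the leaf region
       mean(g(mask)); std(g(mask)); ...
       mean(b(mask)); std(b(mask))];
```

% The real extractor used below may differ in both the segmentation details and
% the chosen features; type "leafFeaExtract" (as shown later) to see its self-demo.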
opt=dsCreateFromMm('defaultOpt');
if exist('ds.mat', 'file')
	fprintf('Loading ds.mat...\n');
	load ds.mat
else
	myTic=tic;
	opt=dsCreateFromMm('defaultOpt');
	opt.imFeaFcn=@leafFeaExtract;	% Function for feature extraction
	opt.imFeaOpt=feval(opt.imFeaFcn, 'defaultOpt');	% Feature options
	ds=dsCreateFromMm(imageData, opt);
	fprintf('Time for feature extraction over %d images = %g sec\n', length(imageData), toc(myTic));
	fprintf('Saving ds.mat...\n');
	save ds ds
end
%%
% Note that since feature extraction is a lengthy process, we have saved the resulting variable "ds" to "ds.mat".
% If needed, you can simply load the file to restore the dataset variable "ds" and play around with it.
% But if you have changed the feature extraction function, be sure to delete ds.mat first to force a fresh feature extraction.
%%
% Basically, the extracted features are based on the regions separated by Otsu's method.
% We consider only the region with the maximum area, and compute its region properties and color statistics as features.
% You can type "leafFeaExtract" for a self-demo of the function:
figure; leafFeaExtract;
%% Dataset visualization
% Once we have all the necessary information stored in "ds",
% we can invoke many different functions in the Machine Learning Toolbox for
% data visualization and classification.
%%
% For instance, we can display the size of each class:
figure; [classSize, classLabel]=dsClassSize(ds, 1);
%%
% We can plot the distribution of each feature within each class:
figure; dsBoxPlot(ds);
%%
% The box plots indicate that the ranges of the features vary a lot. To verify this,
% we can simply plot the range of each feature of the dataset:
figure; dsRangePlot(ds);
%%
% Big range differences cause problems in distance-based classification.
% To avoid this, we can simply apply z-normalization to each feature:
ds2=ds;
ds2.input=inputNormalize(ds2.input);
%%
% We can now plot the feature vectors within each class:
figure; dsFeaVecPlot(ds); figEnlarge;
%%
% We can also do scatter plots on each pair of the original features:
figure; dsProjPlot2(ds); figEnlarge;
%%
% The above plots are hard to read due to the large differences in the ranges of the features.
% We can try the same plot with normalized inputs:
figure; dsProjPlot2(ds2); figEnlarge;
%%
% We can also do the scatter plots in 3D space:
figure; dsProjPlot3(ds2); figEnlarge;
%%
% To visualize the distribution of the dataset,
% we can project the original dataset onto a 2-D space.
% This can be achieved by LDA (linear discriminant analysis):
ds2d=lda(ds);
ds2d.input=ds2d.input(1:2, :);
figure; dsScatterPlot(ds2d); xlabel('Input 1'); ylabel('Input 2');
title('Features projected on the first 2 LDA vectors');
%% Classification
% We can try the most straightforward KNNC (k-nearest-neighbor classifier):
rr=knncLoo(ds);
fprintf('rr=%g%% for ds\n', rr*100);
%%
% With the normalized dataset, we usually obtain better accuracy:
[rr, computed]=knncLoo(ds2);
fprintf('rr=%g%% for ds2 of normalized inputs\n', rr*100);
%%
% We can plot the confusion matrix:
confMat=confMatGet(ds2.output, computed);
opt=confMatPlot('defaultOpt');
opt.className=ds.outputName;
opt.mode='both';
figure; confMatPlot(confMat, opt);
%%
% We can perform sequential input selection to find the best features:
figure; tic; inputSelectSequential(ds2, inf, 'knnc'); toc
%%
% Since the number of features is not too big, we can also use exhaustive search to find the best features:
figure; tic; inputSelectExhaustive(ds2, inf, 'knnc'); toc
%%
% Exhaustive search is guaranteed to find the best feature subset, but
% at the cost of more computation.
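%%
% The toolbox functions "inputNormalize" and "knncLoo" hide the underlying
% computation. For reference, a bare-bones version of z-normalization followed
% by leave-one-out 1-nearest-neighbor evaluation might look like the sketch
% below (this is an illustrative reimplementation, not the toolbox code; it
% assumes ds.input is a dim-by-count matrix and ds.output a 1-by-count label
% vector, as used throughout this script, and it uses implicit expansion,
% which requires MATLAB R2016b or later):

```matlab
% Hedged sketch of z-normalization + leave-one-out 1-NN (not the toolbox code)
X = ds.input;                          % dim x count feature matrix
y = ds.output;                         % 1 x count class labels
X = (X - mean(X,2)) ./ std(X,0,2);     % z-normalization per feature (row)
n = size(X,2);
correct = 0;
for i = 1:n
	d = sum((X - X(:,i)).^2, 1);       % Squared distances from sample i to all samples
	d(i) = inf;                        % Leave sample i out of its own neighbor search
	[~, nb] = min(d);                  % Index of the nearest neighbor
	correct = correct + (y(nb) == y(i));
end
fprintf('LOO 1-NN recognition rate = %g%%\n', correct/n*100);
```

% For k > 1 neighbors, ties, or other distance metrics, the toolbox's knncLoo
% should be preferred over this sketch.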
%%
% We can even perform an exhaustive search over the classifiers and the ways
% of input normalization:
opt=perfCv4classifier('defaultOpt');
opt.foldNum=10;
tic; [perfData, bestId]=perfCv4classifier(ds, opt, 1); toc
structDispInHtml(perfData, 'Performance of various classifiers via cross validation');
%%
% We can then display the confusion matrix of the best classifier:
confMat=confMatGet(ds.output, perfData(bestId).bestComputedClass);
opt=confMatPlot('defaultOpt');
opt.className=ds.outputName;
figure; confMatPlot(confMat, opt);
%%
% We can also list all the misclassified images in a table:
for i=1:length(imageData)
	imageData(i).classIdPredicted=perfData(bestId).bestComputedClass(i);
	imageData(i).classPredicted=ds.outputName{imageData(i).classIdPredicted};
end
listOpt=mmDataList('defaultOpt');
mmDataList(imageData, listOpt);
%% Summary
% This is a brief tutorial on leaf recognition based on shape and color statistics.
% There are several directions for further improvement:
%
% * Explore other features, such as vein distribution
% * Try the classification problem using the whole dataset
% * Use template matching as an alternative to improve the performance
%
%% Appendix
% List of functions, scripts, and datasets used in this script:
%
% * <../leafSorted.rar Dataset> used in this script.
% * <../list.asp List of files in this folder>
%
%%
% Overall elapsed time:
toc(scriptStartTime)
%%
% , created on datetime
%%
% If you are interested in the original MATLAB code for this page, you can
% type "grabcode(URL)" under MATLAB, where URL is the web address of this
% page.