5-5 Naive Bayes Classifiers

Slides

If we assume that the features of the given dataset are independent, then we can perform PDF modeling on each dimension separately. The final PDF for each class is then the product of that class's PDFs over the individual dimensions. In symbols, we assume that

p(X|C) = p(X1|C) p(X2|C) ... p(Xd|C)

where X = [X1, X2, ..., Xd] is a feature vector and C is a class. Although this assumption seems too strong for real-world data, the resulting naive Bayes classifier (NBC for short) is highly successful in many practical applications.
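
As a concrete illustration of this factorization, the following minimal MATLAB sketch evaluates p(x|C) for a single vector as the product of per-dimension 1D Gaussians (anticipating the Gaussian modeling described next). The variable names mu, sigma2, and x are illustrative and not part of the toolbox:

% Evaluate p(x|C) as the product of per-dimension 1D Gaussian PDFs.
% mu and sigma2 are assumed d-by-1 vectors of per-dimension means and variances
% for one class; x is a d-by-1 feature vector. (Illustrative sketch only.)
pdf1d = exp(-(x - mu).^2 ./ (2*sigma2)) ./ sqrt(2*pi*sigma2);   % p(xj|C) for each dimension j
pxGivenC = prod(pdf1d);                                         % p(x|C) under the independence assumption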

In practice, we can use 1D Gaussian functions to model the data of a specific class along a specific dimension. The training of an NBC can then be summarized as follows (a minimal MATLAB sketch of these steps is given after the list):

  1. Assume that the data in class i and dimension j is governed by a 1-dimensional Gaussian PDF:
    g(x, mi, si^2) = [(2*pi)*si^2]^(-0.5) * exp[-(x-mi)^2/(2*si^2)]
    where mi is the mean value and si^2 is the variance of the PDF. Given the dimension-j training data {x1, x2, ..., xn} in class i, the optimum parameters in the sense of maximum likelihood can be expressed as follows:
    mi = (x1+x2+...+xn)/n
    si^2 = [(x1-mi)^2 + (x2-mi)^2 + ... + (xn-mi)^2]/(n-1)
    (Strictly speaking, the maximum-likelihood estimate of the variance divides by n; dividing by n-1 gives the unbiased estimate, and the difference is negligible when n is large.)
  2. Multiplying the per-dimension 1D Gaussians of class i gives its class-conditional density g(x, mi, Si), a d-dimensional Gaussian with mean vector mi and diagonal covariance matrix Si. Since this density does not reflect the prior probability of class i (denoted by p(ci)), we usually use p(ci)g(x, mi, Si) to represent the probability density of a given vector x. In practice, p(ci) is computed as the number of entries of class i divided by the total number of entries in the training set.
  3. In application, for a given vector x of unknown class, the higher the probability density p(ci)g(x, mi, Si), the more likely it is that the vector belongs to class i; the vector is therefore assigned to the class that maximizes p(ci)g(x, mi, Si).
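
The steps above can be written out directly. Here is a minimal MATLAB sketch of the training stage under the stated assumptions; X, y, and the other variable names are illustrative, while the toolbox function nbcTrain used in the examples below performs the actual training:

% Minimal NBC training sketch (illustrative; not the toolbox's nbcTrain).
% X is a d-by-n data matrix and y is a 1-by-n vector of class labels 1..k.
[d, n] = size(X);
k = max(y);
mu = zeros(d, k);        % per-class, per-dimension means
sigma2 = zeros(d, k);    % per-class, per-dimension variances
prior = zeros(1, k);     % class prior probabilities
for i = 1:k
    Xi = X(:, y == i);               % training entries of class i
    mu(:, i) = mean(Xi, 2);          % mean along each dimension
    sigma2(:, i) = var(Xi, 0, 2);    % variance along each dimension (divides by n_i - 1, as in step 1)
    prior(i) = size(Xi, 2)/n;        % fraction of entries in class i, as in step 2
end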

When implementing the classifier, we usually do not calculate p(ci)g(x, mi, Si) directly, since evaluating the exponential function is prone to numerical underflow and round-off errors. Instead, we compute the natural logarithm of the probability density, as follows:

log[p(ci)g(x, mi, Si)] = log(p(ci)) - [d*log(2*pi) + log|Si|]/2 - (x-mi)^T Si^(-1) (x-mi)/2
The decision boundary between classes i and j is represented by the following trajectory:
p(ci)g(x, mi, Si) = p(cj)g(x, mj, Sj).
Taking the logarithm of both sides, we have
log(p(ci)) - [d*log(2*pi) + log|Si|]/2 - (x-mi)^T Si^(-1) (x-mi)/2 = log(p(cj)) - [d*log(2*pi) + log|Sj|]/2 - (x-mj)^T Sj^(-1) (x-mj)/2
After simplification, the decision boundary is given by the following equation:
(x-mj)^T Sj^(-1) (x-mj) - (x-mi)^T Si^(-1) (x-mi) = log{[|Si| p^2(cj)]/[|Sj| p^2(ci)]}
where the right-hand side is a constant. Since both (x-mj)^T Sj^(-1) (x-mj) and (x-mi)^T Si^(-1) (x-mi) are quadratic in x, the above equation defines a quadratic decision boundary in the d-dimensional feature space.
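
As a concrete illustration of the log-density formula above, the following minimal MATLAB sketch scores a test vector against every class and picks the most likely one. It reuses the illustrative variables mu, sigma2, and prior from the training sketch; since Si is diagonal for an NBC, log|Si| is simply the sum of the per-dimension log-variances:

% Classify a d-by-1 test vector x via log-densities to avoid underflow of exp.
logDensity = zeros(1, k);
for i = 1:k
    z = (x - mu(:, i)).^2 ./ sigma2(:, i);                  % (x-mi)^T Si^(-1) (x-mi) with diagonal Si
    logDensity(i) = log(prior(i)) ...
        - 0.5*(d*log(2*pi) + sum(log(sigma2(:, i)))) ...    % -[d*log(2*pi) + log|Si|]/2
        - 0.5*sum(z);
end
[~, predictedClass] = max(logDensity);                      % assign x to the class with the largest log-density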

In the following example, we shall use dimensions 3 and 4 of the Iris dataset for NBC:

Example 1: nbc01dataPlot.m
DS = prData('iris');
DS.input = DS.input(3:4, :);   % Only take dimensions 3 and 4 for 2d visualization
plotOpt = 1;                   % Plotting
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS, [], plotOpt);
fprintf('Recog. rate = %f%%\n', recogRate*100);

Output:
Recog. rate = 96.000000%

In the above example, each misclassified data point is labeled with a cross. Moreover, the fraction of entries belonging to each class is used as that class's prior probability when computing the probability density. In general, there are two ways to specify class prior probabilities: estimate them from the class frequencies in the training set (as done above), or assign them directly, for instance as equal priors for every class when the class distribution of the training set is not representative of the real-world data.
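
The following small sketch shows both conventions; the variable names are illustrative, with y being a 1-by-n vector of class labels 1..k:

% Two ways to set the class priors (illustrative sketch).
nPerClass = zeros(1, k);
for i = 1:k
    nPerClass(i) = sum(y == i);              % number of training entries in class i
end
priorFreq = nPerClass / sum(nPerClass);      % (1) priors from class frequencies, as used above
priorUniform = ones(1, k) / k;               % (2) equal priors for every class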

We can plot the 1d PDFs for all classes and dimensions, together with the corresponding data points, as follows:

Example 2: nbcPlot00.m
DS = prData('iris');
DS.input = DS.input(3:4, :);
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS);
nbcPlot(DS, nbcPrm, '1dPdf');

For the above example, we can go one step further to plot the class-specific PDFs as 3D surfaces and display the corresponding contours, as follows:

Example 3: nbcPlot01.m
DS = prData('iris');
DS.input = DS.input(3:4, :);
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS);
nbcPlot(DS, nbcPrm, '2dPdf');

Based on the computed PDF for each class, we can plot the decision boundaries, as follows:

Example 4: nbcPlot02.m
DS = prData('iris');
DS.input = DS.input(3:4, :);
[nbcPrm, logLike, recogRate, hitIndex] = nbcTrain(DS);
DS.hitIndex = hitIndex;   % Attach hitIndex to DS for plotting
nbcPlot(DS, nbcPrm, 'decBoundary');

