Preface
While studying data mining recently, I learned how to run naive Bayes and read its ROC curve. That is the subject of this experiment: the principle behind the ROC curve, and how to count TP, FP, TN and FN and derive TPR, FPR and the ROC area from them. The ROC area is often used to assess the accuracy of a model: the closer it is to 0.5, the lower the accuracy; the closer it is to 1, the better; and a perfect model has an area of 1. The steps are introduced below:
First, the calculation principle of the area under the ROC curve
Workflow diagram of the naive Bayes method
Second, use the Weka tool to obtain the preprocessed training data
1. Use the naive Bayes algorithm to process the Weather.nominal.arff file: open the file in Weka, select Temperature, and then select Edit to view the preprocessed data, as shown in Figure 1-1:
Figure 1-1 The full weather data
2. From the training tuples above, calculate the prior probability of each class. The formula is P(C), that is, the fraction of training tuples that belong to class C.
2.1. Calculate the prior probability
P(Play=yes) = 9/14 = 0.643
P(Play=no) = 5/14 = 0.357
2.2. Calculate the conditional probability according to the formula P(x|C)
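As a rough sketch of step 2, the priors and conditional probabilities can be counted directly from the training tuples. The 14 rows below are assumed to match the standard weather.nominal.arff sample that ships with Weka; the names ROWS, ATTRS, priors and cond are illustrative only. No smoothing is applied here, so the values may differ slightly from what Weka reports if its NaiveBayes smooths the counts.

```python
from collections import Counter

# Assumed contents of weather.nominal.arff (standard 14-row Weka sample).
# Column order: outlook, temperature, humidity, windy, play
ROWS = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
""".strip().splitlines()

ATTRS = ["outlook", "temperature", "humidity", "windy"]
data = [line.split(",") for line in ROWS]

# Prior P(C): fraction of training tuples in each class.
class_counts = Counter(row[-1] for row in data)
priors = {cls: n / len(data) for cls, n in class_counts.items()}

# Conditional P(attribute=value | play=cls), estimated by plain counting.
pair_counts = Counter(
    (attr, row[i], row[-1]) for row in data for i, attr in enumerate(ATTRS)
)
cond = {key: n / class_counts[key[2]] for key, n in pair_counts.items()}

print(priors)
print(cond[("outlook", "sunny", "yes")])
```

Combinations that never occur in the training data (for example outlook=overcast with play=no) are simply missing from cond, which corresponds to an unsmoothed probability of zero.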
3. Then substitute the data above into the formula. One tuple is shown as an example of classification by probability, x = (outlook=sunny, temperature=mid, humidity=yes, windy=sunny):
3.1. P(x|play=yes) = P(outlook=sunny|play=yes) * P(temperature=mid|play=yes) * P(humidity=yes|play=yes) * P(windy=sunny|play=yes)
In the same way, calculate P(x|play=no).
3.2. Compare the two results to decide the play class of the tuple
3.3. Then calculate the probability of the tuple
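Steps 3.1 to 3.3 can be sketched in a few lines. The tuple used here, x = (outlook=sunny, temperature=mild, humidity=high, windy=TRUE), is a hypothetical example chosen for illustration, and the conditional probabilities are unsmoothed counts taken from the standard 14-row weather data, which is assumed to match the file used above. The names likelihood, score and posterior are illustrative.

```python
from math import prod

# Priors from step 2.1.
priors = {"yes": 9 / 14, "no": 5 / 14}

# Assumed unsmoothed conditional probabilities for the hypothetical tuple
# x = (outlook=sunny, temperature=mild, humidity=high, windy=TRUE),
# counted from the standard 14-row weather data.
cond = {
    "yes": [2 / 9, 4 / 9, 3 / 9, 3 / 9],
    "no": [3 / 5, 2 / 5, 4 / 5, 3 / 5],
}

# 3.1: P(x|C) is the product of the per-attribute conditional probabilities.
likelihood = {cls: prod(ps) for cls, ps in cond.items()}

# 3.2: compare P(x|C) * P(C) across classes and pick the larger one.
score = {cls: likelihood[cls] * priors[cls] for cls in priors}
predicted = max(score, key=score.get)

# 3.3: normalise the scores so they sum to 1, giving P(C|x).
total = sum(score.values())
posterior = {cls: s / total for cls, s in score.items()}

print(predicted, posterior)
```

The normalised value for play=yes is the per-tuple probability that a ranking of tuples would use later when building the ROC curve.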
4. Then apply the method on page 244 of Data Mining: Concepts and Techniques, as shown in Figure 1-2:
Figure 1-2 Sample of the returned data
The data in this example are illustrative, not real. Because the probability of each tuple can be computed as in step 3.3, the tuples are sorted by probability in decreasing order. The TP, FP, TN and FN counts are then obtained from the sorted probabilities and the actual class labels, and TPR and FPR follow easily from those counts.
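The counting described above can be sketched as follows. The ten (label, probability) pairs are hypothetical values made up for illustration, not the data from the figure, and the function name roc_points is an assumption. The sketch lowers the threshold one tuple at a time and assumes that no two tuples share the same probability.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per threshold, starting at (0, 0).

    labels: 1 for a positive tuple, 0 for a negative tuple.
    scores: predicted probability of the positive class.
    Assumes all scores are distinct (ties are not merged).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        # Every tuple at or above the current threshold is predicted positive.
        if label == 1:
            tp += 1
        else:
            fp += 1
        fn = pos - tp  # positives still below the threshold
        tn = neg - fp  # negatives still below the threshold
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        points.append((fpr, tpr))
    return points


# Hypothetical example: 5 positive and 5 negative tuples.
labels = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.54, 0.53, 0.51, 0.50, 0.40]
points = roc_points(labels, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.1f}  TPR={tpr:.1f}")
```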
5. Again following page 245 of Data Mining: Concepts and Techniques, plot the ROC curve with FPR on the x-axis and TPR on the y-axis. Substituting the data from step 4 gives the curve shown in Figure 1-3:
Figure 1-3 ROC curve of the returned data
From the shape of the curve, the area under the ROC curve is calculated mathematically as 0.9222. Then check the value reported by the Weka tool, as shown in Figure 1-4:
Figure 1-4 Data returned by Weka
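One common way to compute the area "mathematically" is the trapezoidal rule over the (FPR, TPR) points. The sketch below assumes that method; the original does not say which one was used. The points are hypothetical and do not reproduce the 0.9222 value, and the function name auc_trapezoid is illustrative.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given as (FPR, TPR) points, by the trapezoidal rule."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Vertical steps have x1 == x0 and contribute nothing.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical staircase ROC curve for 5 positive and 5 negative tuples.
points = [
    (0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.2, 0.6), (0.2, 0.8),
    (0.4, 0.8), (0.6, 0.8), (0.8, 0.8), (0.8, 1.0), (1.0, 1.0),
]
area = auc_trapezoid(points)
perfect = auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
diagonal = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])
print(area, perfect, diagonal)
```

The two extra curves illustrate the reference values mentioned in the preface: a perfect model gives an area of 1, and the diagonal of a random guess gives 0.5.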
Resources:
[1] Data mining using Weka (http://www.cnblogs.com/bluewelkin/p/3538599.html)
[2] Using Weka (basic configuration + spam filtering + cluster analysis + association mining) (http://www.cnblogs.com/bitpeach/p/3770606.html)
[3] The calculation method of the area under the ROC curve (http://wenku.baidu.com/view/3d2ac9202f60ddccda38a07a.html?re=view)
[4] Jiawei Han, Data Mining: Concepts and Techniques, pp. 243-245.
[5] Classification (data mining) (http://wenku.baidu.com/link?url=EdT7Xxs-a_ 423om-48ih-kxtteprxeejci0-xsm1yk9xbkzgtvwqyiznpzwua8a-dlf-krehls63u9pxxxudjfcsdmbpz2kex5bhwtyswhe&qq-pf-to =PCQQ.C2C)
"Data mining" naive Bayesian algorithm for calculating the area of ROC curves