Notes on Machine Learning in Action: Unbalanced Classification

Last Update: 2014-08-18 Source: Internet

Author: User


Generally, the error rate of the classification results is used as the criterion for evaluating a classifier. However, when the numbers of positive and negative examples in the training data are far from equal, this criterion becomes misleading: a classifier that always predicts the majority class can achieve a low error rate while never detecting the minority class. This situation is known as unbalanced classification. The following measures can be used instead.
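To make the problem with the error rate concrete, here is a minimal sketch. The dataset (10 positive and 990 negative samples) and the helper names `error_rate` and `recall` are hypothetical, chosen only for illustration; labels follow the +1/-1 convention.

```python
def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the real positive samples that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


# Hypothetical unbalanced data: 10 positive (+1) and 990 negative (-1) samples.
y_true = [1] * 10 + [-1] * 990

# A useless "classifier" that labels every sample as negative.
always_negative = [-1] * len(y_true)

print("error rate:", error_rate(y_true, always_negative))
print("recall:", recall(y_true, always_negative))
```

The error rate is only 1%, yet not a single positive sample is found, which is why the error rate alone is not a reliable criterion here.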

(1) Precision and recall

Let TP be the number of true positives, FP the number of false positives, and FN the number of false negatives. Precision is the proportion of the examples predicted as positive that are really positive, which is equal to TP/(TP + FP). Recall is the proportion of all real positive examples that are predicted as positive, which is equal to TP/(TP + FN). It is easy to construct a classifier with high precision or with high recall, but difficult to guarantee both. For example, if every sample is judged to be positive, recall reaches 1 while precision is very low. Constructing a classifier that maximizes both precision and recall is challenging. In this case, we can use the F-score = 2 * precision * recall/(precision + recall) as a single measure. The larger the value, the better.
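The measures above can be computed directly from the counts TP, FP and FN. The following is a minimal sketch, not taken from the original article: the function name `precision_recall_f1` and the sample labels are hypothetical, and the F-score used is the standard F1, 2 * precision * recall/(precision + recall).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 positive (+1) and 6 negative (-1) samples.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, -1, 1, -1, -1, -1, -1, -1]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print("precision=%.3f recall=%.3f f1=%.3f" % (p, r, f1))

# Judging every sample as positive drives recall to 1 but hurts precision.
all_positive = [1] * len(y_true)
p_all, r_all, f1_all = precision_recall_f1(y_true, all_positive)
print("precision=%.3f recall=%.3f f1=%.3f" % (p_all, r_all, f1_all))
```

In the first case TP = 2, FP = 1 and FN = 2, so precision is 2/3 and recall is 1/2. In the second case recall is 1 but precision drops to 4/10, which matches the trade-off described above.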

(2) ROC curve

The plotroc function below draws the ROC curve from the classifier's prediction strengths and prints the area under the curve (AUC).

    import numpy as np
    import matplotlib.pyplot as plt

    def plotroc(predstrengths, classlabels):
        cur = (1.0, 1.0)  # cursor, starts at the top-right corner
        ysum = 0.0        # variable to calculate AUC
        numposclas = np.sum(np.array(classlabels) == 1.0)
        ystep = 1 / float(numposclas)
        xstep = 1 / float(len(classlabels) - numposclas)
        # get sorted indices (ascending strength), so the curve is drawn
        # in reverse, from (1, 1) back to (0, 0)
        sortedindicies = np.asarray(predstrengths).ravel().argsort()
        fig = plt.figure()  # these three lines set up the figure to draw on
        fig.clf()
        ax = plt.subplot(111)
        # loop through all the values, drawing a line segment at each point
        for index in sortedindicies.tolist():
            if classlabels[index] == 1.0:
                # positive example: step down along the y axis
                delx = 0
                dely = ystep
            else:
                # negative example: step left along the x axis
                delx = xstep
                dely = 0
                ysum += cur[1]
            # draw line from cur to (cur[0] - delx, cur[1] - dely)
            ax.plot([cur[0], cur[0] - delx], [cur[1], cur[1] - dely], c='b')
            cur = (cur[0] - delx, cur[1] - dely)
        ax.plot([0, 1], [0, 1], 'b--')  # diagonal of a random classifier
        plt.xlabel('false positive rate')
        plt.ylabel('true positive rate')
        plt.title('roc curve for AdaBoost horse colic detection system')
        ax.axis([0, 1, 0, 1])
        plt.show()
        print("the area under the curve is:", ysum * xstep)
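The AUC bookkeeping inside plotroc can also be followed without any plotting. The sketch below repeats the same sweep (start at height 1, step down for each positive example, add one rectangle of the current height for each negative example). The function name `auc_from_scores` and the sample scores are hypothetical, and ties between scores are not treated specially.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve, using the same sweep as plotroc."""
    num_pos = sum(1 for y in labels if y == 1)
    num_neg = len(labels) - num_pos
    y_step = 1.0 / num_pos
    x_step = 1.0 / num_neg
    # Visit samples from the lowest to the highest score, i.e. walk the
    # curve from the top-right corner (1, 1) back to (0, 0).
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    height = 1.0   # current true positive rate
    y_sum = 0.0    # sum of rectangle heights
    for i in order:
        if labels[i] == 1:
            height -= y_step   # positive example: move down
        else:
            y_sum += height    # negative example: move left, add a rectangle
    return y_sum * x_step


# Hypothetical prediction strengths with +1/-1 labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, -1, 1, -1, -1]
print("AUC:", auc_from_scores(scores, labels))
```

For this data 8 of the 9 positive/negative pairs are ranked correctly, so the result is 8/9. A classifier that ranks every positive example above every negative one would give an AUC of 1, and random ranking gives about 0.5.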

Author: Small Village Chief. Source: http://blog.csdn.net/lu597203933. You are welcome to reprint or share this article, but please be sure to state its source. (Sina Weibo: Small Village Chief Zack. Thank you!)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
