Part 10: Imbalanced classification, its problems and solutions

Last Update:2017-01-19 Source: Internet

Author: User


Objective

The previous articles discussed several classification algorithms. One thing was neglected, however: the imbalanced classification problem.

Imbalanced classification arises in two situations:

Case One: the numbers of positive and negative samples differ greatly.

For example, consider the normal samples and fraud samples in a credit card data set. Normal samples far outnumber fraud samples.

Case Two: the costs of different misclassifications are not equal.

For example, when analyzing patients' physical examination data, we certainly hope not to miss a single case. Diagnosing a sick patient as healthy is therefore far more serious than diagnosing a healthy person as sick.

In such imbalanced settings, judging classification quality by the error rate alone is misleading.
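As a minimal sketch of why the error rate alone misleads, consider a made-up data set (the numbers and the helper names `error_rate` and `missed_positives` are assumptions for illustration, not from the original article) in which a classifier that never flags fraud still scores a very low error rate:

```python
def error_rate(actual, predicted):
    """Fraction of samples whose predicted label differs from the actual label."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def missed_positives(actual, predicted, positive=1):
    """Number of actual positives that the classifier failed to flag."""
    return sum(1 for a, p in zip(actual, predicted)
               if a == positive and p != positive)


# Hypothetical data: 990 normal transactions (0) and 10 fraudulent ones (1).
actual = [0] * 990 + [1] * 10
# A useless classifier that labels every transaction as normal.
predicted = [0] * 1000

print(error_rate(actual, predicted))        # 0.01
print(missed_positives(actual, predicted))  # 10
```

The error rate is only 1%, yet every fraud case is missed, which is exactly the failure the error rate hides.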

This article introduces some new metrics and tools for measuring classification quality. With them, the classification code can be tuned so that the classifier is more useful in practice.

Tool One: Confusion matrix

First, a few concepts:

1. TP: true positive. The sample is actually positive and is predicted positive.

2. FN: false negative. The sample is actually positive but is predicted negative.

3. FP: false positive. The sample is actually negative but is predicted positive.

4. TN: true negative. The sample is actually negative and is predicted negative.

The table (in matrix form) that lists these four counts is the confusion matrix:

                      Predicted positive    Predicted negative
    Actual positive   TP                    FN
    Actual negative   FP                    TN
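The four counts can be tallied directly from the actual and predicted labels. The following is a small sketch; the function name `confusion_counts` and the sample labels are assumptions made up for illustration:

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for a binary classification result."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1      # actually positive, predicted positive
        elif a == positive:
            fn += 1      # actually positive, predicted negative
        elif p == positive:
            fp += 1      # actually negative, predicted positive
        else:
            tn += 1      # actually negative, predicted negative
    return tp, fn, fp, tn


# Hypothetical labels: 1 is the positive class, 0 the negative class.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

print(confusion_counts(actual, predicted))  # (2, 1, 1, 4)
```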

A table like this helps people evaluate classification quality very well. A program or machine, however, needs a hard, cost-based indicator.

Therefore, the next tool, the ROC curve, is a key weapon in machine learning for dealing with imbalanced classification.

Tool Two: ROC curve

The ROC curve is a two-dimensional plot. The horizontal axis is the false positive rate, and the vertical axis is the true positive rate.

True positive rate = TP/(TP+FN). Among all samples that are actually positive, this is the proportion predicted positive.

False positive rate = FP/(FP+TN). Among all samples that are actually negative, this is the proportion predicted positive.

The curve is traced out by varying the classification threshold. Each point on it gives the true positive rate and false positive rate at one threshold.
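To make the threshold sweep concrete, here is a rough sketch that computes one (false positive rate, true positive rate) point per threshold. The function name `roc_points` and the scores are hypothetical, and treating a sample as positive when its score is at or above the threshold is an assumption of this sketch:

```python
def roc_points(scores, labels, positive=1):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    num_pos = sum(1 for y in labels if y == positive)
    num_neg = len(labels) - num_pos
    # Every distinct score serves as a threshold, from strictest to most
    # lenient; an infinite threshold is added so the curve starts at (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == positive)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != positive)
        points.append((fp / num_neg, tp / num_pos))
    return points


# Hypothetical prediction strengths and actual labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold can only add predicted positives, so the points move from (0, 0) toward (1, 1).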

(Figure: an example ROC curve.)

For an ROC curve, the area under the curve (AUC), that is, the share of the plot lying below the blue curve in the figure, measures the overall performance of the classifier. Keep in mind, though, that it cannot replace inspecting the whole curve and the error rate.

The theoretical optimum of the AUC is 1, and random guessing gives 0.5. A useful classifier should therefore have an AUC between 0.5 and 1, and the closer to 1, the better. This matters most when the positive class is the more important one (as in diagnosing a patient's illness).
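One way to build intuition for these values uses an equivalent formulation that the article itself does not state: the AUC equals the probability that a randomly chosen positive sample receives a higher prediction strength than a randomly chosen negative one (ties count as half). The sketch below computes it by comparing every positive/negative pair; `auc_by_ranking` and the data are hypothetical names and values chosen for illustration:

```python
def auc_by_ranking(scores, labels, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical prediction strengths and actual labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

print(round(auc_by_ranking(scores, labels), 3))     # 0.889
print(auc_by_ranking([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

The pairwise loop is quadratic in the number of samples, so it suits small illustrations rather than large data sets.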

Plotting ROC Curves

To draw the ROC curve, first obtain the classifier's prediction strength for each sample. The true positive rate and false positive rate then change as the threshold moves through the sorted prediction strengths.

The following code draws the ROC chart. It is wrapped in a function so that it can be called whenever needed.

    import numpy as np
    import matplotlib.pyplot as plt

    #==========================================
    # Input:
    #   predStrengths: prediction strength of each sample
    #   classLabels:   actual class labels (1.0 marks the positive class)
    # Output:
    #   ROC chart for this classifier
    #==========================================
    def plotROC(predStrengths, classLabels):
        'plot the ROC chart for the classifier'
        # current drawing position; start at the top-right corner
        cur = (1.0, 1.0)
        # running sum of curve heights, used to compute the AUC
        ySum = 0.0
        # number of samples whose actual class is positive
        numPosClas = np.sum(np.array(classLabels) == 1.0)
        # y-axis step: one positive sample
        yStep = 1 / float(numPosClas)
        # x-axis step: one negative sample
        xStep = 1 / float(len(classLabels) - numPosClas)
        # indices that sort the prediction strengths in ascending order
        sortedIndicies = np.asarray(predStrengths).ravel().argsort()
        # set up the canvas
        fig = plt.figure()
        fig.clf()
        ax = plt.subplot(111)

        # walk from the weakest to the strongest prediction,
        # drawing one line segment per sample
        for index in sortedIndicies.tolist():
            if classLabels[index] == 1.0:
                delX = 0; delY = yStep
            else:
                delX = xStep; delY = 0
                ySum += cur[1]
            ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='b')
            cur = (cur[0] - delX, cur[1] - delY)

        # dashed diagonal: the ROC curve of random guessing
        ax.plot([0, 1], [0, 1], 'b--')
        plt.xlabel('False Positive rate'); plt.ylabel('True Positive rate')
        plt.title('ROC curve for AdaBoost horse colic detection system')
        ax.axis([0, 1, 0, 1])
        plt.show()
        print("The area under the Curve is:", ySum * xStep)

Calling the function above on the AdaBoost classifier (10 iterations) from the previous article draws its ROC chart (the dataset format is the same as before, and the content is randomly generated).

Some other approaches

1. A cost function can be defined to compute the cost of a confusion matrix (each entry is multiplied by its corresponding weight, and the products are summed).

2. When the class sizes are uneven (as in credit card fraud), the number of samples can be adjusted. For example, with 5,000 normal samples and 50 fraud samples, randomly select only 50 of the 5,000 normal samples.
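Both ideas in the list above can be sketched in a few lines. The function names `total_cost` and `undersample`, the weights, and the counts are all assumptions chosen for illustration; the article does not prescribe particular values:

```python
import random


def total_cost(counts, weights):
    """Sum each confusion-matrix entry multiplied by its weight."""
    return sum(counts[cell] * weights[cell] for cell in counts)


def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority samples as there are minority samples."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    kept = rng.sample(majority, len(minority))
    return kept + list(minority)


# Idea 1: hypothetical confusion-matrix counts and weights, where a missed
# positive (FN) is assumed to cost ten times as much as a false alarm (FP).
counts = {"TP": 2, "FN": 1, "FP": 1, "TN": 4}
weights = {"TP": 0, "FN": 10, "FP": 1, "TN": 0}
print(total_cost(counts, weights))  # 11

# Idea 2: 5,000 normal samples and 50 fraud samples, as in the text.
normal = [("normal", i) for i in range(5000)]
fraud = [("fraud", i) for i in range(50)]
balanced = undersample(normal, fraud)
print(len(balanced))  # 100
```

Undersampling discards most of the majority class, so it trades some information for a balanced training set.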

Summary

This concludes the discussion of classification algorithms for now.

A later article may share further analysis of some of these algorithms, but for now my attention shifts from classification to the next topic in supervised learning: regression.

The second day of the New Year's holiday is almost over. In the coming year I will keep working hard: study well, and also pay plenty of attention to exercise, to meet a challenging future.
