Part 10: Imbalanced classification: problems and solutions

Source: Internet
Author: User

Objective

The previous articles discussed several classification algorithms, but one issue was neglected: the imbalanced classification problem.

Imbalanced classification arises in two situations:

Case One: the numbers of positive and negative examples differ greatly.

For example, when analyzing normal and fraudulent samples in a credit card data set, the normal samples far outnumber the fraudulent ones.

Case Two: different classification outcomes (correct or wrong) carry different costs.

For example, when analyzing patients' physical examination data, we certainly do not want to miss a case. Diagnosing a sick patient as healthy is therefore more serious than diagnosing a healthy patient as sick.

With this kind of imbalance, judging classification quality by the error rate alone is not sound.

This article introduces some new measures and tools for evaluating classification quality. Based on these, the classification code can be tuned so that the classifier is more useful in practice.

Tool One: Confusion matrix

First, a few concepts:

1. TP: true positive. The sample is actually positive and is predicted positive.

2. FN: false negative. The sample is actually positive but is predicted negative.

3. FP: false positive. The sample is actually negative but is predicted positive.

4. TN: true negative. The sample is actually negative and is predicted negative.

The table (in matrix form) that lists these four counts is the confusion matrix:

                   Predicted positive    Predicted negative
Actual positive    TP                    FN
Actual negative    FP                    TN

A table like this helps people evaluate classification quality very well. A program or machine, however, needs a hard, cost-based indicator.
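The four counts defined above can be tallied directly from a list of actual labels and a list of predicted labels. The following is a minimal sketch, not code from the original article: the function name `confusion_counts`, the `positive` parameter, and the sample labels are all illustrative assumptions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for two equal-length label sequences."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actually positive, predicted positive
        elif a == positive:
            fn += 1          # actually positive, predicted negative
        elif p == positive:
            fp += 1          # actually negative, predicted positive
        else:
            tn += 1          # actually negative, predicted negative
    return tp, fn, fp, tn

# Made-up labels for illustration: +1 is the positive class, -1 the negative class.
actual    = [1, 1, 1, -1, -1, -1, -1, -1]
predicted = [1, 1, -1, 1, -1, -1, -1, -1]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 4)
```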

The next tool, the ROC curve, is therefore the standard weapon in machine learning for dealing with imbalanced classification.

Tool Two: ROC curve

The ROC curve is a two-dimensional plot. The horizontal axis is the false positive rate, and the vertical axis is the true positive rate.

True positive rate = TP/(TP+FN): among all samples that are actually positive, the fraction that are predicted positive.

False positive rate = FP/(FP+TN): among all samples that are actually negative, the fraction that are predicted positive.

The curve is traced out as the classification threshold changes. Each point gives the true positive rate and the false positive rate at one threshold.
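To make the role of the threshold concrete, here is a small sketch that computes one (false positive rate, true positive rate) pair per threshold. It is an illustration under assumptions, not the article's own code: `roc_points` is a hypothetical helper, a sample is predicted positive when its score is at or above the threshold, and both classes are assumed to be present.

```python
def roc_points(scores, labels, positive=1):
    """Return (FPR, TPR) pairs, from the strictest threshold to the loosest."""
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == positive)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Made-up scores and labels for illustration.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, -1, 1, -1]
for fpr, tpr in roc_points(scores, labels):
    print(round(fpr, 3), round(tpr, 3))
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the list and the last point is always (1, 1).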

For example:

[Figure: an example ROC curve]

For the ROC curve, the ratio of the area under the curve (the blue region in the figure) to the total area, known as the AUC (area under the curve), measures the overall performance of the classifier. Keep in mind, though, that it cannot replace looking at the whole curve and at the error rate.

The theoretical optimum of the AUC is 1, and random guessing gives 0.5. When the positive class matters more (as in diagnosing a patient's illness), the AUC should lie between 0.5 and 1, and the closer to 1 the better.
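As a rough check on these reference values, the area under a set of ROC points can be approximated with the trapezoid rule. This is a sketch only: `auc_trapezoid` is a hypothetical helper, and the point lists are made up for illustration.

```python
def auc_trapezoid(points):
    """Area under a curve given as (FPR, TPR) points, by the trapezoid rule."""
    pts = sorted(points)  # order by FPR, then by TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

diagonal = [(0.0, 0.0), (1.0, 1.0)]              # random guessing
perfect  = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # all positives ranked first
print(auc_trapezoid(diagonal))  # 0.5
print(auc_trapezoid(perfect))   # 1.0
```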

Plotting ROC Curves

To draw the ROC curve, first obtain the classifier's prediction strengths (the real-valued scores behind its predictions). Sweeping a threshold over these strengths yields the true positive rate and the false positive rate at each step.

The following code plots the ROC chart; it can be wrapped up and called whenever needed.

import numpy

#==========================================
# Input:
#   predStrengths: prediction strengths
#   classLabels: actual class labels
# Output:
#   ROC chart for this classifier
#==========================================
def plotROC(predStrengths, classLabels):
    'plot the ROC chart for the classifier'

    import matplotlib.pyplot as plt
    # current drawing position, starting at the top-right corner
    cur = (1.0, 1.0)
    # running sum of heights, used for the AUC
    ySum = 0.0
    # number of samples whose actual class is positive
    numPosClas = sum(numpy.array(classLabels) == 1.0)
    # y-axis step
    yStep = 1 / float(numPosClas)
    # x-axis step
    xStep = 1 / float(len(classLabels) - numPosClas)
    # indices that sort the prediction strengths in ascending order
    # (accepts a 1-D array or a 1xN matrix)
    sortedIndicies = numpy.asarray(predStrengths).ravel().argsort()
    # set up the canvas
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)

    # walk from the lowest strength to the highest, drawing from (1,1) toward (0,0)
    for index in sortedIndicies.tolist():
        if classLabels[index] == 1.0:
            delX = 0; delY = yStep
        else:
            delX = xStep; delY = 0
            ySum += cur[1]

        ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c='b')
        cur = (cur[0] - delX, cur[1] - delY)

    ax.plot([0, 1], [0, 1], 'b--')
    plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate')
    plt.title('ROC curve for AdaBoost horse colic detection system')
    ax.axis([0, 1, 0, 1])
    plt.show()
    print("The area under the curve is:", ySum * xStep)

The function above was used to plot the ROC chart for the AdaBoost classifier (10 iterations) from the previous article. The data set format is the same as before, and the contents are randomly generated.

  

Some other approaches

1. A cost function can be defined to compute a cost value from the confusion matrix (each entry is multiplied by its corresponding weight, and the results are summed).

2. For data sets with very unequal class sizes (such as credit card fraud), the number of samples can be adjusted. For example, given 5,000 genuine samples and 50 fraudulent samples, randomly select only 50 of the 5,000 genuine samples.
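Both approaches above can be sketched in a few lines. This is an illustration under assumptions rather than the article's own code: `weighted_cost` and `undersample` are hypothetical helpers, the cost weights are invented, and the samples are stand-in integers rather than real records.

```python
import random

def weighted_cost(tp, fn, fp, tn, weights):
    """Sum of the confusion-matrix counts, each multiplied by its cost weight."""
    return (tp * weights["tp"] + fn * weights["fn"]
            + fp * weights["fp"] + tn * weights["tn"])

def undersample(majority, minority, seed=0):
    """Randomly keep only as many majority samples as there are minority samples."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(majority, len(minority)), list(minority)

# Hypothetical weights: a missed positive (FN) costs five times a false alarm (FP).
weights = {"tp": 0, "fn": 5, "fp": 1, "tn": 0}
print(weighted_cost(2, 1, 1, 4, weights))  # 6

# Stand-in data: 5,000 genuine samples and 50 fraudulent samples.
genuine = list(range(5000))
fraud = list(range(5000, 5050))
kept_genuine, kept_fraud = undersample(genuine, fraud)
print(len(kept_genuine), len(kept_fraud))  # 50 50
```

Undersampling discards most of the majority class, so it trades information for balance; the weighted cost keeps all the data and instead changes how mistakes are scored.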

Summary

This concludes the discussion of classification algorithms for now.

A later article may share further analysis of some of these algorithms, but for now my attention shifts from classification to the next topic in supervised learning: regression.

The second day of the New Year holiday is almost over. In the coming year I will push harder, study well, and pay plenty of attention to exercise, ready to meet a challenging future.
