Data Mining Series (7) Classification algorithm evaluation

Last Update:2017-02-27 Source: Internet

Author: User


I. Introduction

There are many classification algorithms, and each of them comes in many variants. Different algorithms have different characteristics and perform differently on different data sets, so we need to choose an algorithm to suit the specific task. How do we choose a classifier, and how do we evaluate the quality of a classification algorithm? In the earlier introduction to decision trees, we mainly used the correct rate (accuracy) to evaluate the classification algorithm.

Accuracy is a very intuitive evaluation index, but a high accuracy does not always mean that an algorithm is good. Take earthquake prediction for a certain region: suppose we have a set of features as the attributes for seismic classification and only two categories, 0 (no earthquake) and 1 (earthquake). A classifier that blindly assigns category 0 to every test case can reach 99% accuracy, but when a real earthquake comes it gives no warning, and the resulting losses are enormous. Why is a classifier with 99% accuracy not what we want? Because the data distribution is unbalanced: category 1 has so few samples that misclassifying every one of them still gives a very high accuracy, while ignoring exactly the cases we care about. Next, the evaluation indices of classification algorithms are introduced in detail.
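The earthquake example can be reproduced with a few lines of Python. This is only a sketch under assumed numbers: the 1000 test cases with 10 earthquakes are made up to match the 99% figure, and the helper name `accuracy` is illustrative rather than taken from the article.

```python
def accuracy(y_true, y_pred):
    """Fraction of test cases whose predicted class equals the actual class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical test set: 10 earthquakes (class 1) among 1000 cases.
y_true = [1] * 10 + [0] * 990

# The indiscriminate classifier: every test case is assigned class 0.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
# Number of real earthquakes that the classifier actually flagged.
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print("accuracy:", acc)
print("earthquakes detected:", detected, "of", sum(y_true))
```

The accuracy is high only because class 0 dominates the data; not a single earthquake is detected.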

II. Indicators of Evaluation

1. Several common terms

Here we first introduce a few common model evaluation terms. Assume for now that the classification target has only two categories, positive examples (positive) and negative examples (negative):

1) True positives (TP): the number of instances correctly classified as positive, that is, instances that are actually positive and are classified as positive by the classifier;

2) False positives (FP): the number of instances wrongly classified as positive, that is, instances that are actually negative but are classified as positive by the classifier;

3) False negatives (FN): the number of instances wrongly classified as negative, that is, instances that are actually positive but are classified as negative by the classifier;

4) True negatives (TN): the number of instances correctly classified as negative, that is, instances that are actually negative and are classified as negative by the classifier.
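The four counts defined above can be tallied directly from the actual and predicted labels. The following is a minimal sketch; the function name `confusion_counts`, the `positive` parameter, and the eight sample labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a two-class problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # classified positive, actually positive
            else:
                fp += 1  # classified positive, actually negative
        else:
            if actual == positive:
                fn += 1  # classified negative, actually positive
            else:
                tn += 1  # classified negative, actually negative
    return tp, fp, fn, tn

# Hypothetical labels: 1 = positive example, 0 = negative example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 2 1 2 3
```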

These four terms make up the confusion matrix. I only know FP by the name "pseudo-positive" (false positive); I do not know the established names for the others. Note that P = TP + FN is the number of samples that are actually positive. I once mistakenly thought the number of actual positive samples should be TP + FP. Just remember that True and False describe whether the classifier was correct, while Positive and Negative are the classifier's classification results. If we code the positive example as 1 and the negative example as -1 (Positive = 1, Negative = -1), and code True as 1 and False as -1, then the actual class label = TF * PN, where TF is True or False and PN is Positive or Negative. For example, the actual class label for true positives (TP) is 1 * 1 = 1, a positive example; for false positives (FP) it is (-1) * 1 = -1, a negative example; for false negatives (FN) it is (-1) * (-1) = 1, a positive example; and for true negatives (TN) it is 1 * (-1) = -1, a negative example.
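The sign rule can be checked mechanically. The sketch below encodes True/Positive as 1 and False/Negative as -1, as in the paragraph above; the constant and dictionary names are illustrative only.

```python
# Encoding from the text: True = 1, False = -1, Positive = 1, Negative = -1.
TRUE, FALSE = 1, -1
POSITIVE, NEGATIVE = 1, -1

# Each term is a pair (was the classifier right?, what did it predict?).
terms = {
    "TP": (TRUE, POSITIVE),
    "FP": (FALSE, POSITIVE),
    "FN": (FALSE, NEGATIVE),
    "TN": (TRUE, NEGATIVE),
}

# Actual class label = TF * PN.
actual = {name: tf * pn for name, (tf, pn) in terms.items()}

for name, label in actual.items():
    kind = "positive" if label == 1 else "negative"
    print(name, "-> actual label", label, "(" + kind + ")")

# The terms whose actual label is 1 are exactly the ones summed in P.
actually_positive = sorted(name for name, label in actual.items() if label == 1)
print(actually_positive)  # ['FN', 'TP']
```

This also shows why P = TP + FN rather than TP + FP: only TP and FN have an actual label of 1.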

2. Evaluation Index

1) Correct rate (accuracy)

Accuracy is our most common evaluation indicator: accuracy = (TP + TN) / (P + N), where N = FP + TN is the number of samples that are actually negative. It is easy to understand: the number of correctly classified samples divided by the total number of samples. Generally speaking, the higher the accuracy, the better the classifier;

2) Error rate (error rate)

The error rate is the opposite of accuracy: it describes the proportion of samples that the classifier gets wrong, error rate = (FP + FN) / (P + N). For any single instance, being classified correctly and being classified wrongly are mutually exclusive events, so accuracy = 1 - error rate;
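Both formulas can be written as small functions of the four counts. This is a sketch with assumed function names and made-up counts; it uses P + N = TP + FN + FP + TN for the denominator.

```python
def accuracy(tp, fp, fn, tn):
    """accuracy = (TP + TN) / (P + N)"""
    return (tp + tn) / (tp + fp + fn + tn)

def error_rate(tp, fp, fn, tn):
    """error rate = (FP + FN) / (P + N)"""
    return (fp + fn) / (tp + fp + fn + tn)

# Hypothetical counts for illustration.
tp, fp, fn, tn = 2, 1, 2, 3

acc = accuracy(tp, fp, fn, tn)
err = error_rate(tp, fp, fn, tn)

print(acc, err)   # 0.625 0.375
print(acc + err)  # 1.0, since every instance is either right or wrong
```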

3) Sensitivity (sensitivity)

sensitivity = TP / P, the proportion of all positive examples that are classified correctly; it measures the classifier's ability to recognize positive examples.
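Sensitivity is what exposes an always-negative classifier such as the one in the earthquake example. The sketch below assumes hypothetical counts (10 actual earthquakes among 1000 cases) and an illustrative function name; returning 0.0 when there are no positive samples is a choice made here, not something the article specifies.

```python
def sensitivity(tp, fn):
    """sensitivity = TP / P, with P = TP + FN."""
    p = tp + fn
    if p == 0:
        return 0.0  # assumption: treat "no positive samples" as 0.0
    return tp / p

# Always-negative classifier on 1000 cases with 10 earthquakes:
# no earthquake is flagged, so TP = 0 and FN = 10.
always_negative = sensitivity(tp=0, fn=10)

# A classifier that finds half of the positive examples.
half_found = sensitivity(tp=2, fn=2)

print(always_negative)  # 0.0
print(half_found)       # 0.5
```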

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
