Classification Model Assessment:
| Indicator | Description | Scikit-learn function |
| --- | --- | --- |
| Precision | Precision | from sklearn.metrics import precision_score |
| Recall | Recall rate | from sklearn.metrics import recall_score |
| F1 | F1 score | from sklearn.metrics import f1_score |
| Confusion Matrix | Confusion matrix | from sklearn.metrics import confusion_matrix |
| ROC | ROC curve | from sklearn.metrics import roc_curve |
| AUC | Area under the ROC curve | from sklearn.metrics import auc |
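As a quick illustration of how these functions are called, here is a minimal sketch with made-up labels (the y_true and y_pred values below are assumptions for illustration, not data from the article):

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [0, 1, 1, 0, 1, 1]  # assumed ground-truth labels
y_pred = [0, 1, 0, 0, 1, 1]  # assumed model predictions

print(precision_score(y_true, y_pred))   # precision = TP / (TP + FP)
print(recall_score(y_true, y_pred))      # recall    = TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class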
Regression Model Evaluation:
| Indicator | Description | Scikit-learn function |
| --- | --- | --- |
| Mean Squared Error (MSE, RMSE) | Mean squared error | from sklearn.metrics import mean_squared_error |
| Absolute Error (MAE, MedAE) | Absolute error | from sklearn.metrics import mean_absolute_error, median_absolute_error |
| R-squared | R-squared value | from sklearn.metrics import r2_score |
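Similarly, a minimal sketch of the regression metrics (the y_true and y_pred arrays are made-up values for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # assumed ground-truth values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])   # assumed predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE is the square root of MSE
mae = mean_absolute_error(y_true, y_pred)
medae = median_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mse, rmse, mae, medae, r2)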
ROC and AUC definitions
ROC stands for Receiver Operating Characteristic. The area under the ROC curve is the AUC (Area Under the Curve). AUC is used to measure the performance (generalization ability) of a machine learning algorithm on a binary classification problem.
Key concepts needed to calculate the ROC
First, let's explain some concepts that come up in binary classification problems: True Positive, False Positive, True Negative, and False Negative. They are distinguished by the combination of the true category and the predicted category.
Suppose we have a batch of test samples with only two categories: positive and negative. Picture the usual confusion-matrix layout: the left half of the predicted categories is positive and the right half is negative, while the truly positive samples sit in the upper part and the truly negative samples in the lower part. A predicted positive is marked P (Positive) and a predicted negative N (Negative); when the prediction agrees with the true value it is marked T (True), and when it disagrees it is marked F (False).
TP: the predicted category is P (positive) and the true category is P
FP: the predicted category is P and the true category is N (negative)
TN: the predicted category is N and the true category is N
FN: the predicted category is N and the true category is P
The total number of truly positive samples is TP + FN. TPR is the True Positive Rate: TPR = TP / (TP + FN).
Similarly, the total number of truly negative samples is FP + TN. FPR is the False Positive Rate: FPR = FP / (TN + FP).
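As a quick check of these two formulas, here is a tiny sketch with assumed counts (the numbers are illustrative only, not from the article):

# Assumed counts from some hypothetical classifier
TP, FN, FP, TN = 8, 2, 3, 7

TPR = TP / (TP + FN)  # true positive rate  = 8 / 10 = 0.8
FPR = FP / (TN + FP)  # false positive rate = 3 / 10 = 0.3
print(TPR, FPR)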
There is also a concept called the "truncation point" (threshold). After scoring the test samples, a machine learning algorithm can output the probability that each sample belongs to a certain category. For example, if sample t1 has a probability of 0.3 of belonging to class P, and we usually decide that a probability below 0.5 means the sample belongs to class N, then 0.5 is the truncation point.
To sum up, the three most important concepts for the ROC are TPR, FPR, and the truncation point.
As the truncation point takes different values, the resulting TPR and FPR change. Plotting the (FPR, TPR) pairs obtained at the different truncation points in a two-dimensional coordinate system gives the ROC curve, with FPR on the horizontal axis and TPR on the vertical axis.
Calculating the ROC with sklearn
sklearn gives an example of the ROC calculation [1].
import numpy as np
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
The calculation yields the following results (FPR, TPR, truncation points):
fpr = array([0. , 0.5, 0.5, 1. ])
tpr = array([0.5, 0.5, 1. , 1. ])
thresholds = array([0.8, 0.4, 0.35, 0.1])  # truncation points
Plotting the FPR and TPR values from the result in a two-dimensional coordinate system gives the ROC curve (the blue line in the figure), and the area under the ROC curve is the AUC (the light yellow shaded region).
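Since the figure itself is not reproduced here, the following sketch (matplotlib is assumed to be available) shows one way to draw the curve from the fpr/tpr arrays above and compute its area with sklearn's auc:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
roc_auc = metrics.auc(fpr, tpr)            # area under the curve, 0.75 for this data

plt.plot(fpr, tpr, label='ROC curve (AUC = %.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')   # diagonal = random guessing
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc='lower right')
plt.show()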
Detailed calculation process
The example above uses the following data:
y = np.array([1, 1, 2, 2])
scores = np.array([0.1, 0.4, 0.35, 0.8])
How are TPR and FPR calculated from this data?
1. Analyze the data
y is a one-dimensional array (the true classification of the samples). Its values represent the categories (there are two classes, 1 and 2). We take 1 in y to represent a negative example and 2 a positive example, and rewrite y as:
y_true = [0, 0, 1, 1]
scores is the probability that each sample belongs to the positive class.
2. Sort the data by score
| Sample | Predicted probability of P (score) | True category |
| --- | --- | --- |
| y[0] | 0.1 | N |
| y[2] | 0.35 | P |
| y[1] | 0.4 | N |
| y[3] | 0.8 | P |
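One way to produce this ordering programmatically (a small sketch using numpy's argsort; the variable names are mine):

import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8])
y_true = np.array([0, 0, 1, 1])

order = np.argsort(scores)   # indices that sort the scores in ascending order
print(scores[order])         # [0.1  0.35 0.4  0.8 ]
print(y_true[order])         # [0 1 0 1] -> N, P, N, P, as in the table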
3. Take each score value in turn as the truncation point
Compute TPR and FPR when the truncation point is, in turn, 0.1, 0.35, 0.4, and 0.8.
3.1 Truncation point 0.1
Note that any sample with a score >= 0.1 is predicted to be a positive example.
Since all 4 samples have a score greater than or equal to 0.1, every sample is predicted to be P.
scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [1, 1, 1, 1]
TPR = TP / (TP + FN) = 2 / (2 + 0) = 1
FPR = FP / (TN + FP) = 2 / (0 + 2) = 1
3.2 Truncation point 0.35
This means that any sample with a score >= 0.35 is predicted to be P.
Here, 3 of the 4 samples have a score greater than or equal to 0.35, so 3 samples are predicted to be P (2 correct predictions, 1 incorrect), and the remaining sample is predicted to be N (correctly).
scores = [0.1, 0.4, 0.35, 0.8]
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
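To close the loop, here is a small sketch (the variable names are mine, the data is the example above) that repeats the manual procedure for every candidate truncation point; it should reproduce the fpr and tpr arrays returned by roc_curve:

import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8])
y_true = np.array([0, 0, 1, 1])                    # 1 = positive example, 0 = negative example

for threshold in sorted(scores, reverse=True):
    y_pred = (scores >= threshold).astype(int)     # score >= threshold -> predict P
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(threshold, tp / (tp + fn), fp / (fp + tn))   # threshold, TPR, FPR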