2. Model Evaluation and Selection
2.1 Empirical error and overfitting
Different learning algorithms, and different parameter settings of the same algorithm, produce different models; choosing among them is the problem of model selection. Two concepts are central to it: empirical error and overfitting.
1) Empirical error
Error rate: the proportion of misclassified samples among all samples. If a of the m samples are misclassified, the error rate is E = a/m. Correspondingly, 1 - a/m is called the accuracy, that is, accuracy = 1 - error rate.
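As a minimal sketch of these two definitions (the function name `error_rate` and the toy labels are illustrative assumptions, not part of the original notes):

```python
def error_rate(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)  # a
    return wrong / len(y_true)                                # a / m


# Toy data: m = 5 samples, a = 1 misclassified.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
err = error_rate(y_true, y_pred)
acc = 1 - err
print(err, acc)
```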
Error: the difference between the learner's actual predicted output and the true output of the sample. Training error (or empirical error) is the learner's error on the training set; generalization error is the learner's error on new samples.
Ideally, the smaller the generalization error the better. In practice, however, the new samples are unknown in advance, so all we can do is minimize the empirical error on the training set.
So is the classifier with the smallest training error, perhaps even 100% accuracy on the training set, the best predictor for new samples?
Not necessarily. We can design a classifier that is perfect on a known training set, but the new samples are unknown, so a learner (model) that behaves well on the training set does not necessarily behave as well on new samples.
2) Overfitting
The learner should first learn from the training samples the general laws that apply to all potential samples, and then use them to predict the classes of new samples correctly. Two things can go wrong along the way, and either one means that a learner that behaves well on the training set does not necessarily perform well on new samples.
Overfitting: the learner treats peculiarities of the training samples as general properties of all samples, which degrades generalization performance.
Underfitting: the learner fails to learn the general properties of all samples from the training samples.
In plain terms, overfitting generalizes from individual quirks of the training samples, while underfitting fails to capture the general features at all. One goes too far and the other falls short: overfitting usually reflects learning ability that is too strong, underfitting learning ability that is too weak.
Underfitting can be overcome by adjusting the model parameters, but overfitting cannot be avoided completely, only mitigated. The problems faced in machine learning are usually NP-hard or harder, while an effective learning algorithm must finish in polynomial time. If overfitting could be avoided completely, minimizing the empirical error would yield the optimal solution, which would amount to a constructive proof that P = NP. So as long as we believe P≠NP, overfitting is unavoidable.
To sum up, model selection would ideally evaluate the generalization error of each candidate model and choose the one with the smallest. But the generalization error cannot be obtained directly, and the training error is not a suitable criterion because of overfitting. So how should models be evaluated and selected?
2.2 Evaluation methods
Since neither the generalization error nor the training error can serve as the evaluation criterion, we use the test error instead. The test error is obtained by setting aside a test set to measure the learner's ability to predict new samples, and it serves as an approximation of the generalization error.
The test set should also be sampled independently from the true sample distribution, and it should be mutually exclusive with the training set. Evaluating the model by its error on the test set is a reasonable approximation of the generalization error. Split the data set D = {(x1, y1), (x2, y2), ..., (xm, ym)} into a training set S and a test set T, build the model from the training set, and evaluate it on the test set. A useful analogy: the training set corresponds to practice exercises, and the test set to the exam questions.
The question now is how to evaluate the model by the test error. The key issue is how to divide the data set D into a training set and a test set.
1) Hold-out method
The hold-out method divides the data set D into two mutually exclusive sets, one used as the training set S and the other as the test set T, that is, D = S ∪ T and S ∩ T = ∅. The model is trained on S, and its test error on T is used as an estimate of the generalization error.
Take a binary classification task as an example. Suppose D contains 1000 samples, split into a training set S of 700 samples and a test set T of 300 samples. If the model trained on S misclassifies 90 samples of T, the test error is 90/300 = 30% and the accuracy is 1 - 30% = 70%.
The hold-out method splits the data set into two parts, and two questions matter: how to split, and in what proportion?
How to split? The split should keep the data distribution consistent between the training set and the test set. Stratified sampling achieves this by keeping the class proportions similar, so that each class is distributed across S and T in nearly the same way. For example, if class A is split S:T = 7:3, then class B should also be split close to 7:3.
Even with stratified sampling, different ways of partitioning lead to different training and test sets. A single hold-out estimate is therefore not stable or reliable enough; the usual practice is to repeat the random split several times and report the average of the results.
In what proportion? If S is too large and T too small, S is close to D and the trained model is close to the one trained on D, but with a small T the evaluation result may be neither stable nor accurate. If S is too small and T too large, the gap between S and D is large, and the model trained on S may differ substantially from the one trained on D, which reduces the fidelity of the evaluation. There is no perfect answer; a common practice is a 2:8 split, with about 80% of the samples used for training and 20% for testing.
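The stratified hold-out split described above can be sketched with the standard library alone. The helper name `stratified_holdout`, the 600/400 class balance and the fixed seed are assumptions made for illustration; the 7:3 ratio follows the example in the text.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx) with class proportions kept close in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # random partition within a class
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# 1000 samples: 600 of class "A" and 400 of class "B" (hypothetical balance).
labels = ["A"] * 600 + ["B"] * 400
train_idx, test_idx = stratified_holdout(labels, test_ratio=0.3)
print(len(train_idx), len(test_idx))
```

Calling the function with several different seeds and averaging the resulting test errors corresponds to the repeated random splits mentioned above.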
2) Cross-validation method
Cross-validation divides the data set D into k mutually exclusive subsets of similar size, that is, D = D1 ∪ D2 ∪ ... ∪ Dk with Di ∩ Dj = ∅ (i ≠ j). Each subset Di keeps the data distribution as consistent as possible, i.e., it is obtained from D by stratified sampling. In each round, the union of k-1 subsets is used as the training set and the remaining subset as the test set. This gives k training/test pairs for k rounds of training and testing, and the mean of the k test results is returned. The value of k largely determines the stability and fidelity of the evaluation, so the method is usually called k-fold cross-validation. Common values of k are 10, 5 and 20; in 10-fold cross-validation, for example, each of the 10 subsets serves as the test set once while the other 9 form the training set.
As with the hold-out method, there are many ways to divide D into k subsets. To reduce the differences introduced by a particular partition, k-fold cross-validation is usually repeated p times with different random partitions, and the final result is the mean of the p k-fold results, for example 10 repetitions of 10-fold cross-validation.
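A rough sketch of the k-fold procedure follows. For brevity it shuffles the indices instead of stratifying by class, which is a simplification of what the text recommends; `k_fold_indices`, `cross_val_mean` and the `evaluate` callback are hypothetical names.

```python
import random


def k_fold_indices(m, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation on m samples."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_mean(m, k, evaluate, seed=0):
    """Average the result of evaluate(train_idx, test_idx) over the k folds."""
    results = [evaluate(tr, te) for tr, te in k_fold_indices(m, k, seed)]
    return sum(results) / len(results)


splits = list(k_fold_indices(m=20, k=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Repeating `cross_val_mean` with p different seeds and averaging the p results gives the "p times k-fold" scheme; setting k = m makes every test fold a single sample.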
A special case of cross-validation is leave-one-out (LOO): if the data set D contains m samples, let k = m, so that each subset contains exactly one sample. This special case is not affected by how the samples are randomly partitioned, and the training set S is only one sample smaller than D, so the model actually trained is very similar to the model that would be trained on D, and the evaluation result is considered relatively accurate. The drawback is that one model must be trained per sample: when the data set is large, the training cost becomes prohibitive, and the estimate is not guaranteed to be more accurate than those of other methods.
In fact this holds for all algorithms: each has advantages and shortcomings and is suited to particular situations, in line with cost-effectiveness. Leave-one-out, for example, improves accuracy in theory at a clear and large computational cost, and whether that trade is worthwhile depends on the situation.
3) Bootstrap method
In the hold-out and cross-validation methods, the training set S contains fewer samples than the data set D, so the evaluated model differs from the one trained on D and the estimate is biased by the difference in sample size. Leave-one-out loses only one sample, but its computational cost is huge. Is there a way to avoid the effect of a reduced training set while still computing efficiently?
The bootstrap method obtains the training set and test set by bootstrap sampling (sampling with replacement). Given a data set D containing m samples, a data set D' is generated as follows: randomly pick a sample from D, copy it into D', and put it back into D so that it may be picked again in a later draw. After repeating this m times we obtain D', which contains m samples, the same size as D. The difference is that some samples of D appear more than once in D' and some do not appear at all. The probability that a given sample is never picked in m draws is (1 - 1/m)^m, and taking the limit gives:
$$\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m} = \frac{1}{e} \approx 0.368$$
That is, with bootstrap sampling, about 36.8% of the samples in the initial data set D do not appear in D'. We then use D' as the training set (S = D') and the samples of D that are not in D' as the test set (T = D \ D'). In this way the model actually evaluated (trained on S) and the model we want to evaluate (trained on D) both use m training samples, while about 36.8% of the samples, none of which appear in the training set, remain available for testing. The resulting test result is called the out-of-bag estimate.
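The sampling procedure and the 36.8% figure can be checked numerically with a small sketch. The function name `bootstrap_split`, the sample size m = 10000 and the seed are illustrative assumptions.

```python
import math
import random


def bootstrap_split(m, seed=0):
    """Draw m indices with replacement; unseen indices form the out-of-bag test set."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(m) for _ in range(m)]   # D', may contain repeats
    oob_idx = sorted(set(range(m)) - set(train_idx))    # D \ D'
    return train_idx, oob_idx


m = 10000
train_idx, oob_idx = bootstrap_split(m)
oob_fraction = len(oob_idx) / m
never_picked = (1 - 1 / m) ** m   # probability that a given sample is never drawn

print(oob_fraction, never_picked, 1 / math.e)
```

The three printed numbers should all be close to 0.368: the observed out-of-bag fraction, the exact probability for this m, and the limit 1/e.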
Every method has its own use cases; none is universally applicable and efficient. The data set D' produced by bootstrap sampling changes the distribution of the initial data set D, which introduces estimation bias. The bootstrap method is useful when the data set is small and hard to divide effectively into training and test sets. When enough data is available, the hold-out and cross-validation methods are more commonly used.
To evaluate the generalization performance of a model (learner), the test error is used to approximate the generalization error, and this leads to the data set partitioning methods above: hold-out, cross-validation and bootstrap. In practice the algorithm also needs to be tuned, because different parameter configurations give models of different performance. So in model evaluation and selection, besides choosing the learning algorithm and the partitioning method, the algorithm's parameters must also be set or tuned. Each algorithm has its own parameter space. Suppose an algorithm has 3 parameters with 5 candidate values each; then for every training/test split there are 5^3 = 125 models to examine.
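To make the count concrete, the following sketch enumerates such a parameter grid. The parameter names and candidate values are hypothetical placeholders; only the 3 x 5 shape comes from the text.

```python
from itertools import product

# Three hypothetical parameters with five candidate values each.
param_grid = {
    "param_a": [1, 2, 3, 4, 5],
    "param_b": [0.1, 0.2, 0.3, 0.4, 0.5],
    "param_c": ["v1", "v2", "v3", "v4", "v5"],
}

names = list(param_grid)
configs = [dict(zip(names, values)) for values in product(*param_grid.values())]
print(len(configs))  # 5 ** 3 = 125
```

Each entry of `configs` is one candidate model to train and evaluate on every training/test split, which is why exhaustive tuning becomes expensive quickly.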
In the actual learning process, for a given data set D containing m samples, candidate learning algorithms and parameter settings are trained and tested on splits of D until the algorithm and parameters have been selected; the model is then retrained on the whole data set D. When studying the generalization ability of different algorithms, the test set is used to estimate how the model will perform in practice, while the training data is further divided into a training set and a validation set, and model selection and parameter tuning are based on performance on the validation set.
To recap: 1) the data is divided into a training set, a validation set and a test set; 2) the training set is used to train the model, the validation set for model selection and parameter tuning, and the test set to approximate the generalization error; 3) the evaluation methods are hold-out, cross-validation and bootstrap.
2.3 Performance measures
Evaluation methods tell us how to obtain an estimate of a learner's generalization performance; a performance measure is also needed to quantify it. In other words, the training, validation and test sets determine which model is selected and output, and the actual generalization ability of that model on the test set has to be quantified by a performance measure. Once the test error has been accepted as an approximation of the generalization error, the model obtained by partitioning the data, choosing an evaluation method and tuning the algorithm's parameters still has to be assessed quantitatively with such a measure.
Different performance measures can lead to different judgments when comparing models, so whether a model is good or bad is relative. It depends not only on the algorithm and the data, the parameter settings and the experimental evaluation method, but also on the requirements of the task at hand.
For model evaluation we are given a sample set D = {(x1, y1), (x2, y2), ..., (xm, ym)}, where yi is the true label of example xi. Evaluating the performance of the learner f means comparing its prediction f(x) with the true label y.
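The most common performance measure for regression tasks is the mean squared error:
$$E(f; D) = \frac{1}{m} \sum_{i=1}^{m} \left(f(x_i) - y_i\right)^2$$
For binary classification, each sample falls into one of four cases according to its true class and its predicted class, which gives the confusion matrix:
| Real situation | Predicted positive | Predicted negative |
| --- | --- | --- |
| Positive example | TP (true positive) | FN (false negative) |
| Negative example | FP (false positive) | TN (true negative) |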
Precision P and recall R are defined as:
P = TP / (TP + FP)
R = TP / (TP + FN)
Precision and recall are conflicting measures: in general, when precision is high, recall tends to be low, and when recall is high, precision tends to be low. In information retrieval, precision is the fraction of the retrieved information that the user is actually interested in, and recall is the fraction of the information the user is interested in that is actually retrieved. Besides TP, the denominator of precision contains the information that is not of interest to the user but was predicted to be of interest and retrieved (FP); the denominator of recall contains the information that is of interest to the user but was predicted not to be and discarded (FN).
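A small sketch of the two definitions, using made-up retrieval counts. Returning 0.0 when a denominator is zero is a convention chosen here, not something the notes specify.

```python
def precision_recall(tp, fp, fn):
    """Return (P, R) computed from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical retrieval result: 40 relevant items retrieved (TP),
# 10 irrelevant items retrieved (FP), 60 relevant items missed (FN).
p, r = precision_recall(tp=40, fp=10, fn=60)
print(p, r)
```

Here most of what was retrieved is relevant (high precision), but most relevant items were missed (low recall), which illustrates the tension between the two measures.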
The samples can be sorted by the learner's predictions, with the sample the learner considers most likely to be positive first and the one it considers least likely last. Going down this ranking and predicting the samples as positive one at a time, we can compute the current recall and precision at each step. Plotting precision on the vertical axis against recall on the horizontal axis gives the precision-recall curve, or PR curve for short.
The PR curve is non-monotonic and not smooth. It can be used to compare learners: if the PR curve of one learner is completely enclosed by that of another, the latter performs better. If the two curves cross, one can compare the areas under them, since a larger area means that precision and recall are both high over a larger range; but the area under a non-smooth curve is not easy to compute, so the break-even point (BEP) is used instead. The BEP is the point on the curve where precision equals recall; the larger its value, the better the learner.
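The construction of the PR curve and the break-even point can be sketched as below. The scores and labels are invented, tied scores are not handled, and `break_even_point` simply returns the curve point where precision and recall are closest, which is an approximation chosen for this illustration.

```python
def pr_curve(scores, labels):
    """Return [(recall, precision), ...] as the cut-off moves down the ranking."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    points, tp = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label                         # label is 1 for positive, 0 for negative
        points.append((tp / total_pos, tp / rank))
    return points


def break_even_point(points):
    """Return the point where precision and recall are closest to equal."""
    return min(points, key=lambda pt: abs(pt[0] - pt[1]))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = pr_curve(scores, labels)
bep = break_even_point(points)
print(points)
print(bep)
```

With three positives in the data, precision and recall coincide when exactly the top three samples are predicted positive, so the break-even point here is (2/3, 2/3).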
Consider a simple diagram with three PR curves, where the red dots mark the BEP of each curve. The curve of learner A is completely enclosed by the curve of learner C, so C is better than A. The curves of C and B cross, and their areas are hard to compare, but the BEP of C is larger than that of B, so C is considered the better of the two.
The BEP is rather simplistic, so the F1 measure is more commonly used to compare learners:
F1 = 2 * P * R / (P + R) = 2 * TP / (total number of samples + TP - TN)
Different applications place different emphasis on precision and recall. In product recommendation, for example, we want to disturb the user as little as possible and recommend only content the user is interested in, so precision matters more. When retrieving information about fugitives, we want to miss as few fugitives as possible, so recall matters more. Different preferences between precision and recall can be expressed with Fβ, the general form of the F1 measure, defined as:
Fβ = (1 + β^2) * P * R / ((β^2 * P) + R)
β > 0 measures the relative importance of recall with respect to precision. β = 1 gives the standard F1; β > 1 favours recall, and β < 1 favours precision.
F1 is defined as the harmonic mean of precision and recall:
1/F1 = (1/2) * (1/P + 1/R)
Fβ is the weighted harmonic mean:
1/Fβ = 1/(1 + β^2) * (1/P + β^2/R)
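The formulas for F1 and Fβ can be checked against each other numerically. In this sketch the function name `f_beta`, the example values P = 0.8 and R = 0.4, and the confusion-matrix counts are assumptions chosen so that the different forms can be compared.

```python
def f_beta(p, r, beta=1.0):
    """F-beta score: weighted harmonic mean of precision p and recall r."""
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


p, r = 0.8, 0.4
f1 = f_beta(p, r)                 # beta = 1 gives the standard F1
f2 = f_beta(p, r, beta=2.0)       # beta > 1 favours recall
f_half = f_beta(p, r, beta=0.5)   # beta < 1 favours precision

# F1 agrees with the harmonic-mean form 1/F1 = (1/2) * (1/P + 1/R).
harmonic = 1 / (0.5 * (1 / p + 1 / r))

# F1 from counts: TP=40, FP=10, FN=60, TN=90 give P=0.8 and R=0.4.
tp, fp, fn, tn = 40, 10, 60, 90
f1_counts = 2 * tp / ((tp + fp + fn + tn) + tp - tn)
print(f1, f2, f_half, harmonic, f1_counts)
```

Because recall is the weaker of the two values in this example, weighting recall more heavily (β = 2) lowers the score, and weighting precision more heavily (β = 0.5) raises it.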
Compared with the arithmetic mean (P + R)/2 and the geometric mean sqrt(P * R), the harmonic mean gives more weight to the smaller value.
Geometric mean: the n-th root of the product of the n values.
Arithmetic mean: the sum of the values divided by the number of values.
Harmonic mean: the number of values divided by the sum of their reciprocals, i.e., the reciprocal of the arithmetic mean of the reciprocals.
Quadratic mean: the square root of the sum of squares of the values divided by the number of values.
For the same positive data, harmonic mean ≤ geometric mean ≤ arithmetic mean ≤ quadratic mean.
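The ordering of the four means can be verified on a small example. The helper names and the data values are illustrative; all values are assumed to be positive.

```python
import math


def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)


def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def quadratic_mean(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))


data = [0.8, 0.4]  # e.g. a precision value and a recall value
h, g, a, q = (f(data) for f in (harmonic_mean, geometric_mean,
                                arithmetic_mean, quadratic_mean))
print(h, g, a, q)
```

The harmonic mean is the smallest of the four, which is why F1 penalizes a learner whose precision and recall are far apart.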
Sometimes we end up with several binary confusion matrices: when training and testing are repeated several times, when they are carried out on several data sets, or when a multi-class task is evaluated with one confusion matrix for each pair of classes. We then want to estimate the global performance of the algorithm, that is, to combine precision and recall over n binary confusion matrices. There are two ways to do this. One is to compute P and R on each matrix and then average them, which gives macro-precision and macro-recall. The other is to average TP, FP, TN and FN across the matrices first and then compute P and R from the averages, which gives micro-precision and micro-recall.
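The difference between the two averaging schemes shows up clearly when the matrices differ in size. The sketch below uses two invented matrices given as (TP, FP, FN) and assumes no denominator is zero; `macro_micro` is a hypothetical helper name.

```python
def macro_micro(matrices):
    """matrices: list of (TP, FP, FN) counts, one per binary confusion matrix."""
    n = len(matrices)

    # Macro: compute P and R per matrix, then average them.
    macro_p = sum(tp / (tp + fp) for tp, fp, fn in matrices) / n
    macro_r = sum(tp / (tp + fn) for tp, fp, fn in matrices) / n

    # Micro: average the counts first, then compute P and R once.
    tp_avg = sum(c[0] for c in matrices) / n
    fp_avg = sum(c[1] for c in matrices) / n
    fn_avg = sum(c[2] for c in matrices) / n
    micro_p = tp_avg / (tp_avg + fp_avg)
    micro_r = tp_avg / (tp_avg + fn_avg)
    return (macro_p, macro_r), (micro_p, micro_r)


# Two hypothetical confusion matrices: a large accurate one and a small poor one.
(macro_p, macro_r), (micro_p, micro_r) = macro_micro([(90, 10, 10), (1, 1, 9)])
print(macro_p, macro_r, micro_p, micro_r)
```

The macro averages treat both matrices equally, so the small, poorly classified one pulls them down; the micro averages are dominated by the matrix with more samples.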
3) ROC and AUC
Many learners produce a real-valued or probabilistic prediction for a test sample and compare it with a classification threshold: if the prediction is greater than the threshold the sample is classified as positive, otherwise as negative. Sorting the test samples by these predictions puts the samples most likely to be positive first and those least likely to be positive last. Classification then amounts to choosing a cut point in this ranking: the samples before it are predicted positive, and those after it negative.
Different tasks call for different cut points: cut early in the ranking if precision matters more, and late if recall matters more. The quality of the ranking itself therefore reflects the expected generalization performance of the learner across different tasks, or its generalization performance in the general case. The ROC curve is a performance measure that takes this view, and it suits learners that produce real-valued or probabilistic predictions.
ROC stands for Receiver Operating Characteristic. As with the PR curve, the samples are sorted by the learner's predictions and predicted as positive one by one in that order; at each step two values are computed and used as the horizontal and vertical coordinates, which gives the ROC curve.
The vertical axis of the ROC curve is the true positive rate (TPR), and the horizontal axis is the false positive rate (FPR), defined as:
True positive rate: TPR = TP / (TP + FN), the ratio of true positives to all actual positives, i.e., how many of the actual positives are predicted correctly.
False positive rate: FPR = FP / (TN + FP), the ratio of false positives to all actual negatives, i.e., how many of the actual negatives are wrongly predicted as positive.
The diagonal of the ROC plot corresponds to a random-guessing model, while the point (0, 1) corresponds to the ideal model that ranks all positive examples before all negative examples: its true positive rate is 100% and its false positive rate is 0.
In practice the sample is finite, so a smooth ROC curve cannot be produced, only an approximate, jagged one. The plotting procedure for a finite sample is as follows. Given m+ positive examples and m- negative examples, sort the samples by the learner's predictions. First set the classification threshold to the maximum, so that all samples are predicted negative; the true positive rate and false positive rate are then both 0, and a point is marked at (0, 0). Then set the threshold to the predicted value of each sample in turn, so that one more sample is predicted positive each time, and mark a point at the resulting coordinates. If the previous point is (x, y), the new point is (x, y + 1/m+) when the current sample is a true positive and (x + 1/m-, y) when it is a false positive. Finally, connect adjacent points with line segments to obtain the approximate ROC curve.
How, then, are learners compared using ROC curves? As with the PR curve, if the ROC curve of one learner is completely enclosed by that of another, the latter performs better. If the two curves cross, we compare the area under the ROC curve, that is, the AUC (Area Under ROC Curve).
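The finite-sample plotting procedure and the area under the resulting polyline can be sketched as follows. The scores and labels are invented, tied scores are assumed not to occur, and the area is computed with a trapezoidal sum over adjacent points, which is one straightforward way to add up the parts under the curve.

```python
def roc_points(scores, labels):
    """Return ROC points [(FPR, TPR), ...] starting at (0, 0); no tied scores."""
    m_pos = sum(labels)
    m_neg = len(labels) - m_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    x, y = 0.0, 0.0
    points = [(x, y)]
    for _, label in ranked:
        if label == 1:
            y += 1 / m_pos      # true positive: step up by 1/m+
        else:
            x += 1 / m_neg      # false positive: step right by 1/m-
        points.append((x, y))
    return points


def auc(points):
    """Area under the polyline through the ROC points (trapezoidal sum)."""
    return sum(
        0.5 * (x2 - x1) * (y1 + y2)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points[-1], auc(points))
```

For this toy ranking the area is 8/9: of the 9 positive/negative pairs, 8 are ordered correctly, with only the negative scored 0.7 ranked above the positive scored 0.6.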
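The AUC can be obtained by summing the areas of the parts under the ROC curve. Assuming the ROC curve is formed by connecting the points {(x1, y1), (x2, y2), ..., (xm, ym)} in order, with x1 = 0 and xm = 1, the AUC is estimated as:
$$\mathrm{AUC} = \frac{1}{2} \sum_{i=1}^{m-1} (x_{i+1} - x_i)(y_i + y_{i+1})$$
4) Cost-sensitive error rate and cost curve
Different types of error can have different costs. For a binary task, a cost matrix can be set up in which cost01 is the cost of predicting a class 0 sample as class 1 and cost10 is the cost of predicting a class 1 sample as class 0:
| Real category | Predicted class 0 | Predicted class 1 |
| --- | --- | --- |
| Class 0 | 0 | cost01 |
| Class 1 | cost10 | 0 |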
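As the table above shows, under unequal costs the goal of evaluation is no longer to minimize the number of errors but to minimize the overall cost. Take class 0 in the table as the positive class and class 1 as the negative class, let D+ and D- denote the sets of positive and negative examples, and let I(·) be the indicator function, which is 1 when its argument is true and 0 otherwise. The cost-sensitive error rate is then:
$$E(f; D; cost) = \frac{1}{m} \left( \sum_{x_i \in D^{+}} \mathbb{I}\left(f(x_i) \neq y_i\right) \cdot cost_{01} + \sum_{x_i \in D^{-}} \mathbb{I}\left(f(x_i) \neq y_i\right) \cdot cost_{10} \right)$$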
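A minimal sketch of the cost-sensitive error rate, with class 0 as the positive class and class 1 as the negative class as in the table. The labels and the cost values 5.0 and 1.0 are made up for illustration.

```python
def cost_sensitive_error(y_true, y_pred, cost01, cost10):
    """Average misclassification cost; class 0 is positive, class 1 is negative."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 1:
            total += cost01      # class 0 sample predicted as class 1
        elif t == 1 and p == 0:
            total += cost10      # class 1 sample predicted as class 0
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 0, 1, 1]
e_cost = cost_sensitive_error(y_true, y_pred, cost01=5.0, cost10=1.0)
e_plain = cost_sensitive_error(y_true, y_pred, cost01=1.0, cost10=1.0)
print(e_cost, e_plain)
```

With equal unit costs the measure reduces to the ordinary error rate (3 errors out of 8); with cost01 = 5 the single missed positive dominates the result.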
In the cost curve, FPR denotes the false positive rate and FNR = 1 - TPR the false negative rate. The cost curve is plotted as follows. Each point on the ROC curve corresponds to a line segment on the cost plane: if the point has coordinates (FPR, TPR), compute the corresponding FNR and draw a segment from (0, FPR) to (1, FNR) on the cost plane. The area under the segment represents the expected overall cost under that condition. Converting every point on the ROC curve into a segment in this way and taking the lower envelope of all the segments, the area enclosed under the envelope is the expected overall cost of the learner over all conditions.
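The lower-envelope construction can be approximated numerically. This sketch treats the horizontal axis of the cost plane simply as a coordinate t in [0, 1], evaluates every segment on a regular grid, and integrates the pointwise minimum with the trapezoidal rule. The function name, the grid size and the three ROC points are assumptions made for illustration.

```python
def expected_total_cost(roc, steps=1000):
    """Area under the lower envelope of the cost segments of the given ROC points."""
    def envelope(t):
        # Each ROC point (FPR, TPR) gives the segment from (0, FPR) to (1, FNR).
        return min(fpr * (1 - t) + (1 - tpr) * t for fpr, tpr in roc)

    ts = [i / steps for i in range(steps + 1)]
    heights = [envelope(t) for t in ts]
    # Trapezoidal rule over the grid.
    return sum(0.5 * (heights[i] + heights[i + 1]) / steps for i in range(steps))


roc = [(0.0, 0.0), (0.2, 0.8), (1.0, 1.0)]   # hypothetical (FPR, TPR) points
area = expected_total_cost(roc)
print(round(area, 4))
```

In this example the three segments are the rising line t, the flat line 0.2 and the falling line 1 - t, so the envelope is a trapezoid-like shape with area 0.16.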
5) Summary
A performance measure is a reference for selecting a learner. The measures discussed above can be classified as follows:
| Category | Equal cost | Unequal cost |
| --- | --- | --- |
| Learners that predict classes | PR curve and F1 | Cost-sensitive error rate |
| Learners that predict real values | ROC curve and AUC | Cost curve |
The familiar error rate and accuracy are also performance measures for learners that predict classes. In practice, different tasks need to be measured by different measures, and the emphasis placed on each measure differs as well.
2.4 Comparison test
We now have evaluation methods based on the test set and performance measures that quantify generalization ability. Can learners then be compared simply by comparing the values of their performance measures? The answer is not that simple. What is measured is performance on a test set, which depends on the choice of the test set itself, and many machine learning algorithms are themselves randomized. Statistical hypothesis testing provides an important basis for comparing the performance of learners.
Based on the result of a hypothesis test, if learner A is observed to be better than learner B on the test set, we can infer whether the generalization performance of A is better than that of B in the statistical sense, and how confident we can be in that conclusion. In other words, the evaluation and measurement made on the test set is further examined in a statistical sense. In what follows the error rate, denoted E, is used as the performance measure to introduce hypothesis testing.
1) Hypothesis test for a single learner
The hypothesis in hypothesis testing is a judgment or conjecture about the distribution of the learner's generalization error rate. In practice we only know the test error rate, not the generalization error rate, and the two may differ. Intuitively, however, the two are likely to be close and unlikely to be far apart, so the distribution of the generalization error rate can be inferred from the test error rate.
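In fact, the evaluation methods and performance measures above rest precisely on this statistical intuition; hypothesis testing is used here to confirm further how close the test error rate is to the generalization error rate.
When several algorithms are compared on several data sets, the algorithms can be ranked by performance on each data set, with tied algorithms sharing the average of their ranks, and the average rank of each algorithm is then computed:
| Data set | Algorithm A | Algorithm B | Algorithm C |
| --- | --- | --- | --- |
| D1 | 1 | 2 | 3 |
| D2 | 1 | 2.5 | 2.5 |
| D3 | 1 | 2 | 3 |
| D4 | 1 | 2 | 3 |
| Average ordinal value | 1 | 2.125 | 2.875 |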
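The average ordinal values in the table can be reproduced with a short ranking routine. The error rates below are hypothetical and were chosen only so that the per-data-set ranks match the table, including the tie on D2; `average_ranks` is an illustrative helper name.

```python
def average_ranks(values):
    """Rank values ascending (1 = best); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


# Hypothetical test error rates of algorithms A, B, C on data sets D1..D4.
errors = [
    [0.10, 0.20, 0.30],
    [0.10, 0.25, 0.25],
    [0.10, 0.20, 0.30],
    [0.10, 0.20, 0.30],
]
rank_rows = [average_ranks(row) for row in errors]
mean_ranks = [sum(col) / len(col) for col in zip(*rank_rows)]
print(mean_ranks)
```

The computed averages are 1, 2.125 and 2.875 for algorithms A, B and C, in agreement with the last row of the table.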
2.5 Bias and variance
Besides estimating generalization performance by experiment, we also want to explain why a learner has such performance. The bias-variance decomposition is an important tool for explaining the generalization performance of learning algorithms.
To review the overall flow of model evaluation and selection: model evaluation first faces empirical error and overfitting, so a test set is introduced and the test error rate is used to approximate the generalization error rate; evaluation methods are then presented and performance is quantified by performance measures; on this basis hypothesis tests are used to support performance comparison; and finally the performance is explained, that is, by the bias-variance decomposition.
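The bias-variance decomposition breaks down the expected generalization error of a learning algorithm. The results of an algorithm on different training sets are likely to differ, even when the training sets come from the same distribution. For a test sample x, let y_D be the label of x in the data set, y the true label of x, and f(x; D) the prediction on x of the model f learned from the training set D. Taking the regression task as an example, the expected prediction of the learning algorithm is:
$$\bar{f}(x) = \mathbb{E}_D\left[f(x; D)\right]$$
The variance over training sets of the same size, the noise and the bias are:
$$var(x) = \mathbb{E}_D\left[\left(f(x; D) - \bar{f}(x)\right)^2\right], \quad \varepsilon^2 = \mathbb{E}_D\left[(y_D - y)^2\right], \quad bias^2(x) = \left(\bar{f}(x) - y\right)^2$$
Assuming the noise has zero expectation, the expected generalization error decomposes exactly into the sum of bias, variance and noise:
$$E(f; D) = bias^2(x) + var(x) + \varepsilon^2$$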
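The decomposition can be checked exactly on a tiny discrete example. The numbers are invented: four equally likely training sets give four predictions for one test sample, and the label noise takes two equally likely values with zero mean, independently of the training set.

```python
from itertools import product

y = 2.0                              # true label of the test sample x
predictions = [1.5, 2.5, 3.0, 2.0]   # f(x; D) for four equally likely training sets D
label_noise = [-0.5, 0.5]            # y_D - y, equally likely, zero mean

f_bar = sum(predictions) / len(predictions)                  # expected prediction
bias2 = (f_bar - y) ** 2
variance = sum((f - f_bar) ** 2 for f in predictions) / len(predictions)
noise = sum(n ** 2 for n in label_noise) / len(label_noise)

# Expected squared error against the noisy label y_D = y + n, enumerated exactly.
pairs = list(product(predictions, label_noise))
expected_error = sum((f - (y + n)) ** 2 for f, n in pairs) / len(pairs)

print(expected_error, bias2 + variance + noise)
```

Both printed values are 0.625 (bias² = 0.0625, variance = 0.3125, noise = 0.25); the cross terms vanish because the noise has zero mean and is independent of the prediction.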
Bias measures how far the expected prediction of the learning algorithm deviates from the true result, that is, the fitting ability of the learning algorithm itself. Variance measures the change in learning performance caused by changes in training sets of the same size, that is, the impact of data disturbance. Noise expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task, that is, the difficulty of the learning problem itself. The bias-variance decomposition shows that generalization performance is determined jointly by the ability of the learning algorithm, the adequacy of the data and the difficulty of the learning task. For a given learning task, good generalization performance requires a small bias, so that the data is fitted adequately, and a small variance, so that the effect of data disturbance is small.
However, bias and variance are in conflict, which is known as the bias-variance dilemma. For a given learning task, suppose we can control the degree of training of the learning algorithm. When training is insufficient, the learner's fitting ability is not strong enough and disturbances in the training data are not enough to change the learner significantly, so bias dominates the generalization error rate. As training deepens, the learner's fitting ability gradually increases, disturbances in the training data are gradually learned by the learner, and variance gradually comes to dominate the generalization error rate. Once training is sufficient, the learner's fitting ability is very strong and slight disturbances in the training data cause significant changes in the learner; if peculiarities of the training data that are not global properties are learned by the learner, overfitting occurs.