Watermelon Book, Chapter II


Chapter 2: Model Evaluation and Selection

2.1 Empirical Error and Overfitting

The error of a learner on the training set is called the "training error" or "empirical error", and the error on new samples is called the "generalization error". Overfitting generally arises because the learner's capacity is so strong that it learns peculiarities of the training samples that do not hold in general, while underfitting is usually caused by insufficient learning capacity.

Underfitting is relatively easy to overcome, for example by growing more branches in decision tree learning or by increasing the number of training epochs in neural network learning. Overfitting, however, is troublesome and unavoidable: machine learning problems are usually NP-hard or even harder, yet an effective learning algorithm must complete in polynomial time. If overfitting could be avoided entirely, the optimal solution could be obtained simply by minimizing the empirical error, which would amount to a constructive proof that P = NP; so as long as we believe P ≠ NP, overfitting cannot be avoided.

2.2 Evaluation Methods

The test set should be mutually exclusive with the training set whenever possible.

"Set aside" divides dataset D into two mutually exclusive collections, one as the training set S and the other as the test set T. The sampling mode of the reserved class scale is usually called "stratified sampling", when using the retention method, it is generally necessary to use several random division and repeat the evaluation result of the experiment evaluation to take the average value as the method of retention, the problem of the retention method is that it is not good to determine the proportion of the training set.

In k-fold cross-validation, the dataset D is first divided into k mutually exclusive subsets of similar size, each subset keeping the data distribution as consistent as possible, i.e., obtained from D by stratified sampling. Each time, the union of k-1 subsets is used as the training set and the remaining subset as the test set, yielding k training/test splits; k rounds of training and testing are carried out, and the mean of the k test results is returned. The stability and fidelity of the cross-validation estimate depend to a great extent on the value of k. If the dataset D contains m samples and k = m, we obtain a special case of cross-validation called leave-one-out (LOO). The drawback of leave-one-out is that when the dataset is relatively large, the computational overhead of training m models may be unbearable.
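A minimal sketch of the procedure, assuming NumPy arrays and a user-supplied train_and_eval callback (both names are illustrative); for brevity it uses a plain random permutation rather than the stratified folds the text recommends:

    import numpy as np

    def kfold_cv(X, y, train_and_eval, k=10, seed=0):
        """Estimate generalization performance with k-fold cross-validation.
        train_and_eval(X_tr, y_tr, X_te, y_te) returns a test-set score;
        k=10 is a common but arbitrary choice. k = len(y) gives LOO."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        folds = np.array_split(idx, k)            # k mutually exclusive subsets
        scores = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_eval(X[train_idx], y[train_idx],
                                         X[test_idx], y[test_idx]))
        return np.mean(scores)                    # mean of the k test results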

The "self-help Method" is directly based on the self-service sampling method, given a data set containing M samples D, sampling it to produce a DataSet d ': Randomly pick a sample from D, copy it into d ', and then put the sample back into the original DataSet D, so that the sample will still be picked up at the next sample After repeating the above process m, a dataset containing M samples is obtained d ', and the self-help method is useful when the dataset is small and difficult to effectively divide the training \ Test set.

In general, the performance on the test set is used to estimate the model's generalization ability in actual use. For model selection and parameter tuning, the training data are further divided into a training set and a validation set, and models and parameters are chosen based on validation-set performance.

2.3 Performance Measures

A performance measure quantifies a model's generalization ability. The most common performance measure for regression tasks is the "mean squared error" (MSE): E(f; D) = (1/m) * Σ_{i=1..m} (f(x_i) - y_i)^2.
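As a quick worked example, a direct NumPy translation of the formula (the function name is illustrative):

    import numpy as np

    def mean_squared_error(y_true, y_pred):
        """MSE: average squared difference between predictions and targets."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return np.mean((y_pred - y_true) ** 2)

    # e.g. mean_squared_error([3.0, 2.0], [2.5, 2.5]) -> 0.25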

  Error rate and accuracy: the error rate is the proportion of misclassified samples among the total number of samples, and accuracy is the proportion of correctly classified samples among the total number of samples.

  Precision, recall, and F1: precision answers "how much of the retrieved information is of interest to the user", while recall answers "how much of the information that interests the user has been retrieved".

Ground truth         Predicted positive        Predicted negative
Positive             TP (true positive)        FN (false negative)
Negative             FP (false positive)       TN (true negative)

Precision P and recall R are defined as: P = TP / (TP + FP), R = TP / (TP + FN). The two are generally in tension: when precision is high, recall tends to be low, and when recall is high, precision tends to be low.
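These two definitions translate directly into code; a sketch assuming binary NumPy arrays with 1 = positive and 0 = negative (the function name is illustrative, and the zero-denominator case is ignored for brevity):

    import numpy as np

    def precision_recall(y_true, y_pred):
        """P = TP / (TP + FP), R = TP / (TP + FN) for binary 1/0 labels."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        return tp / (tp + fp), tp / (tp + fn)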

The area under the P-R curve reflects, to some extent, the proportion at which a learner achieves a relative "double high" in both precision and recall, but it is not easy to estimate. The break-even point (BEP) is therefore used: the value at which precision equals recall. More commonly used is the F1 measure, F1 = 2 * P * R / (P + R). In some applications, precision and recall receive different degrees of attention: a recommender system should as far as possible recommend content the user is interested in, so high precision is required, while a fugitive-surveillance system requires high recall. The general form of the F1 measure is F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 weights recall more heavily and beta < 1 weights precision more heavily.
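A sketch of the general measure in plain Python (the function name is illustrative); beta = 1 recovers F1:

    def f_beta(precision, recall, beta=1.0):
        """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
        beta > 1 favors recall, beta < 1 favors precision."""
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)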

  ROC and AUC: based on the predicted probabilities, test samples can be sorted so that the "most likely positive" examples come first and the "least likely positive" ones come last. Classification then amounts to placing a "cut point" in this ranking and splitting the samples into two parts, treating the first part as positive and the second as negative. Different tasks call for different cut points: if precision matters more, the cut point can be placed nearer the front of the ranking; if recall matters more, nearer the back. The ROC (Receiver Operating Characteristic) curve originated in radar signal analysis techniques used during World War II to detect enemy aircraft. The vertical axis of the ROC curve is the "true positive rate" (TPR) and the horizontal axis is the "false positive rate" (FPR):

TPR = TP / (TP + FN), FPR = FP / (TN + FP)

ROC curve plotting: given m+ positive and m- negative examples, sort the samples by the learner's predicted values, then set the classification threshold to the maximum, so that all samples are predicted negative; the true positive rate and false positive rate are both 0, so mark a point at (0, 0). Then set the threshold to each sample's predicted value in turn, each time classifying that sample as positive. Let the previously marked point have coordinates (x, y): if the current sample is a true positive, mark the next point at (x, y + 1/m+); if it is a false positive, mark it at (x + 1/m-, y). Finally, connect adjacent points with line segments.
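A sketch of this plotting procedure, together with a trapezoidal-rule AUC, assuming NumPy and 1/0 labels (all names are illustrative):

    import numpy as np

    def roc_points(scores, labels):
        """Trace the ROC curve as described above: sort by score, lower the
        threshold one sample at a time, stepping up by 1/m+ for each true
        positive and right by 1/m- for each false positive."""
        order = np.argsort(-np.asarray(scores, dtype=float))
        labels = np.asarray(labels)[order]         # 1 = positive, 0 = negative
        m_pos = labels.sum()
        m_neg = len(labels) - m_pos
        x, y = 0.0, 0.0
        points = [(x, y)]                          # threshold = max: all negative
        for lab in labels:
            if lab == 1:
                y += 1.0 / m_pos                   # true positive: move up
            else:
                x += 1.0 / m_neg                   # false positive: move right
            points.append((x, y))
        return points

    def auc(points):
        """Area under the ROC polyline via the trapezoidal rule."""
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))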

ROC curves are generally compared via the area under the ROC curve, i.e., the AUC (Area Under ROC Curve); the trapezoidal sum in the sketch above computes exactly this quantity.

2.5 Bias and Variance

For a test point x, the expected prediction of the learning algorithm is f̄(x) = E_D[f(x; D)], the expectation of its prediction over different training sets D of the same size.

The generalization error can be decomposed into the sum of bias, variance, and noise. Bias measures the deviation between the learning algorithm's expected prediction and the true result; it characterizes the fitting ability of the algorithm itself. Variance measures the change in performance caused by changes in training sets of the same size; it characterizes the effect of data perturbation. Noise expresses the lower bound of the expected generalization error that any learning algorithm can achieve on the current task; it characterizes the inherent difficulty of the learning problem. To achieve good generalization performance, one needs to make the bias small, that is, to fit the data well, and to make the variance small, that is, to keep the effect of data perturbation small.

In general, bias and variance are in conflict; this is known as the bias-variance dilemma.
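To make the conflict concrete, here is a small simulation sketch under toy assumptions (data drawn from sin(x) plus Gaussian noise, polynomial fits via np.polyfit; all choices are illustrative). Averaging many models per degree typically shows the squared bias falling and the variance rising as model capacity grows.

    import numpy as np

    rng = np.random.default_rng(0)
    x_test = np.linspace(0, np.pi, 50)         # fixed evaluation grid
    true_f = np.sin(x_test)                    # noiseless ground truth

    def fit_once(degree, m=30, noise=0.3):
        """Train on one random dataset of size m, predict on the test grid."""
        x = rng.uniform(0, np.pi, m)
        y = np.sin(x) + rng.normal(0, noise, m)
        coeffs = np.polyfit(x, y, degree)
        return np.polyval(coeffs, x_test)

    for degree in (1, 3, 9):
        preds = np.array([fit_once(degree) for _ in range(200)])
        mean_pred = preds.mean(axis=0)              # expected prediction f̄(x)
        bias2 = np.mean((mean_pred - true_f) ** 2)  # squared bias
        variance = np.mean(preds.var(axis=0))       # variance over training sets
        print(f"degree={degree}  bias^2={bias2:.3f}  variance={variance:.3f}")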
