Week 6: Learning Curves and Machine Learning System Design
Key Words
Learning curve, bias/variance diagnosis, error analysis, numerical evaluation of machine learning systems, the big-data rationale
Overview
This week's content is divided into two lectures:
First lecture: advice for applying machine learning. The main content is the diagnostic methods built around bias, variance, and the learning curve, which provide a basis for deciding how to improve a machine learning algorithm.
Second lecture: machine learning system design. The main content is the numerical evaluation criteria for machine learning algorithms: accuracy (cross-validation set error), precision, recall, and the F score, together with a recommended process for building a machine learning system.
============================== First Lecture ==============================
========= Diagnostics: Bias, Variance, and the Learning Curve ==========
(i) Model Selection
When evaluating hypothesis functions, we customarily split the entire sample 6:2:2: a 60% training set, a 20% cross-validation set, and a 20% test set, used respectively for fitting the hypothesis, selecting the model, and making the final prediction.
The error formulas for the three sets are shown in the following figure (note that no regularization term is included):
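As a concrete sketch (hypothetical code, not from the lecture), the same unregularized squared-error cost J(θ) = (1/2m) Σ (h(x) − y)² is simply evaluated on each of the three splits:

```python
import numpy as np

def squared_error(theta, X, y):
    """Unregularized cost J(theta) = (1/2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

# Hypothetical toy data: the first column of X is the intercept term.
theta = np.array([1.0, 2.0])                  # h(x) = 1 + 2x
X_train = np.array([[1.0, 0.0], [1.0, 1.0]])
y_train = np.array([1.0, 3.0])                # lies exactly on h
X_cv = np.array([[1.0, 2.0]])
y_cv = np.array([4.0])                        # h predicts 5, so the error is > 0

print(squared_error(theta, X_train, y_train))  # 0.0
print(squared_error(theta, X_cv, y_cv))        # 0.5
```

The same function is reused for training, cross-validation, and test error; only the data passed in changes.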
Based on this split, model selection proceeds in three steps:
Step1. Use the training set to train multiple models (e.g. a straight line, a quadratic curve, a cubic curve);
Step2. Evaluate the hypothesis functions obtained in Step1 on the cross-validation set, and select the model with the smallest cross-validation error;
Step3. Estimate the generalization error of the model chosen in Step2 on the test set.
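The three steps can be sketched in code. This is a hypothetical illustration on synthetic data, using numpy's polyfit for the candidate models (line, quadratic, cubic):

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model given by np.polyfit coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Hypothetical data: a quadratic signal plus noise, split 6:2:2.
x = rng.uniform(-3, 3, 100)
y = 1.0 + 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 0.3, 100)
x_train, y_train = x[:60], y[:60]
x_cv, y_cv = x[60:80], y[60:80]
x_test, y_test = x[80:], y[80:]

# Step 1: train one model per candidate degree on the training set only.
models = {d: np.polyfit(x_train, y_train, d) for d in (1, 2, 3)}

# Step 2: choose the degree with the smallest cross-validation error.
best_d = min(models, key=lambda d: mse(models[d], x_cv, y_cv))

# Step 3: report the chosen model's generalization error on the test set.
test_error = mse(models[best_d], x_test, y_test)
print(best_d, test_error)
```

Note that the test set is touched only once, at the end; using it to pick the degree would give an optimistically biased estimate of generalization error.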
Take linear regression as an example: suppose you minimize the cost function J(θ) to obtain a hypothesis h(x). How do you judge whether this hypothesis fits the sample well or badly? Is a hypothesis that passes through every point (minimizing J) necessarily ideal?
Put another way: given the sample points in the figure below, would you choose a straight line, a quadratic curve, a cubic curve, and so on, to fit them?
As the following diagram shows, your choice of model directly determines the quality of the final fit:
=======================================
Underfit    | high bias
Just right  | low bias and low variance
Overfit     | high variance
=======================================
The question above involves only one of the quantities considered during model selection: the polynomial degree d. In practice we also consider two more: the regularization parameter λ and the sample size m.
Below is a brief summary of how each of the three quantities (polynomial degree d, regularization parameter λ, sample size m) relates to the quality of the fit.
(ii) Bias, Variance, and the Learning Curve
1. Polynomial degree d
As in the earlier example, fitting with a quadratic curve may make both the training-set and cross-validation-set errors small. Fit with a straight line instead and, no matter how sophisticated the optimization that minimizes the cost function, the error remains large; in this case we say the polynomial degree d is too small, causing high bias and underfitting. Similarly, fitting with a degree-10 curve can pass through every sample point, so the training cost (error) is 0, but on the cross-validation set you will find the fit is very poor; in this case we say the polynomial degree d is too large, causing high variance and overfitting.
The relationship between the polynomial degree d and the training-set and cross-validation-set errors is therefore as follows:
2. Regularization parameter λ
We introduced the regularization parameter in week 3. The larger λ is, the more heavily the parameters θ are penalized: θ -> 0, the hypothesis approaches a horizontal line, and we get underfitting and high bias. The smaller λ is, the weaker the effect of regularization, giving overfitting and high variance. The relationship is shown in the following figure:
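The effect of λ can be sketched with the regularized normal equation for linear regression (hypothetical code; by the usual convention the intercept term is not penalized):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Regularized normal equation: solve (X'X + lam*L) theta = X'y,
    where L is the identity with the intercept entry zeroed
    (the bias term is conventionally not penalized)."""
    L = lam * np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + L, X.T @ y)

# Hypothetical data lying exactly on the line y = x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # intercept column + one feature
y = np.array([0.0, 1.0, 2.0])

theta_small = fit_ridge(X, y, 0.0)   # slope ~ 1: fits the data
theta_big = fit_ridge(X, y, 1e6)     # slope ~ 0: nearly horizontal line, underfit
print(theta_small, theta_big)
```

With a huge λ the slope is driven to zero and the hypothesis degenerates toward a horizontal line at the mean of y, exactly the high-bias picture described above.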
3. Sample size m and the learning curve
The learning curve plots the training-set and cross-validation-set errors as a function of the sample size m. It is analyzed separately for the high-bias and high-variance cases (underfitting and overfitting).
① High bias (underfitting):
As the analysis on the right side of the following figure shows, both errors remain large as the sample size increases; that is, increasing m does not help improve the algorithm.
② High variance (overfitting):
As the analysis on the right side of the following figure shows, the model fits the training samples very well (overfitting), and increasing the sample size m is likely to help improve the algorithm.
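A learning curve can be computed by training on increasing prefixes of the training set. The sketch below (hypothetical, with synthetic data) shows the high-bias case, where a straight line is fit to a curved signal:

```python
import numpy as np

rng = np.random.default_rng(1)

def mse(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Hypothetical curved signal fit with a straight line: the high-bias case.
x = rng.uniform(-3, 3, 200)
y = x ** 2 + rng.normal(0, 0.2, 200)
x_train, y_train = x[:150], y[:150]
x_cv, y_cv = x[150:], y[150:]

train_err, cv_err = [], []
for m in range(5, 151, 5):
    # Train on the first m examples only, then evaluate on those m and on the CV set.
    c = np.polyfit(x_train[:m], y_train[:m], 1)
    train_err.append(mse(c, x_train[:m], y_train[:m]))
    cv_err.append(mse(c, x_cv, y_cv))

# High bias: both curves plateau at a large error as m grows,
# so collecting more data does not help this model.
print(train_err[-1], cv_err[-1])
```

Plotting `train_err` and `cv_err` against m gives the learning curve; in the high-variance case the same loop would show a persistent gap between the two curves that narrows as m grows.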
★ (iii) How to make decisions
In summary, we arrive at the following conclusions:
Training-set error large and cross-validation-set error also large: underfitting, high bias; the polynomial degree d is too small or λ is too large;
Training-set error small but cross-validation-set error large: overfitting, high variance; the polynomial degree d is too large, λ is too small, or the sample size m is too small.
These conclusions give us a basis for deciding how to improve a machine learning algorithm.
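The two rules above can be phrased as a tiny helper (a hypothetical rule of thumb only; the threshold is arbitrary and would be problem-specific in practice):

```python
def diagnose(train_error, cv_error, threshold=0.1):
    """Rough diagnosis from training and cross-validation errors.
    Both large -> high bias; training small but CV large -> high variance."""
    if train_error > threshold and cv_error > threshold:
        return "high bias (underfitting): try a larger d or a smaller lambda"
    if train_error <= threshold and cv_error > threshold:
        return "high variance (overfitting): try a smaller d, a larger lambda, or more data"
    return "looks fine"

print(diagnose(0.5, 0.6))
print(diagnose(0.01, 0.5))
```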
============================== Second lecture ==============================
============= Machine Learning System Design =============
(i) The design process of a machine learning system
Step1. Start with a quick but imperfect implementation of the algorithm;
Step2. Plot learning curves and analyze bias and variance to decide whether you need more data, more features, and so on;
Step3. Error analysis: manually inspect the misclassified examples, find the system's weaknesses, and add features to address them.
Take spam classification as an example. At first you may not think of many features beyond keywords such as $, buy, and discount. At this stage you should implement the classifier quickly, test it on the cross-validation set, and manually inspect the misclassified messages for common traits (for example, you may find far more HTTP hyperlinks than you expected at the outset). You can then add these traits as features to your model and re-run the experiment to optimize it.
(ii) Criteria for numerical evaluation of machine learning algorithms
1. Cross-validation set error (accuracy)
This is a natural idea: if a fitted hypothesis has a large error on the cross-validation set, it is certainly not a good learning algorithm.
But does a small error necessarily mean a good learning algorithm? Consider the following example, known as the skewed classes problem:
Suppose the prevalence of a certain cancer is 0.5%. You design a learning algorithm (minimizing a cost function over a variety of features) that reaches 99% accuracy on the cross-validation set. But there is a trivial predictor that simply declares every sample healthy, ignoring its features entirely, and its cross-validation accuracy is 99.5%. That predictor is clearly useless, and obviously not our goal.
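The skewed-classes trap is easy to reproduce (hypothetical synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical skewed dataset: about 0.5% positives (disease), 99.5% negatives.
y = (rng.uniform(size=10_000) < 0.005).astype(int)

# "Predictor" that ignores every feature and always outputs y = 0.
y_pred = np.zeros_like(y)

accuracy = np.mean(y_pred == y)            # ~0.995 despite learning nothing
true_positives = np.sum((y_pred == 1) & (y == 1))
recall = true_positives / np.sum(y == 1)   # 0.0: it never catches a sick patient
print(accuracy, recall)
```

The near-perfect accuracy and zero recall together show why accuracy alone is not a trustworthy metric on skewed classes.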
So we evaluate learning algorithms by an additional criterion: precision and recall should both be as high as possible.
2. Precision, recall, and the F score
Precision: among the samples you predicted to be ill, the fraction that actually turn out to be ill;
Recall: among the samples that actually turn out to be ill, the fraction you predicted to be ill.
High precision means we only tell a patient they are ill when we are extremely sure (in other words, we do not predict illness lightly);
High recall means that whenever a sample might be ill, we tell them (think of it as erring on the side of caution).
The formulas are shown in the following figure:
Returning to the cancer example: if you always predict y=0 (healthy), the recall is 0. We want a learning algorithm that not only has high accuracy but also high precision and high recall, so the trivial y=0 predictor is ruled out.
Precision and recall often cannot both be maximized at once, so we need a criterion that trades them off: the F score.
The F score gives a single numerical evaluation metric that balances precision and recall; the formula is shown in the following figure:
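The standard formulas can be written directly from the true/false positive/negative counts, with the usual convention F1 = 2PR/(P+R) (a sketch with hypothetical counts):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision P = tp/(tp+fp), recall R = tp/(tp+fn), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
p, r, f1 = precision_recall_f1(8, 2, 4)
print(p, r, f1)  # 0.8, 0.666..., 0.727...
```

Note that F1 is the harmonic mean of precision and recall, so it is close to 1 only when both are close to 1; a predictor with recall 0 (like the always-y=0 one above) scores F1 = 0.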
(iii) The big-data rationale
A large amount of data can greatly improve the final performance of a learning algorithm, often more than switching to a more advanced algorithm; hence the saying:
"It's not who has the best algorithm that wins. It's who has the most data."
Of course, this rests on two premises:
1. The features of the sample provide sufficient information to make the prediction;
You cannot expect to predict a house's price knowing only its size, no matter how expert you are in real estate;
2. The sample provides as many features as possible;
The more features there are, the less prone the model is to underfitting and high bias.
This leads to the following conclusions:
1. The larger the amount of data, the less likely the problems of high variance and overfitting;
2. The more features there are, the less likely the problems of high bias and underfitting.
================================ Epilogue ==============================
This week we focused on the diagnostic methods represented by bias, variance, and learning curves, and on the numerical evaluation criteria for machine learning algorithms: accuracy, precision, recall, and the F score.