When reprinting, please cite the source: http://www.cnblogs.com/ymingjingr/p/4271742.html
Contents
Machine Learning Foundations Note 1 -- When Can Machines Learn? (1)
Machine Learning Foundations Note 2 -- When Can Machines Learn? (2)
Machine Learning Foundations Note 3 -- When Can Machines Learn? (3) (revised version)
Machine Learning Foundations Note 4 -- When Can Machines Learn? (4)
Machine Learning Foundations Note 5 -- Why Can Machines Learn? (1)
Machine Learning Foundations Note 6 -- Why Can Machines Learn? (2)
Machine Learning Foundations Note 7 -- Why Can Machines Learn? (3)
Machine Learning Foundations Note 8 -- Why Can Machines Learn? (4)
Machine Learning Foundations Note 9 -- How Can Machines Learn? (1)
Machine Learning Foundations Note 10 -- How Can Machines Learn? (2)
Machine Learning Foundations Note 11 -- How Can Machines Learn? (3)
Machine Learning Foundations Note 12 -- How Can Machines Learn? (4)
Machine Learning Foundations Note 13 -- How Can Machines Learn Better? (1)
Machine Learning Foundations Note 14 -- How Can Machines Learn Better? (2)
Machine Learning Foundations Note 15 -- How Can Machines Learn Better? (3)
Machine Learning Foundations Note 16 -- How Can Machines Learn Better? (4)
15. Validation
15.1 The Model Selection Problem
So far we have learned many algorithms and models, but every model comes with many parameters to choose, and that choice is the focus of this chapter.
Take binary classification as an example: the learning algorithm may be PLA, pocket, linear regression, or logistic regression; iterative algorithms need a number of iterations T and a learning rate η; nonlinear problems need a feature transform, which could be linear, quadratic, a 10th-order polynomial, or a 10th-order Legendre polynomial; if a regularization term is added, it may be L2 or L1 regularization, and the regularization parameter λ must be chosen as well.
With so many options, it is clearly impossible to hand-craft a model for every data set and every situation, so a principled way of choosing among these parameters is needed.
Ideally, the choice would be made by searching over all the candidate models (hypothesis sets and algorithms) for the one whose final hypothesis makes E_out smallest. The problem is that E_out cannot be computed, because the target function and the data distribution are unknown.
An obvious alternative is to pick the model whose final hypothesis makes E_in smallest, as shown in Equation 15-1. Is this feasible?
(Equation 15-1)
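The equation image itself is missing; based on the definitions above, Equation 15-1 presumably selects the model whose final hypothesis has the smallest in-sample error (a reconstruction, not the original image):

$$ m^* = \arg\min_{1 \le m \le M} \big( E_m = E_{\text{in}}(A_m(\mathcal{D})) \big) $$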
But this is unreasonable: a higher-order transform always achieves an E_in no larger than a lower-order one, and a model without regularization tends to achieve a smaller E_in than one with regularization.
Selecting by E_in therefore favors complex models and is prone to overfitting, so it is not a good criterion.
Moreover, each algorithm A_m searches its own hypothesis set H_m for the hypothesis g_m that minimizes E_in, and comparing these g_m by E_in is, in essence, searching for the minimum over the union of all the hypothesis sets. The VC dimension of that union is large, so the generalization performance will be poor.
Since selecting with the training data does not work, one might instead use a test data set and choose the hypothesis with the smallest test error, as shown in Equation 15-2.
(Equation 15-2)
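A plausible reconstruction of the missing Equation 15-2, selecting by test error instead of training error:

$$ m^* = \arg\min_{1 \le m \le M} \big( E_m = E_{\text{test}}(A_m(\mathcal{D})) \big) $$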
Unlike selection with training data, this case is backed by a Hoeffding inequality that guarantees generalization, as shown in Equation 15-3.
(Equation 15-3)
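A reconstruction of the missing Equation 15-3, a Hoeffding-style bound when selecting among M models with a test set of size N_test:

$$ E_{\text{out}}(g_{m^*}) \le E_{\text{test}}(g_{m^*}) + O\!\left(\sqrt{\frac{\log M}{N_{\text{test}}}}\right) $$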
The Hoeffding inequality was explained in Section 7.4; here the extra term reflects the number of candidate models. But is this actually allowed? Using test data to select model parameters is cheating, because a real test set is not available to us.
The two methods are compared in Table 15-1.

Table 15-1 Comparison of selecting by training error and by test error

| | Training error E_in | Test error E_test |
| --- | --- | --- |
| Source | computed from the training data D | computed from a separate test set |
| Availability | feasible (we own this data) | not feasible (we do not have this data) |
| Cleanliness | contaminated (the training data already determined the hypotheses, so it cannot also be used for selection) | clean |
Neither E_in nor E_test is satisfactory, so an intermediate method is designed: a portion of the training data is set aside as validation data, used only for selecting parameters, and its error is denoted E_val. This data is in our possession and it is also uncontaminated, and the model with the smallest E_val is chosen.
15.2 Validation
Continuing with the validation data mentioned at the end of the previous section: to obtain data that is both available and clean, the original sample set D is split into two parts, a training set D_train and a validation set D_val, as shown in Equation 15-4.
(Equation 15-4)
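A reconstruction of the missing Equation 15-4, assuming the validation set has size K:

$$ \mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}}, \qquad |\mathcal{D}_{\text{train}}| = N - K, \quad |\mathcal{D}_{\text{val}}| = K $$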
The best hypothesis obtained from the full data set D is denoted g_m, as shown in Equation 15-5.
(Equation 15-5)
The best hypothesis obtained from the training set D_train alone is denoted g_m^- (the minus superscript indicating that less data was used), as shown in Equation 15-6.
(Equation 15-6)
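Plausible reconstructions of the missing Equations 15-5 and 15-6, the hypotheses trained on all of D and on the training part only:

$$ g_m = A_m(\mathcal{D}), \qquad g_m^- = A_m(\mathcal{D}_{\text{train}}) $$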
Note that the validation set consists of K samples drawn independently from D according to the joint probability distribution P(x, y). Similarly to Equation 15-3 in the previous section, with E_val taking the role of E_test, generalization is guaranteed by Equation 15-7; note that the quantity in the denominator under the square root is now K, the number of validation samples.
(Equation 15-7)
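A reconstruction of the missing Equation 15-7, the analogue of Equation 15-3 with the validation-set size K in the denominator:

$$ E_{\text{out}}(g_m^-) \le E_{\text{val}}(g_m^-) + O\!\left(\sqrt{\frac{\log M}{K}}\right) $$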
In summary, the algorithm flow of this method is shown in Figure 15-1.
Figure 15-1 Algorithm flow using a training set and a validation set
First, the original data set D is split into two parts, a training set D_train and a validation set D_val. Each candidate model (a hypothesis set H_m together with an algorithm A_m, including the parameters used inside the algorithm) is trained on D_train, producing its best hypothesis g_m^-. The validation set is then used to evaluate each g_m^-, and the model with the smallest validation error is selected; this process can be written as Equation 15-8.
(Equation 15-8)
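A reconstruction of the missing Equation 15-8, selecting the model with the smallest validation error:

$$ m^* = \arg\min_{1 \le m \le M} \big( E_m = E_{\text{val}}(A_m(\mathcal{D}_{\text{train}})) \big) $$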
The selected model is then trained on the entire data set D to obtain the final hypothesis g_{m*}. Note that the final result is obtained by retraining on all of D rather than by directly keeping the hypothesis trained on D_train, because in theory more data yields a hypothesis that is closer to the target, as expressed in Equation 15-9.
(Equation 15-9)
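A reconstruction of the missing Equation 15-9, stating that retraining on the full data set can only help:

$$ E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-) $$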
Combining Equation 15-7 with Equation 15-9 gives Equation 15-10.
(Equation 15-10)
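A reconstruction of the missing Equation 15-10, chaining Equations 15-9 and 15-7:

$$ E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\log M}{K}}\right) $$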
This is only a theoretical, indeed intuitive, conclusion; the following experimental curves illustrate that the inequality holds, as shown in Figure 15-2. The horizontal axis is the size K of the validation set and the vertical axis is the expected error. The dashed black line is the best result obtained by selecting with a test set (impossible in reality); the blue line is the result of selecting the best model with a validation set and then retraining on the entire sample set; the solid black line is the result of the best function obtained on the entire sample set; and the red line is the result of the best function obtained directly from the training samples only. The conclusion above can also be read off the graph.
Figure 15-2 Experimental feasibility of using a validation set
Why does the red line get worse once the validation set grows beyond a certain size? Because the total number of samples is fixed: as the validation set grows, the training set shrinks, so the resulting hypothesis becomes worse and worse.
Note that Equation 15-10 involves two approximations. When K is large enough, E_val(g^-) is close to E_out(g^-); but because N - K then becomes small, E_out(g^-) moves further away from E_out(g). Conversely, when K is small enough, E_out(g^-) is close to E_out(g), but E_val(g^-) may differ a lot from E_out(g^-). Choosing K thus becomes a new problem; Mr. Lin recommends choosing K to be about N/5.
15.3 Leave-One-Out Cross Validation
The choice of K in the previous section is important. Consider the extreme case where the validation set has size K = 1: the validation set is a single example (x_n, y_n), and the validation error is the error of g_n^- on that single point. To make this estimate closer to E_out, the single-point error is averaged over every possible choice of the left-out example; this is leave-one-out cross validation, whose formula is Equation 15-11.
(Equation 15-11)
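A reconstruction of the missing Equation 15-11, where g_n^- denotes the hypothesis trained on D with the single example (x_n, y_n) removed:

$$ E_{\text{loocv}}(\mathcal{H}, A) = \frac{1}{N}\sum_{n=1}^{N} e_n = \frac{1}{N}\sum_{n=1}^{N} \text{err}\big(g_n^-(x_n),\, y_n\big) $$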
The formula may look complicated, so here is an illustration of how leave-one-out cross validation is used. Suppose there are three sample points and two models, a linear model and a constant model; the results are shown in Figures 15-3 and 15-4.
Figure 15-3 Leave-one-out cross validation for the linear model
The leave-one-out error of the linear model is computed as shown in Equation 15-12.
(Equation 15-12)
Figure 15-4 Leave-one-out cross validation for the constant model
The leave-one-out error of the constant model is computed as shown in Equation 15-13.
(Equation 15-13)
The better model is then chosen according to Equation 15-14.
(Equation 15-14)
From Equations 15-12 to 15-14 it follows that the constant model has a smaller leave-one-out error than the linear model, a result that embodies the idea of preferring lower complexity.
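As a concrete illustration of the computation, here is a minimal Python sketch of leave-one-out cross validation comparing a linear and a constant model on three sample points. The three points and the squared-error measure are made up for illustration, so the numbers will not match Figures 15-3 and 15-4.

```python
# Minimal sketch of leave-one-out cross validation on three made-up points,
# comparing a linear model and a constant model under squared error.
points = [(-2.0, 0.0), (0.0, 2.0), (2.0, 1.0)]   # hypothetical (x, y) samples

def fit_constant(data):
    # best constant under squared error is the mean of the y values
    c = sum(y for _, y in data) / len(data)
    return lambda x: c

def fit_linear(data):
    # line through the two remaining points (each fold leaves out one of three)
    (x1, y1), (x2, y2) = data
    slope = (y2 - y1) / (x2 - x1)
    return lambda x: y1 + slope * (x - x1)

def loocv_error(data, fit):
    # E_loocv: average squared error of the model trained without point n,
    # evaluated on the held-out point n (Equation 15-11)
    errors = []
    for n in range(len(data)):
        x_n, y_n = data[n]
        g_minus = fit(data[:n] + data[n + 1:])
        errors.append((g_minus(x_n) - y_n) ** 2)
    return sum(errors) / len(errors)

print("E_loocv(linear)   =", loocv_error(points, fit_linear))
print("E_loocv(constant) =", loocv_error(points, fit_constant))
```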
The most important question is whether E_loocv is actually close to E_out(g). The following theoretical analysis relates the expected value of E_loocv to the average E_out of hypotheses trained on N - 1 samples.
(The expectation and the summation are both linear operations, so their order can be exchanged.)
(The expectation over the data set D can be split into an expectation over the N - 1 remaining samples and an expectation over the single held-out sample (x_n, y_n); the latter turns the single-point error into E_out(g_n^-).)
(Equation 15-14)
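The derivation image is missing; based on the two notes above, it presumably proceeds as follows, where D_n denotes the N - 1 remaining samples and the bar denotes the expected out-of-sample error of hypotheses trained on N - 1 samples:

$$
\begin{aligned}
\mathop{\mathbb{E}}_{\mathcal{D}}\, E_{\text{loocv}}(\mathcal{H}, A)
&= \mathop{\mathbb{E}}_{\mathcal{D}} \frac{1}{N}\sum_{n=1}^{N} e_n
 = \frac{1}{N}\sum_{n=1}^{N} \mathop{\mathbb{E}}_{\mathcal{D}}\, e_n \\
&= \frac{1}{N}\sum_{n=1}^{N} \mathop{\mathbb{E}}_{\mathcal{D}_n} \mathop{\mathbb{E}}_{(x_n, y_n)} \text{err}\big(g_n^-(x_n), y_n\big)
 = \frac{1}{N}\sum_{n=1}^{N} \mathop{\mathbb{E}}_{\mathcal{D}_n} E_{\text{out}}(g_n^-) \\
&= \frac{1}{N}\sum_{n=1}^{N} \overline{E}_{\text{out}}(N-1)
 = \overline{E}_{\text{out}}(N-1)
\end{aligned}
$$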
This proves that the expected leave-one-out error equals the average E_out of hypotheses trained on N - 1 samples, which is almost the same as training on all N samples, since only one sample is left out. In this sense E_loocv is close to E_out(g).
The theoretical argument for the validity of leave-one-out cross validation is somewhat abstract, so here is an experimental view using a handwritten digit data set as an example, shown in Figure 15-5. Blue denotes samples of the digit "1" and red denotes non-"1" samples, with symmetry and intensity used as the two features. The middle picture shows the decision boundary obtained by selecting with E_in, which is clearly not smooth; the right picture shows the boundary obtained by selecting with E_loocv, which is much smoother.
Figure 15-5 Effect of leave-one-out cross validation on the handwritten digit data set
The experiments also plot the relationship between the number of features used (the more features, the more complex the model) and the error rates, as shown in Figure 15-6.
Figure 15-6 Relationship between the number of features and the error rates
It is not hard to see that as the model becomes more complex, E_in keeps decreasing (higher-order functions fit the sample points better) and the gap between E_in and E_out keeps growing, while the E_loocv curve stays very close to E_out. Selecting with E_loocv is therefore much better than selecting with E_in.
15.4 V-Fold Cross Validation
Leave-one-out cross validation as described in the previous section has two problems. The first is computational cost: when the data set is even moderately large, model selection becomes expensive; with 1000 samples, for example, each model must be trained 1000 times, each time on 999 samples, so it is not very practical. The second is stability: because each estimate comes from a single held-out point, the leave-one-out error can fluctuate a great deal, making curves such as those in Figure 15-6 jumpy and unstable. For these reasons, leave-one-out cross validation is seldom used in practice.
To address these two problems, V-fold cross validation is proposed: the sample set D is divided into V parts of equal size; each time, V - 1 parts are used for training and the remaining part is used for validation, and this is repeated so that each part serves once as the validation set. The cross-validation error is shown in Equation 15-15.
(Equation 15-15)
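A reconstruction of the missing Equation 15-15, where g_v^- is the hypothesis trained with the v-th part left out and E_val^(v) is its error on that part:

$$ E_{\text{cv}}(\mathcal{H}, A) = \frac{1}{V}\sum_{v=1}^{V} E_{\text{val}}^{(v)}(g_v^-) $$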
The formula for selecting the optimal model is shown in Equation 15-16.
(Equation 15-16)
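A plausible reconstruction of the missing Equation 15-16:

$$ m^* = \arg\min_{1 \le m \le M} \big( E_m = E_{\text{cv}}(\mathcal{H}_m, A_m) \big) $$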
Mr. Lin recommends choosing V = 10.
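To make the procedure concrete, here is a minimal Python sketch of V-fold cross validation for model selection. The function names and the squared-error measure are hypothetical, not from the original post; it assumes each `fit` function takes a list of (x, y) pairs and returns a prediction function.

```python
# Minimal sketch of V-fold cross validation for model selection
# (hypothetical helper names; squared error assumed as the error measure).
import random

def v_fold_cv_error(data, fit, V=10, seed=0):
    """E_cv: average validation error over the V folds (Equation 15-15)."""
    assert len(data) >= V, "need at least V samples"
    data = data[:]
    random.Random(seed).shuffle(data)
    folds = [data[v::V] for v in range(V)]        # split D into V roughly equal parts
    fold_errors = []
    for v in range(V):
        val = folds[v]                            # one part for validation
        train = [p for u in range(V) if u != v for p in folds[u]]  # V - 1 parts for training
        g_minus = fit(train)                      # train on the V - 1 parts
        err = sum((g_minus(x) - y) ** 2 for x, y in val) / len(val)
        fold_errors.append(err)
    return sum(fold_errors) / V

def select_model(data, models, V=10):
    """Pick the (name, fit) pair with the smallest E_cv (Equation 15-16)."""
    return min(models, key=lambda m: v_fold_cv_error(data, m[1], V))
```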