Regularization and model selection

1 The problem

Model selection problem: there may be several candidate models for a learning problem. For example, to fit a set of sample points, you could use linear regression or polynomial regression. Which model is best, i.e., which one achieves the best balance between bias and variance?

There is also a class of parameter selection problems: if we want to use a weighted regression model (such as locally weighted regression), how do we choose the parameter inside the weight formula for w?

Formal definition: suppose the set of candidate models is M = {M_1, …, M_d}. For example, for a classification problem, M might contain SVMs, logistic regression, neural networks, and other models.

2 Cross-validation

Our first task is to choose the best model from M.

Let the training set be denoted by S.

If we use empirical risk minimization to judge model quality, we could select a model as follows:

1. Train each model M_i on S to learn its parameters, which yields a hypothesis function h_i. (For example, training a linear model gives its parameter vector and hence its hypothesis function.)

2. Choose the hypothesis function with the smallest training error.

Unfortunately, this procedure does not work. For example, when fitting a set of sample points, a high-order polynomial regression will certainly achieve a lower training error than linear regression: its bias is small, but its variance is very large, and it will overfit. We therefore improve the algorithm as follows:

1. Randomly select 70% of the samples in the full training data S as the training set S_train, and keep the remaining 30% as the hold-out (test) set S_cv.

2. Train each model M_i on S_train to obtain a hypothesis function h_i.

3. Evaluate each h_i on S_cv to obtain its empirical error.

4. Choose the model whose hypothesis has the smallest empirical error.

This method is referred to as hold-out cross-validation, or simple cross-validation.

Because the hold-out set and the training set are disjoint, we can assume that the empirical error measured on the hold-out set is close to the generalization error. The hold-out set usually takes 1/4 to 1/3 of all the data; 30% is a typical value.

The method can be improved further: once the best model has been chosen, train it one more time on all of S. Obviously, the more training data, the more accurate the model parameters.

The weakness of simple cross-validation is that the selected model is only the best on 70% of the training data, which may not be the best over all the training data. Moreover, when training data is scarce, too little is left for training after the hold-out set has been set aside.
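The following is a minimal sketch of hold-out cross-validation on a toy problem: choosing the degree of a polynomial regression. The data, the candidate degrees, the 70/30 split, and helper names such as fit_poly and mse are illustrative assumptions, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y is a noisy quadratic function of x.
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.shape)

def fit_poly(x_tr, y_tr, degree):
    """Fit a polynomial of the given degree by least squares."""
    return np.polyfit(x_tr, y_tr, degree)

def mse(coeffs, x_te, y_te):
    """Empirical (mean squared) error of the fitted polynomial."""
    return np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)

# 1. Randomly split S into 70% training data and 30% hold-out data.
perm = rng.permutation(len(x))
n_train = int(0.7 * len(x))
tr, te = perm[:n_train], perm[n_train:]

# 2-4. Train each candidate model on the training part, evaluate it on the
# hold-out part, and pick the degree with the smallest hold-out error.
degrees = [1, 2, 5, 9]
errors = {d: mse(fit_poly(x[tr], y[tr], d), x[te], y[te]) for d in degrees}
best_degree = min(errors, key=errors.get)

# Optional refinement: retrain the chosen model on all of S.
final_coeffs = fit_poly(x, y, best_degree)
print(errors, "-> best degree:", best_degree)
```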

We will make another improvement to the simple cross-validation method as follows:

1. Divide the entire training set S into k disjoint subsets. If S contains m training samples, each subset has m/k training samples; call the subsets S_1, …, S_k.

2. Each time, take one model M_i from the model set M. Select k−1 of the training subsets S_1, …, S_{j−1}, S_{j+1}, …, S_k (i.e., leave out only one subset S_j at a time), use these k−1 subsets to train M_i, and obtain a hypothesis function h_ij. Finally, test h_ij on the remaining subset S_j to obtain an empirical error.

3. Because we leave one subset out at a time (j from 1 to k), we obtain k empirical errors for each M_i; the empirical error of M_i is taken to be the average of these k values.

4. Choose the model M_i with the smallest average empirical error, then train it once more on all of S to obtain the final hypothesis.

This method is known as k-fold cross-validation. Put bluntly, it shrinks the test set of simple cross-validation to 1/k of the data; each model is trained k times and tested k times, and its error is the average over the k runs. Generally, k = 10. This works even when data is scarce. The obvious disadvantage is the large number of training and testing runs.

In the extreme case, k can equal m, meaning only one sample is held out for testing each time; this is called leave-one-out cross-validation.
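Below is a minimal sketch of k-fold cross-validation, assuming k = 10 and reusing the same toy polynomial-regression setup as the hold-out sketch above; the data generation and the candidate degrees are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.shape)

def k_fold_error(x, y, degree, k=10):
    """Average hold-out error of a degree-`degree` polynomial over k folds."""
    folds = np.array_split(np.random.default_rng(1).permutation(len(x)), k)
    errs = []
    for j in range(k):                        # leave subset S_j out on iteration j
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))               # average of the k empirical errors

degrees = [1, 2, 5, 9]
avg_err = {d: k_fold_error(x, y, d) for d in degrees}
best = min(avg_err, key=avg_err.get)
final_coeffs = np.polyfit(x, y, best)         # retrain the winner on all of S
print(avg_err, "-> best degree:", best)
```

Setting k equal to len(x) in the same function gives leave-one-out cross-validation.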

If we invent a new learning model or algorithm, cross-validation can be used to evaluate it. In NLP, for example, we train on one part of the corpus and test on the rest.

3 Feature selection

Strictly speaking, feature selection is also a kind of model selection; we discuss it separately only to highlight the problem. Suppose we want to do regression on sample points of dimension n, where n may be much larger than the number of training samples m. We suspect that many features are useless for the result and want to remove the useless ones from the n features. The n features admit 2^n possible subsets (each feature is either removed or retained); enumerating all of them and using cross-validation to estimate the model's error in each case is completely unrealistic. Therefore, heuristic search methods are needed.

First, forward search:

1. Initialize the feature set F to be empty.

2. For i from 1 to n: if feature i is not in F, add it to F to form F_i = F ∪ {i}, and use cross-validation to estimate the error rate obtained when only the features in F_i are used.

3. From the n results of the previous step, take the F_i with the smallest error rate and update F to it.

4. If the number of features in F reaches n or a pre-set threshold (if any), output the best F found during the entire search; otherwise go back to step 2.

Forward search is an instance of the wrapper model of feature selection. "Wrapper" here means that the learning algorithm is repeatedly run on different feature subsets to test them. Forward search adds one feature at a time from the remaining unselected features to the feature set until it reaches the threshold or n, and then chooses the F with the lowest error rate over the whole search.

Since there is an incremental (forward) search, there is also a decremental one, called backward search. It first sets F = {1, …, n}, then deletes one feature at a time, evaluating after each deletion, until a threshold is reached or F is empty, and then selects the best F.

Both of these algorithms work, but their computational cost is relatively large: evaluating up to n candidate subsets in each of up to n rounds requires on the order of O(n^2) calls to the learning algorithm.
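Here is a minimal sketch of forward search, assuming the learner is ordinary least-squares regression, model quality is estimated with a simple hold-out split, and only two of the toy features actually matter; all names, sizes, and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 200, 8
X = rng.normal(size=(n_samples, n_features))
# In this toy setup only features 0 and 3 actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n_samples)

perm = rng.permutation(n_samples)
tr, te = perm[:140], perm[140:]               # simple hold-out split as the evaluator

def holdout_error(feature_subset):
    """Hold-out error of least squares using only the given features."""
    cols = sorted(feature_subset)
    Xtr, Xte = X[np.ix_(tr, cols)], X[np.ix_(te, cols)]
    w, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
    return np.mean((Xte @ w - y[te]) ** 2)

F = set()
best_err, best_F = np.inf, set()
max_features = 4                              # pre-set threshold on |F|
while len(F) < max_features:
    # Try adding each feature i not yet in F and score F ∪ {i}.
    scores = {i: holdout_error(F | {i}) for i in range(n_features) if i not in F}
    i_best = min(scores, key=scores.get)
    F = F | {i_best}
    if scores[i_best] < best_err:
        best_err, best_F = scores[i_best], set(F)

print("selected features:", sorted(best_F))   # features 0 and 3 should be picked up early
```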

Second, filter feature selection:

The idea of filter feature selection is: for each feature i from 1 to n, compute a score measuring how much information feature x_i carries about the class label y. This gives n scores, which are sorted from largest to smallest, and the top k features are kept. Obviously, the complexity drops dramatically, to O(n).

So the key question is how to measure this score. Our goal is to select the features x_i that are most closely related to y, and both x_i and y have probability distributions, so we think of using mutual information as the measure. This is most appropriate when the features take discrete values; continuous values can first be discretized, using the method mentioned in the earlier article on regression.

Mutual information (MI) formula:

MI(x_i, y) = \sum_{x_i \in \{0,1\}} \sum_{y \in \{0,1\}} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)}

When x_i and y take 0/1 discrete values, the formula is as written above; it generalizes easily to the case of multiple discrete values.

Here p(x_i, y), p(x_i), and p(y) can all be estimated from their empirical frequencies on the training set.
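A minimal sketch of the mutual-information filter, assuming binary (0/1) features and labels and plug-in probability estimates computed from counts on the training set; the toy data and the choice k = 3 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 500, 10
X = rng.integers(0, 2, size=(n_samples, n_features))
# The label mostly follows feature 2, with 10% label noise.
y = (X[:, 2] ^ (rng.random(n_samples) < 0.1)).astype(int)

def mutual_information(xi, y):
    """MI(x_i, y) = sum over x_i, y of p(x_i, y) * log( p(x_i, y) / (p(x_i) p(y)) )."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_xy = np.mean((xi == a) & (y == b))
            p_x, p_y = np.mean(xi == a), np.mean(y == b)
            if p_xy > 0:                      # treat 0 * log 0 as 0
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Score every feature, sort the scores from largest to smallest, keep the top k.
scores = np.array([mutual_information(X[:, i], y) for i in range(n_features)])
k = 3
top_k = np.argsort(scores)[::-1][:k]
print("top features by MI:", top_k, "scores:", np.round(scores[top_k], 4))
```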
