Regularization and Model Selection

1. Problem

Model selection problem: multiple models can be chosen for a learning problem. For example, to fit a set of sample points, we could use either linear regression or polynomial regression. Which model is better to use, i.e., which achieves the best trade-off between bias and variance?

Another type is the parameter selection problem: if we want to use a weighted regression model (such as locally weighted regression), how do we choose the weighting parameter in its formula?

Formal definition: assume the candidate model set is a finite set M = {M_1, ..., M_d}. For example, for a classification task, M might include SVM, logistic regression, neural networks, and other models.

2. Cross Validation

Our first task is to select the best model from M.

Assume that the training set is represented by S.

If we use empirical risk minimization to measure model quality, we could select the model as follows:

1. Train each model on S to obtain its parameters, which gives a hypothesis function for each model. (For example, training a linear model yields its hypothesis function.)

2. Select the hypothesis function with the minimum training error.

Unfortunately, this algorithm is not feasible. For example, when fitting a set of sample points, a higher-order polynomial regression will always achieve a training error no larger than linear regression's: its bias is smaller, but its variance is larger, so it will overfit. The improved algorithm is as follows:

1. Randomly select 70% of the samples in the training data S as the training set, and use the remaining 30% as the test set.

2. Train each model on the training part to obtain its hypothesis function.

3. Evaluate each hypothesis on the test part to obtain its empirical error.

4. Select the model with the smallest empirical error as the best model.

This method is called hold-out cross validation, or simple cross validation.
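The four steps above can be sketched in code. This is a minimal illustration, not the text's own procedure: it uses synthetic data, a made-up quadratic target, and polynomial degree as the "model" being selected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy quadratic. We pick the polynomial degree
# by hold-out cross validation, as described above.
x = rng.uniform(-3, 3, 60)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.5, x.size)

# Step 1: random 70% train / 30% hold-out split.
perm = rng.permutation(x.size)
n_train = int(0.7 * x.size)
tr, te = perm[:n_train], perm[n_train:]

def holdout_mse(degree):
    """Steps 2-3: fit a degree-d polynomial on the training part,
    then measure empirical error (MSE) on the held-out part."""
    coeffs = np.polyfit(x[tr], y[tr], degree)
    pred = np.polyval(coeffs, x[te])
    return float(np.mean((pred - y[te]) ** 2))

errors = {d: holdout_mse(d) for d in range(1, 8)}
best = min(errors, key=errors.get)       # step 4: smallest empirical error
print("hold-out MSE per degree:", {d: round(e, 3) for d, e in errors.items()})
print("selected degree:", best)

# As the text suggests next, retrain the selected model on ALL data.
final_coeffs = np.polyfit(x, y, best)
```

Note that training error alone would always favor degree 7 here; the hold-out error is what exposes the overfitting.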

Because the test set and the training set are disjoint, we can regard the empirical error here as being close to the generalization error. The test set generally accounts for 1/4 to 1/3 of all the data; 30% is a typical value.

The procedure can be improved further: after selecting the best model, retrain it on all of the data S. Obviously, the more training data, the more accurate the model parameters.

The weakness of simple cross-validation is that the optimal model is selected using only 70% of the training data, so it is not necessarily the best model on all the training data. In addition, when the training data is scarce, too little of it remains for training once the test set is split off.

We will make an improvement on the simple cross-validation method, as shown below:

1. Divide the training set S into k disjoint subsets. If the number of training samples in S is m, then each subset has m/k training samples; denote the subsets S_1, ..., S_k.

2. Each time, take one model M_i from the model set M, and pick k-1 of the subsets S_1, ..., S_k (that is, leave one subset S_j out each time). Train M_i on those k-1 subsets to obtain a hypothesis function, then test the hypothesis on the remaining subset S_j to obtain an empirical error.

3. Since we leave out each S_j (j from 1 to k) in turn, we obtain k empirical errors for M_i. The empirical error of M_i is the average of these k empirical errors.

4. Select the model with the smallest average empirical error, then retrain it on all of S to obtain the final result.

This method is called k-fold cross validation. In effect, it shrinks simple cross-validation's test set to 1/k of the data. Each model is trained k times and tested k times, and its error rate is the average over the k runs. Generally, k = 10, which makes reasonable use of the data even when it is scarce. The obvious disadvantage is the larger number of training and testing runs.
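The k-fold procedure can be sketched as follows, again on made-up quadratic data with polynomial degree standing in for the model choice:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.5, x.size)

k = 10
idx = rng.permutation(x.size)
folds = np.array_split(idx, k)           # k disjoint subsets of ~m/k samples each

def kfold_mse(degree):
    """Average held-out MSE over the k folds for a degree-d polynomial."""
    errs = []
    for j in range(k):                   # leave fold j out each time
        test = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))          # average of the k empirical errors

scores = {d: kfold_mse(d) for d in range(1, 6)}
best = min(scores, key=scores.get)
print("10-fold CV MSE per degree:", {d: round(e, 3) for d, e in scores.items()})
print("selected degree:", best)
```

Setting k equal to the sample count turns this same loop into leave-one-out cross validation, at the cost of m training runs per model.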

In the extreme case, k can be set to m, meaning that one sample is left out for testing each time. This is called leave-one-out cross validation.

If we have invented a new learning model or algorithm, we can use cross-validation to evaluate it. For example, in NLP we split the data, train on one part, and test on the other.

3. Feature Selection

Feature selection is, strictly speaking, a kind of model selection; the exact relationship between them is not analyzed here. Suppose we want to perform regression on sample points of dimension n, where n may be much larger than the number of training samples m. We suspect that many of the features are useless for the result and want to remove them. Since each of the n features is either removed or retained, there are 2^n possible feature subsets; enumerating all of them and testing each model's error rate with cross-validation is unrealistic. Therefore, heuristic search methods are required.

First, forward search:

1. Initialize the feature set F to be empty.

2. For each i from 1 to n: if feature i is not in F, form F_i = F ∪ {i} and use cross-validation to estimate the error rate of the model trained using only the features in F_i.

3. Set F to the F_i with the smallest error rate from the previous step.

4. If the number of features in F reaches n or a pre-set threshold (if any), stop and output the best F found during the entire search; otherwise, go back to step 2.

Forward search belongs to the wrapper model of feature selection. "Wrapper" means the learning algorithm is repeatedly re-run and tested with different feature sets. In forward search, one feature is added incrementally from the remaining unselected features each round; when the threshold or n is reached, the F with the minimum error rate among all those evaluated is selected.
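A rough sketch of wrapper-style forward search follows. As stand-ins for whatever learner and cross-validation scheme one actually uses, it scores each candidate feature set with a simple hold-out error of a least-squares fit; the data and the choice of informative features are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 200
X = rng.normal(size=(m, n))
# Only features 0 and 3 actually matter; the rest are noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.1, m)

def cv_error(feature_set):
    """Hold-out error of least squares restricted to the given features
    (a cheap stand-in for full cross-validation)."""
    if not feature_set:
        return float(np.var(y))
    cols = sorted(feature_set)
    tr, te = slice(0, 140), slice(140, None)
    w, *_ = np.linalg.lstsq(X[tr][:, cols], y[tr], rcond=None)
    pred = X[te][:, cols] @ w
    return float(np.mean((pred - y[te]) ** 2))

F, best_F, best_err = set(), set(), float("inf")
for _ in range(n):                        # grow F one feature at a time
    candidates = [(cv_error(F | {i}), i) for i in range(n) if i not in F]
    err, i = min(candidates)              # best single addition this round
    F.add(i)
    if err < best_err:                    # remember the best F seen overall
        best_err, best_F = err, set(F)
print("selected features:", sorted(best_F))
```

Backward search is the mirror image: start from the full set and delete the least useful feature each round.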

Since features can be added incrementally, they can also be removed incrementally, which is called backward search. First set F = {1, 2, ..., n}, then delete one feature at a time, evaluating at each step, until F reaches the threshold or is empty; then select the best F.

Both algorithms work, but they are computationally expensive: each requires on the order of O(n^2) calls to the learning algorithm.

Second, filter feature selection:

The idea of filter feature selection is to compute, for each feature i from 1 to n, a score measuring how much information x_i carries about the class label y, then rank the n scores in descending order and output the top k features. Obviously, the complexity is greatly reduced, to roughly O(n).

The key question is what measure to use. Our goal is to select the features most closely associated with y. Each feature x_i and the label y both have probability distributions, so a natural choice is mutual information, which is best suited to discrete values; continuous features should first be discretized (a technique mentioned in the earlier article on regression).

Mutual information formula:

MI(x_i, y) = Σ_{x_i ∈ {0,1}} Σ_{y ∈ {0,1}} p(x_i, y) log [ p(x_i, y) / ( p(x_i) p(y) ) ]

When x_i and y are 0/1 discrete values, the formula is as above. It is easy to generalize to the case of multiple discrete values.

Here, the probabilities p(x_i, y), p(x_i), and p(y) are all estimated from the training set.

To see where this MI formula comes from, note that it is exactly the KL divergence (Kullback-Leibler distance):

MI(x_i, y) = KL( p(x_i, y) || p(x_i) p(y) )

That is, MI measures how far x_i is from being independent of y. If the two are independent, so that p(x_i, y) = p(x_i) p(y), the KL distance is 0: x_i is irrelevant to y and can be removed. Conversely, if the two are closely related, the MI value is large. After ranking features by MI, the remaining question is how to choose k (how many of the top features to keep). We can again use cross-validation, scanning k from 1 to n and picking the k that performs best; this scan is only linear in n. For example, when using naive Bayes to classify text, the vocabulary size n is very large, and filter feature selection can improve the classifier's accuracy.
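The MI score for 0/1 features can be computed directly from training-set frequencies. A small sketch on synthetic data (the features and noise rate are invented; only feature 1 is actually related to the label):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 500, 6
X = rng.integers(0, 2, size=(m, n))        # six binary features
noise = (rng.random(m) < 0.1).astype(int)  # 10% label noise
y = X[:, 1] ^ noise                        # y is essentially feature 1

def mutual_information(xi, y):
    """MI(x_i, y) = sum over a,b in {0,1} of p(a,b) log( p(a,b) / (p(a) p(b)) ),
    with all probabilities estimated from the training set."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = np.mean((xi == a) & (y == b))
            px, py = np.mean(xi == a), np.mean(y == b)
            if pxy > 0:                    # 0 log 0 is taken as 0
                mi += pxy * np.log(pxy / (px * py))
    return mi

scores = [mutual_information(X[:, i], y) for i in range(n)]
ranking = list(np.argsort(scores)[::-1])   # features in descending MI order
print("MI scores:", [round(s, 4) for s in scores])
print("ranking:", ranking)
```

Independent features score near zero, so the informative feature dominates the ranking; the cutoff k would then be chosen by cross-validation as described above.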

4. Bayesian Statistics and Regularization

This topic is a bit involved. To put it bluntly, we want a better estimation method that reduces the occurrence of overfitting.

Recall the estimation methods used so far: linear regression uses least squares, logistic regression uses maximum likelihood estimation of a conditional probability, naive Bayes uses maximum likelihood estimation of a joint probability, and the SVM uses quadratic programming.

Previously we used the maximum likelihood estimation method (for example, in logistic regression):

θ_ML = argmax_θ Π_{i=1}^{m} p(y^(i) | x^(i); θ)

Note that this formulation differs slightly from the one described on Wikipedia:

Http://zh.wikipedia.org/wiki/%E6%9C%80%E5%A4%A7%E5%90%8E%E9%AA%8C%E6%A6%82%E7%8E%87

The discrepancy is that Wikipedia records only the observed data as x and maximizes p(x; θ), whereas here each sample is split into features x and a class label y, so we want to maximize p(x, y; θ). In a discriminative model such as logistic regression, p(x, y; θ) = p(y | x; θ) p(x), and p(x) does not depend on θ, so argmax_θ p(x, y; θ) is determined by argmax_θ p(y | x; θ). Strictly speaking, p(y | x; θ) is not itself the probability of the sample (x, y), but since p(x) is fixed, maximizing p(y | x; θ) also maximizes p(x, y; θ). In a generative model such as naive Bayes, p(x, y) = p(y) p(x | y), the product of the class prior and the probability of the features x given the label y; since the components of x are assumed conditionally independent, p(x | y) is a product of per-component probabilities and involves no extra parameters. In that case, maximum likelihood estimation directly estimates the joint probability p(x, y).

In this formula, we regard the parameter θ as an unknown constant vector. Our task is to estimate this unknown θ.

From a broad perspective, the maximum likelihood viewpoint is that of the frequentist school (frequentist statistics): θ is not a random variable but an unknown constant, which is why we write p(y | x; θ) rather than conditioning on θ.

The other viewpoint is the Bayesian school, which treats θ as a random variable with an unknown value. Since θ is a random variable, different values of θ have different probabilities p(θ) (called the prior probability), representing our prior belief in each particular θ. Representing the training set as S = {(x^(i), y^(i))}, i from 1 to m, we first need the posterior probability of θ:

p(θ | S) = p(S | θ) p(θ) / p(S)
         = [ Π_{i=1}^{m} p(y^(i) | x^(i), θ) ] p(θ) / ∫_θ [ Π_{i=1}^{m} p(y^(i) | x^(i), θ) ] p(θ) dθ

A note on this derivation: the first step is just Bayes' rule. In the numerator, the fully written form would be p(S | θ) = Π p(x^(i), y^(i) | θ) = Π p(y^(i) | x^(i), θ) p(x^(i)); the factors p(x^(i)) also appear in the denominator, so they cancel, which is why only Π p(y^(i) | x^(i), θ) remains. (Equivalently, one can view the model as a mapping from x to y, so that only p(y | x, θ) depends on θ.) As for the denominator, p(S) is obtained by integrating the numerator over all possible values of θ: for each candidate θ, take the probability of all m samples under that single θ, weight it by the prior p(θ), and integrate. Note that one θ is shared by all m samples inside the integral, not a fresh θ chosen per sample.

 

Different models compute p(y | x, θ) differently. For example, in Bayesian logistic regression,

p(y | x, θ) = h_θ(x)^y (1 − h_θ(x))^(1−y),  where h_θ(x) = 1 / (1 + exp(−θᵀx)),

that is, p is a Bernoulli distribution.

With θ treated as a random variable, given a new sample with features x, we predict y using:

p(y | x, S) = ∫_θ p(y | x, θ) p(θ | S) dθ

where p(θ | S) is obtained from the preceding formula. If we want the expected value of y, we apply the expectation formula:

E[y | x, S] = ∫_y y · p(y | x, S) dy

Most of the time, we only need the y with the largest p(y | x, S) (when y is a discrete value).

This solution differs from the previous method: before, we first solved for θ and then predicted directly; here, prediction integrates over all possible values of θ.

To sum up the differences: maximum likelihood estimation does not treat θ as a random variable in y's distribution. It regards θ as a constant whose value is merely unknown, much like the constant c in y = 2x + c. Since the formula for p(y | x; θ) contains the unknown θ, we derive the maximum likelihood objective and then solve for θ.

Bayesian estimation treats θ as a random variable whose values follow a certain distribution. There is no single fixed value to solve for; instead, we can only integrate over θ at prediction time.

However, although the Bayesian formulas above are reasonable and elegant, the posterior probability is difficult to compute: the denominator requires integrating over θ, which is usually high-dimensional and has no closed form. In practice, p(θ | S) is therefore approximated by a single point estimate, the maximum a posteriori (MAP) estimate:

θ_MAP = argmax_θ [ Π_{i=1}^{m} p(y^(i) | x^(i), θ) ] p(θ)

which differs from maximum likelihood only by the extra prior factor p(θ). With a Gaussian prior θ ~ N(0, τ²I), the MAP estimate shrinks the parameters toward zero; this is exactly L2 regularization, and it reduces overfitting.
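A common way around the intractable posterior integral is the MAP point estimate, which maximizes the likelihood times the prior; with a Gaussian prior on θ this amounts to an L2 penalty. The sketch below (synthetic data, made-up true parameters, plain gradient ascent rather than any particular solver) compares maximum likelihood and MAP logistic regression:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 100
X = np.c_[np.ones(m), rng.normal(size=(m, 2))]    # intercept + 2 features
true_theta = np.array([0.5, 2.0, -1.5])           # invented ground truth
y = (rng.random(m) < 1 / (1 + np.exp(-X @ true_theta))).astype(float)

def fit_logistic(tau2=None, steps=5000, lr=0.1):
    """Gradient ascent on the logistic log-likelihood. Passing tau2 adds the
    gradient of a log-prior theta ~ N(0, tau2*I), turning ML into MAP, i.e.
    an L2 (ridge) penalty of ||theta||^2 / (2*tau2)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ theta))
        grad = X.T @ (y - p)              # log-likelihood gradient
        if tau2 is not None:
            grad -= theta / tau2          # prior's contribution
        theta += lr * grad / m
    return theta

theta_ml = fit_logistic()                 # maximum likelihood
theta_map = fit_logistic(tau2=1.0)        # MAP with a N(0, I) prior
print("ML :", np.round(theta_ml, 2))
print("MAP:", np.round(theta_map, 2))
```

The only difference between the two fits is the `theta / tau2` term; the prior pulls the MAP estimate toward zero, which is the regularization effect this section is named for.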
