Stanford University Machine Learning Notes: The Overfitting Problem and Regularization


When we apply the linear regression and logistic regression described in the previous post, an overfitting problem often arises. Overfitting is defined below:

Overfitting (over-fitting):
Overfitting occurs when we have very many features and the hypothesis learned from them fits the training set extremely well (the cost function is very small, almost 0), yet fails to generalize to new data (its predictions on new examples are poor; in other words, its generalization ability is weak).
The following examples illustrate the concepts of underfitting, overfitting, and a good fit:

All three plots above concern housing-price prediction. The first uses a linear model; the fit is clearly poor and the error remains large, a phenomenon called underfitting (sometimes referred to as high bias). The third uses a fourth-degree polynomial; it fits the training samples extremely well, with a loss of almost zero, but it is so focused on fitting the training data that it loses sight of the purpose of training a model, which is to predict new data, and its predictions on new data are poor. This is called overfitting (high variance). The quadratic model in the middle fits the training data well: although its loss is larger than the third model's, it captures the underlying character of the data and is robust to it, so even if some factor causes a few points to deviate from their original values, the model still fits well.
The phenomena of overfitting and underfitting are discussed here in terms of regression, but they apply equally to classification problems.
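To make this concrete, here is a minimal sketch in Python with NumPy (the data values are hypothetical stand-ins, since the original figures are not reproduced): it fits degree-1, degree-2, and degree-4 polynomials to a few noisy "house size vs. price" samples and compares the training cost with the cost on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

def truth(s):
    return 2.0 * np.sqrt(s)            # assumed underlying price curve

x_train = np.linspace(1.0, 5.0, 5)
y_train = truth(x_train) + rng.normal(0.0, 0.3, x_train.shape)
x_test = np.linspace(1.2, 4.8, 50)     # unseen sizes from the same range
y_test = truth(x_test) + rng.normal(0.0, 0.3, x_test.shape)

def cost(theta, x, y):
    """Squared-error cost (1 / 2m) * sum((h(x) - y)^2)."""
    return ((np.polyval(theta, x) - y) ** 2).sum() / (2 * len(y))

for degree, label in [(1, "underfit / high bias"),
                      (2, "good fit"),
                      (4, "overfit / high variance")]:
    theta = np.polyfit(x_train, y_train, degree)   # least-squares fit
    print(f"degree {degree} ({label}): "
          f"train cost {cost(theta, x_train, y_train):.4f}, "
          f"test cost {cost(theta, x_test, y_test):.4f}")
# The degree-4 polynomial passes through all 5 training points, so its
# training cost is essentially zero, but that says nothing about how it
# behaves on the unseen test sizes, which is the point of the example.
```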
Methods for solving overfitting:
1. Remove features that do not help us predict correctly. Overfitting tends to appear when there are many features but relatively few training samples. We can manually select the relevant features and discard the irrelevant ones, or use an algorithm to choose them, such as PCA (explained in a follow-up post).
2. Regularization. This method keeps all the features and only reduces the magnitude of the parameters. It performs very well when we have many features and each feature contributes a little to the predicted result.

Let's use an example to understand what regularization is. Returning to the example above and looking at the third image, the fit clearly exhibits overfitting; its hypothesis is:
$h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$

How regularization works in the cost function:
Comparing the three images, it is not hard to see that the overfitting is caused by the high-order terms, so we should reduce the weights $\theta_3$ and $\theta_4$ of those terms. We can start from the cost function: add a penalty on the parameters $\theta_3$ and $\theta_4$ so that they become small. The modified cost function can be, for example (1000 simply stands in for some very large coefficient):

$\min_\theta \; \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2\right]$

Let us analyze how this works. Our goal is to minimize the loss function, and the coefficients in front of $\theta_3$ and $\theta_4$ are very large, so minimizing the loss function forces $\theta_3$ and $\theta_4$ to take very small values, close to zero. The high-order terms then contribute almost nothing to the hypothesis.
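As a sanity check on this reasoning, the sketch below (hypothetical data; the coefficient 1000 is just the "very large" penalty from the text) evaluates the modified cost for one parameter vector that uses the high-order terms and another with $\theta_3 = \theta_4 = 0$:

```python
import numpy as np

def modified_cost(theta, x, y):
    """Squared-error cost plus large penalties on theta_3 and theta_4.

    theta holds [theta_0, ..., theta_4] for the 4th-degree hypothesis;
    the 1000s are the illustrative 'very large' coefficients, not tuned.
    """
    h = np.polyval(theta[::-1], x)               # h(x) = sum_j theta_j x^j
    fit_term = ((h - y) ** 2).sum() / (2 * len(x))
    penalty = 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2
    return fit_term + penalty

# Hypothetical training data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.4, 4.1, 4.4])

wiggly = np.array([0.5, 1.0, -0.2, 0.3, -0.05])  # relies on x^3, x^4 terms
flat   = np.array([0.5, 1.0, -0.2, 0.0, 0.0])    # theta_3 = theta_4 = 0
print(modified_cost(wiggly, x, y))   # penalty term dominates the cost
print(modified_cost(flat, x, y))     # only the data-fit term remains
```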

The above shows the regularization process on a specific example. In general, however, when there are many features we do not know in advance which parameters should be penalized, so we penalize all of them and let the minimization of the cost function decide how strongly each is suppressed. The regularized loss function is therefore:

$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$

The first term measures how well the model fits the training data; the second term keeps the parameters small. $\lambda$ is called the regularization parameter, and it controls the balance between the two.
It is worth noting that we generally do not penalize $\theta_0$, which is why the sum in the penalty term starts at $j = 1$.
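A direct translation of this cost function into code might look as follows; this is a sketch that assumes a design matrix X whose first column is all ones, so that theta[0] plays the role of $\theta_0$ and can be skipped by the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta) from the formula above.

    X is the m x (n+1) design matrix whose first column is all ones, so
    theta[0] plays the role of theta_0 and is deliberately not penalized.
    """
    m = len(y)
    residual = X @ theta - y
    fit_term = (residual ** 2).sum()
    penalty = lam * (theta[1:] ** 2).sum()       # sum starts at j = 1
    return (fit_term + penalty) / (2 * m)
```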

For the example above, comparing the regularized fit with the unregularized one shows that regularization smooths out the wildly oscillating overfitted curve while still tracking the overall trend of the data.

Finally, consider the choice of $\lambda$. If $\lambda$ is too large, all the penalized parameters are driven close to zero and the hypothesis degenerates to roughly $h_\theta(x) = \theta_0$, a horizontal line that underfits the data; if $\lambda$ is too small, the penalty has little effect and the overfitting remains. $\lambda$ therefore has to be chosen to balance fitting the training data against keeping the parameters small.
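To see the effect of $\lambda$ numerically, one standard route is the regularized normal equation, the closed-form minimizer of the cost above, with the (0, 0) entry of the penalty matrix zeroed so that $\theta_0$ escapes the penalty. The data here are hypothetical:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * M) theta = X^T y, where M is the identity
    with its (0, 0) entry zeroed so that theta_0 is not penalized."""
    M = np.eye(X.shape[1])
    M[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * M, X.T @ y)

# Hypothetical 4th-degree design matrix for the housing example.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.4, 4.1, 4.4])
X = np.vander(x, 5, increasing=True)      # columns: 1, x, x^2, x^3, x^4

for lam in (0.0, 1.0, 1e6):
    theta = ridge_fit(X, y, lam)
    print(f"lambda = {lam:g}: theta = {np.round(theta, 4)}")
# lambda = 0 reproduces the unregularized (overfit-prone) interpolation;
# a moderate lambda shrinks the high-order terms; a huge lambda pushes
# every penalized theta_j toward 0, leaving roughly h(x) = theta_0.
```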
