Starting last week, I suddenly remembered this thing called regularization. I had always heard that adding a norm term prevents overfitting, but why is regularization so magical?
After a week of reading related books and blog posts, I decided to write a short summary first, and will add more once I understand it more deeply.
Contents:
- What is overfitting
- The L1 norm regularization term: definition, model change, interpretation based on Occam's razor
- The L2 norm regularization term: definition, model change, condition number
- Re-encountering Bayes
- Summary
What is overfitting
Let's start with a figure:
Overfitting means the model has learned something so complex that it predicts the known data very well but predicts unknown data poorly. In other words, the model behaves well on the training set (say, the mean squared error is optimized down to 1e-10) but performs quite badly on the data it actually has to predict. As we can see from the figure above, the plot on the right is clearly overfitted: the objective on the training set can be optimized to an excellent value, yet the model is of little use for the data to be predicted next.
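To make this concrete, here is a minimal sketch (not from the original post; it assumes NumPy is available) that fits polynomials of degree 1 and degree 9 to ten noisy samples of a sine curve. The degree-9 fit drives the training error to essentially zero while the test error blows up, which is exactly the behavior of the right-hand plot.

```python
# Sketch: compare a simple and an overly complex model on the same noisy data.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)            # least-squares polynomial fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={mse_train:.2e}  test MSE={mse_test:.2e}")
```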
Next, taking the linear regression model as the basis, let's see how seemingly simple norms, added as regularization terms, affect our model and prevent overfitting.
The L1 norm regularization term
Definition:
The L1 norm is defined as L_1 = \sum_{i=1}^{n}|w_i|; in short, it is the sum of the absolute values of the parameters.
Model change:
The original linear regression model's optimization goal is to minimize the loss function, namely:
L(W) = \frac{1}{n}\sum_{i=1}^{n}(f(x_i, w) - y_i)^2
Now let's add the L1-norm regularization term to it:
L(W) = \frac{1}{n}\sum_{i=1}^{n}(f(x_i, w) - y_i)^2 + \lambda\|w\|_1
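As a minimal sketch (assuming NumPy and a linear model f(x, w) = Xw; the function name is only for illustration), the regularized objective above can be written as:

```python
import numpy as np

def l1_regularized_loss(w, X, y, lam):
    """L(w) = (1/n) * sum_i (f(x_i, w) - y_i)^2 + lam * ||w||_1"""
    residuals = X @ w - y                 # f(x_i, w) - y_i for a linear model
    data_term = np.mean(residuals ** 2)   # mean squared error
    penalty = lam * np.sum(np.abs(w))     # L1 norm of the weights
    return data_term + penalty
```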
Interpretation based on Occam's razor
Why add such a term? First, the principle of Occam's razor: among all possible models, the one that explains the known data well and is as simple as possible is the model that should be chosen. From the viewpoint of Lagrange multipliers, the regularization term appended above can be treated as a constraint on the original objective. Now let's use a single figure to see how this constraint adjusts our model and reduces overfitting:
The figure comes from Chapter 3 of PRML.
The model in the figure is linear regression with two features, so the parameters to be optimized are w1 and w2. The left plot uses L2 regularization and the right plot L1. The blue lines are the contour lines of the original objective function, each ring corresponding to one value of the objective; the constraint region is bounded by the red boundary (that is, the regularization part). The first point where the two meet gives the optimal parameters we are looking for.
You can see that the constraint region of the L1 norm is relatively sharp, so the expanding contours of the optimization objective are much more likely to first touch the constraint region at one of its corners. Think of it this way: if you slide a circle toward a square, it will more likely hit a corner of the square first.
And what does it mean when the intersection point falls on a coordinate axis? Weights equal to 0 are produced.
Suddenly it all makes sense: while optimizing the objective function, the L1 norm makes the model sparse (the weights of some features become 0), which is entirely consistent with the razor principle: for the same optimization result, the simpler the model, the better.
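A quick way to check this sparsity effect (a sketch, assuming scikit-learn, which the post itself does not use) is to fit L1- and L2-regularized linear regressions on data where only a few features matter and count how many weights end up exactly zero:

```python
# Sketch: L1 (Lasso) vs. L2 (Ridge) on data with 3 informative features out of 20.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]              # only the first three features matter
y = X @ true_w + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)         # L2 penalty
print("exact zeros with L1:", np.sum(lasso.coef_ == 0))   # typically most of the 17 irrelevant weights
print("exact zeros with L2:", np.sum(ridge.coef_ == 0))   # typically none
```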
The L2 norm regularization term
Definition:
The L2 norm is defined as L_2 = \sum_{i=1}^{n}(w_i)^2; in short, it is the sum of the squares of the parameters.
Model change:
Now let's add the L2-norm regularization term to it:
L(W) = \frac{1}{n}\sum_{i=1}^{n}(f(x_i, w) - y_i)^2 + \lambda\|w\|_2^2
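For symmetry with the L1 sketch above (same assumptions: NumPy, a linear model f(x, w) = Xw, illustrative function name), the L2-regularized objective looks like this:

```python
import numpy as np

def l2_regularized_loss(w, X, y, lam):
    """L(w) = (1/n) * sum_i (f(x_i, w) - y_i)^2 + lam * ||w||_2^2"""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + lam * np.sum(w ** 2)   # squared L2 norm penalty
```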