Source: https://www.cnblogs.com/jianxinzhou/p/4083921.html
1. The problem of overfitting
(1)
Let's look at the example of predicting house prices. First we fit a linear regression to the data, shown in the first graph on the left. Doing so gives us a straight line that fits the data, but in fact this is not a good model. Looking at the data, it is clear that as house area increases, the change in housing price levels off: the further right we go, the flatter the trend. A straight line therefore does not fit the training data well.
We call this situation underfitting, or high bias.
The two terms mean roughly the same thing: the training data is not fitted well. "High bias" is a term from the early days of machine learning; it expresses the idea that if we fit the training data with a straight line, the algorithm has a very strong preconception, or bias, that house price is a linear function of area. In the second figure we add a quadratic term, that is, we fit the data with a quadratic function. This naturally yields a curve, and the fit turns out to be good. At the other extreme, the third figure fits the data with a fourth-order polynomial. Here we have five parameters, θ0 through θ4, so we can also fit a curve that passes through all five training samples and obtain the curve on the right. On the one hand, we seem to fit the training data very well, because the curve passes through every training example. On the other hand, it is a highly distorted curve that keeps oscillating up and down, so we do not consider it a good model for predicting house prices.
We call this situation overfitting, also called high variance.
Like high bias, high variance is also a historical term. Intuitively, if we fit a very high-order polynomial, the hypothesis can fit the training set well (it passes through almost all of the training data), but the space of possible hypotheses is too large and there are too many variables.
At the same time, if we do not have enough data (a large enough training set) to constrain a model with so many variables, overfitting occurs.
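A minimal sketch of this idea in Python (not part of the original post): five made-up training points whose price levels off with area, fitted by polynomials of degree 1, 2, and 4. The degree-4 fit drives the training cost to essentially zero while wiggling between the points.

```python
# Illustrative sketch of under- vs. overfitting with NumPy's polyfit.
# The data values are invented for illustration only (area in 100 m^2, price in 10k units).
import numpy as np

area  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # feature x
price = np.array([1.5, 2.6, 3.1, 3.4, 3.5])   # target y (levels off to the right)

for degree in (1, 2, 4):
    coeffs = np.polyfit(area, price, degree)        # fit a degree-d polynomial
    fitted = np.polyval(coeffs, area)               # predictions on the training set
    train_cost = np.mean((fitted - price) ** 2) / 2 # squared-error cost on training data
    print(f"degree {degree}: training cost = {train_cost:.6f}")

# degree 1 underfits (high bias); degree 2 fits well; degree 4 passes through all
# five points (training cost ~ 0) but oscillates between them: overfitting (high variance).
```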
(2)
Overfitting usually occurs when there are too many variables (features). In this case the trained hypothesis always fits the training data very well; the cost function may be very close to 0, or even exactly 0. But such a curve tries so hard to fit the training data that it fails to generalize to new samples, so it cannot predict the prices of new examples. Here, the term "generalization" refers to a hypothesis's ability to apply to new samples, that is, to data that does not appear in the training set.
Previously, we saw overfitting in linear regression. A similar situation applies to logistic regression.
(3)
So what should we do if there is an over-fitting problem?
Too many variables (features) together with very little training data can cause overfitting. Accordingly, there are two ways to address the problem of overfitting.
Method 1: Reduce the number of selected variables
Specifically, we can manually inspect each variable to decide which ones are more important, and keep only those more important feature variables. As for which variables should be discarded, we will discuss that later; it involves model selection algorithms, which can automatically choose which feature variables to keep and which to discard. This method can be effective, but its drawback is that by discarding feature variables you also discard information about the problem. For example, perhaaps every feature is somewhat useful for predicting house prices; in that case we would rather not discard any of them, or the information they carry.
Method 2: Regularization
With regularization, we keep all of the feature variables but reduce the magnitude of the parameters θ(j). This method works well when we have many feature variables, each of which contributes a little to the prediction. As we saw in the house-price example, we can have many features, each of them useful, so we do not want to delete any of them; this is what motivates regularization. Next we will discuss how to apply regularization and what regularization means, and then how to use it to make learning algorithms work properly and avoid overfitting.
2. Cost Function
(1)
In the previous section we saw that a quadratic function fits the data well, whereas a higher-order polynomial may end up fitting the training set very closely and yet give a poor result, because it overfits the data and does not generalize. Consider the following hypothesis: we want to add a penalty term so that the parameters θ3 and θ4 are sufficiently small.
Here, our optimization objective is to minimize the usual mean squared error cost function.
To this function we add two terms: 1000 times the square of θ3, and 1000 times the square of θ4.
1000 is just some large number chosen arbitrarily. Now, if we want to minimize this new cost function, we need to make θ3 and θ4 as small as possible, because if θ3 is not small, the 1000·θ3² term added on top of the original cost makes the new cost function very large. Therefore, when we minimize this new cost function, the value of θ3 will end up close to 0, and the value of θ4 close to 0, almost as if we were ignoring these two terms. If we do this (θ3 and θ4 close to 0), we obtain approximately a quadratic function, as written out below.
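The penalized objective described above was shown as an image in the original post; reconstructed in its standard form, it reads:

$$
\min_\theta \; \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 \;+\; 1000\,\theta_3^2 \;+\; 1000\,\theta_4^2
$$

Minimizing this drives θ3 ≈ 0 and θ4 ≈ 0, so the fitted hypothesis is essentially the quadratic θ0 + θ1x + θ2x².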
Therefore, we end up fitting the data properly: what we actually use is a quadratic function plus two terms whose contributions are tiny (because their coefficients θ3 and θ4 are very close to 0). This is clearly a better hypothesis.
(2)
More generally, here is the idea behind regularization: if our parameter values are small, we will often obtain a simpler hypothesis. In the example above we penalized only θ3 and θ4, driving both values close to zero, and obtained a simpler hypothesis, which is essentially a quadratic function. More generally, if we penalize the other parameters in the same way we penalized θ3 and θ4, we often obtain a correspondingly simpler hypothesis.
In fact, smaller parameter values usually correspond to a smoother, simpler function, which is therefore less prone to overfitting.
Why smaller parameters correspond to a simpler hypothesis may not be fully clear to you yet, but in the example above, making θ3 and θ4 very small did give us a simpler hypothesis; the example at least provides some intuition.
Now consider a concrete case. For house-price prediction we may have hundreds of features, and unlike the polynomial example above, we do not know in advance that θ3 and θ4 are the high-order terms. So if we have a hundred features, we do not know how to pick out the most relevant parameters or how to shrink only some of them. In regularization, what we do instead is shrink all of the parameters in the cost function (here, the cost function of linear regression), because we do not know which one or several of them should be shrunk.
Therefore, we modify the cost function by adding a term at the end, inside the square brackets shown below. When we add this extra regularization term, we shrink every parameter.
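The original post displayed this cost function as an image; reconstructed in the standard form implied by the surrounding text, it is:

$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 \;+\; \lambda\sum_{j=1}^{n}\theta_j^2\right]
$$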
Incidentally, by convention we do not penalize θ0: the regularization sum runs from 1 to n, not from 0 to n. In practice it makes only a very small difference whether θ0 is included, but by convention we regularize only θ1 through θn.
The added term λ∑θj² (summed over j = 1 to n) is the regularization term.
Here, λ is called the regularization parameter.
What λ does is control the trade-off between two different goals.
The first goal is to fit the training data well: we want the hypothesis to fit the training set well.
The second goal is to keep the parameter values small (this is what the regularization term does).
The regularization parameter λ controls the balance between these two goals: fitting the training data versus keeping the parameter values small, so that the hypothesis stays relatively simple and overfitting is avoided. For house-price prediction, fitting a very high-order polynomial as before produces a very wiggly, complicated curve; using the regularized objective instead yields a more suitable curve. It is not exactly a quadratic function, but it is smoother and simpler, and it gives us a better hypothesis for this data. Once again, why penalizing the parameters has this effect can be hard to see at first, but if you implement regularization yourself you will get a direct, intuitive feel for it.
(3)
In regularized linear regression, if the value of the regularization parameter λ is set to a very large value, what will happen?
We would then penalize θ1, θ2, θ3, θ4, ... extremely heavily, and in the end all of these parameters would be driven close to zero.
If we do that, our hypothesis is equivalent to dropping all of those terms, leaving only the simple hypothesis that the house price equals θ0, which is like fitting a horizontal line to the data. This is underfitting. In this case the hypothesis fails: for the training set it is just a flat line that shows no trend and does not come close to most of the training samples. Another way to put it is that the hypothesis has a strong preconception, or high bias, that the predicted price is simply equal to θ0; it is just a horizontal line through the data.
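In symbols, with θ1 ≈ θ2 ≈ ... ≈ θn ≈ 0 the hypothesis degenerates to

$$
h_\theta(x) \approx \theta_0,
$$

a constant prediction, i.e. a horizontal line.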
Therefore, to make regularization work well we need to choose a good regularization parameter λ. When we discuss model selection later, we will cover a method for choosing λ automatically. To put regularization to use, we next apply these ideas to linear regression and logistic regression so that we can avoid overfitting in both.
3. Regularized Linear Regression
We have already introduced the cost function of regularized linear regression (also known as ridge regression); it is the following.
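The original showed this cost function as an image; restated in the standard form from Section 2:

$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2 \;+\; \lambda\sum_{j=1}^{n}\theta_j^2\right]
$$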
For linear regression, we previously derived two learning algorithms: one based on gradient descent and one based on the normal equation.
(1)
Gradient Descent:
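The update rule appeared as an image in the original post; in its standard form (note that θ0 is not regularized) it is:

$$
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_0^{(i)} \\
\theta_j &:= \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right], \qquad j = 1,\dots,n
\end{aligned}
$$

Equivalently, θj := θj(1 − αλ/m) − α(1/m)∑(h−y)xj: each iteration first shrinks θj slightly toward zero and then performs the usual gradient descent update.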
(2)
The normal equation is as follows:
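The original displayed this as an image; the standard regularized normal equation it refers to is:

$$
\theta = \left(X^{T}X + \lambda
\begin{bmatrix}
0 & & & \\
 & 1 & & \\
 & & \ddots & \\
 & & & 1
\end{bmatrix}\right)^{-1} X^{T}y
$$

where the (n+1)×(n+1) matrix next to λ is the identity with its top-left entry set to 0, so that θ0 is not regularized.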
(3)
Suppose m (the number of training samples) is less than or equal to n (the number of features). From the earlier post we know that if you have only a small number of samples and the number of features exceeds it, the matrix XᵀX is non-invertible (singular); put another way, the matrix is degenerate, and we cannot use the normal equation to solve for θ. Fortunately, regularization also takes care of this: it can be shown that as long as the regularization parameter λ is strictly greater than zero, the matrix XᵀX + λ·diag(0, 1, ..., 1) appearing in the regularized normal equation above is invertible. Regularization therefore handles any case in which XᵀX is not invertible.
So you now know how to implement ridge regression and use it to avoid overfitting, even when you have many features and a relatively small training set. This should let you apply linear regression more effectively to many problems.
In the next section, we apply the same regularization idea to logistic regression, so that we can avoid overfitting there as well and obtain better performance.
4. Regularized Logistic Regression
Regularized logistic regression is actually very similar to regularized linear regression.
Gradient Descent is also used:
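The update rule (shown as an image in the original) has exactly the same form as for regularized linear regression; the difference is that the hypothesis is now the sigmoid:

$$
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}, \qquad
\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \quad (j \ge 1),
$$

with θ0 updated without the λ term, as before.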
If we use regularization with an advanced optimization algorithm, we need to define a cost function for it; for those methods, what we have to do is write the function conventionally called costFunction. This custom costFunction takes the vector θ as input and returns two values: the cost jVal and the gradient. In Octave, we can then pass this function, as @costFunction, to the built-in fminunc routine (fminunc computes the unconstrained minimum of a function; @costFunction behaves much like a function pointer in C). fminunc returns the minimum of costFunction, that is, the minimum of the cost jVal we supplied, along with the corresponding solution vector θ.
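fminunc is an Octave routine; as a rough Python analogue (not from the original post), the same pattern — one function returning both the cost and the gradient, handed to a general-purpose minimizer — might look like this for regularized logistic regression. The data and λ value below are invented for illustration.

```python
# Hedged Python analogue of the Octave costFunction + fminunc pattern described above,
# using scipy.optimize.minimize; an illustrative sketch, not code from the original post.
import numpy as np
from scipy.optimize import minimize

def cost_function(theta, X, y, lam):
    """Return (jVal, gradient) for regularized logistic regression.
    theta: (n+1,) parameters; X: (m, n+1) with a leading column of ones; y: (m,) in {0,1}."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))                     # sigmoid hypothesis
    reg = (lam / (2 * m)) * np.sum(theta[1:] ** 2)           # do not regularize theta_0
    jval = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m + reg
    grad = X.T @ (h - y) / m
    grad[1:] += (lam / m) * theta[1:]                        # regularize all but theta_0
    return jval, grad

# Toy data, made up for illustration.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# jac=True tells the optimizer that cost_function returns (cost, gradient).
result = minimize(cost_function, x0=np.zeros(X.shape[1]),
                  args=(X, y, 1.0), jac=True, method="BFGS")
print(result.x)   # fitted theta
```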
The above method applies directly to regularized logistic regression.
5. Conclusion
From the last several articles it is easy to see that both linear regression and logistic regression can be extended by constructing polynomial features. However, as you will gradually discover, there are far more powerful nonlinear classifiers than polynomial regression. We will discuss them in the next article.