3. Bayesian Statistics and Regularization
Contents
3. Bayesian Statistics and Regularization
3.1 Underfitting and Overfitting
3.2 Bayesian Statistics and Regularization
3.3 Optimizing the Cost Function by Regularization
3.3.1 Regularized Linear Regression
3.3.2 Regularized Logistic Regression
3.4 Advanced Optimization
3.1 Underfitting and Overfitting
We have already studied the linear regression model and the logistic regression model and applied them in many settings, for example using linear regression (possibly with polynomial features) to predict house prices, or using logistic regression to classify spam. However, problems can arise in practice; the two most common are overfitting and its opposite, underfitting.
Overfitting, simply put, means that the learning model we designed is too powerful for the training sample, so that it fits the training data extremely well. At this point students may wonder: isn't fitting well a good thing, so why is it a problem? Note that the purpose of a learning model is not to fit the training sample; we train the model so that it can make good predictions on data that is not in the training set. The training set is only a subset of the whole data set we are studying: we assume it shares the characteristics of the rest of that data set, but it usually also has peculiarities of its own. If the model's learning capacity is too strong, it also learns those peculiarities and fits the training sample too well, that is, it overfits; its predictions on data that belong to the full data set but not to the training set then suffer, in other words its generalization ability declines. Underfitting, on the other hand, means that the model fits the training sample too poorly and fails to learn even the general characteristics of the data set. Mathematically, underfitting leads to high bias, while overfitting leads to high variance.
To visualize under-fitting and overfitting, Figure 3-1 shows them for linear regression on a house-price prediction problem, and Figure 3-2 shows them for logistic regression on a 0-1 classification problem.
Figure 3-1 Under-fitting and over-fitting in linear regression
Figure 3-2 Under-fitting and over-fitting in logistic regression on a 0-1 classification problem
In general, under-fitting is easier to fix: in linear regression or logistic regression, for example, we can add new features or use higher-degree polynomial features. Overfitting is harder to control, because the situation is inherently contradictory: we assume that the chosen training set largely represents the whole data set, so we want the model to fit it well, yet we also know that the training set inevitably has characteristics that do not generalize, so our learning model will more or less pick up features unique to the training set. Nonetheless, there are measures that reduce the risk of overfitting:
- Reduce the number of features
  - Manually keep only the features we believe generalize, and discard features that may matter only in the training set.
  - Use a model selection algorithm.
- Regularization
3.2 Bayesian Statistics and Regularization
The basic idea of regularization is to keep all the features, but to shrink the parameters θ so that no single feature can have an outsized effect.
The following gives an understanding of regularization from the viewpoint of the Bayesian school of statistics.
Previously we estimated the parameters θ by maximum likelihood (ML) and derived the cost function from it: θ should be the value that maximizes the likelihood function, which is the same as minimizing the cost function, that is
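In the notation of the earlier chapters, with training set S = {(x^{(i)}, y^{(i)}); i = 1, ..., m}:

\theta_{ML} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}; \theta\big)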
So in maximum likelihood estimation, θ is a parameter whose value we do not know, not a random variable; this is the viewpoint of the frequentist school of statistics. In this view θ is not random (and so has no distribution); it is a constant equal to some fixed value, and our job is to estimate it by a statistically sound procedure such as maximum likelihood.
The Bayesian school, however, regards θ as an unknown random variable: before we train on the training set, θ is assumed to follow some distribution p(θ), which we call the prior distribution. Given a training set S, if we want to make a prediction for a new input, we can compute the posterior distribution of θ with Bayes' rule, i.e.:
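Written out explicitly, the posterior is

p(\theta \mid S) = \frac{\big(\prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)\big)\, p(\theta)}{\int_{\theta} \big(\prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)\big)\, p(\theta)\, d\theta} \qquad (1)

and the prediction for a new input x is then obtained by integrating over this posterior:

p(y \mid x, S) = \int_{\theta} p(y \mid x, \theta)\, p(\theta \mid S)\, d\theta \qquad (2)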
The above is the fully Bayesian approach to prediction, but in practice the posterior distribution of θ is hard to compute, because equation (1) requires integrating over θ, and θ is usually high-dimensional, so the integral is rarely tractable.
So in practical applications we usually approximate the posterior of θ. A common approximation is to use a single point estimate of θ in place of the full posterior in equation (2). The MAP (maximum a posteriori) estimate is as follows:
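In the same notation, the point estimate is

\theta_{MAP} = \arg\max_{\theta} \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}, \theta\big)\, p(\theta) \qquad (3)

and a prediction for a new input is then made with \theta_{MAP} exactly as it would be with \theta_{ML}.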
Comparing equation (3) with the maximum likelihood estimate, we see that the only difference is the extra factor of the prior probability p(θ).
In practical applications it is often assumed that θ ~ N(0, τ²I) (other choices of prior are of course possible). In practice, the Bayesian MAP estimate does a better job of reducing overfitting than the maximum likelihood estimate. For example, Bayesian logistic regression has been used for text classification problems in which the number of features is far larger than the number of training samples.
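To see the connection with regularization, here is a short derivation under the Gaussian-prior assumption above. Taking the logarithm of (3),

\theta_{MAP} = \arg\max_{\theta} \left[ \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}, \theta\big) - \frac{\|\theta\|^2}{2\tau^2} \right]

so maximizing the posterior is the same as maximizing the log-likelihood minus a penalty proportional to \|\theta\|^2, i.e. minimizing the original cost function plus a regularization term of exactly the kind introduced in the next section (the smaller τ² is, the stronger the regularization).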
3.3 Optimizing the Cost Function by Regularization
Here is how regularization is used to improve the cost function; first, an intuitive example. As shown in Figure 3-3, the first hypothesis overfits because the degree of the polynomial is too high. But if we add 1000*theta3^2 + 1000*theta4^2 to the cost function, then in order to make the cost as small as possible, the optimization (iteration) process will drive theta3 and theta4 close to 0, the influence of the two high-order terms of the polynomial shrinks, and the overfitting is improved. This amounts to penalizing the features that do not generalize.
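Concretely, assuming the hypothesis in Figure 3-3 is the usual fourth-degree polynomial h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4, the modified cost is

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2

and any θ that makes J(θ) small must keep θ3 and θ4 close to 0, since otherwise the two penalty terms dominate.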
Figure 3-3 An intuitive picture of regularization
3.3.1 Regularized Linear Regression
In general, the regularized cost function for the linear regression model is as follows:
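In the notation of the earlier chapters:

J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]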
(Note that the regularization term does not include θ0.)
The value of λ must be chosen appropriately: if it is too large (say 10^10), every θj is driven toward 0, none of the features are learned, and the result is under-fitting. How to choose λ is discussed later; for now we simply take it to lie somewhere between 0 and 10.
Now that the cost function has changed, the gradient descent update must of course change accordingly, as follows:
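Repeat until convergence (note that the θ0 update has no regularization term):

\theta_0 := \theta_0 - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_0^{(i)}

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)} + \frac{\lambda}{m}\,\theta_j \right], \qquad j = 1, \dots, n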
The other way of solving linear regression, the normal equations, can also be regularized, in the following way:
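\theta = \left( X^{T}X + \lambda \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \right)^{-1} X^{T} y

Here the matrix multiplying λ is (n+1) × (n+1): its top-left entry is 0 (so θ0 is not regularized) and the remaining diagonal entries are 1.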
From section 1.2.3 we know that if the number of training samples m is less than or equal to the number of features n, then X'X is not invertible (in MATLAB, pinv will still return its pseudo-inverse). However, if λ > 0, the matrix X'X plus λ times the matrix above is always invertible.
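As a minimal MATLAB sketch of this regularized normal equation (assuming, as in the code further below, that X already contains the column of ones and that y and lambda are defined):

% regularized normal equation (sketch)
n = size(X, 2) - 1;          % number of features, excluding the intercept column
L = eye(n + 1);              % (n+1) x (n+1) identity matrix
L(1, 1) = 0;                 % do not regularize theta_0
theta = (X' * X + lambda * L) \ (X' * y);   % solve the regularized system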
3.3.2 Regularized Logistic Regression
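The treatment is the same as for linear regression: we keep the logistic regression hypothesis h_\theta(x) = 1 / (1 + e^{-\theta^{T} x}) and add the same penalty term to its cost function (this is the cost that the MATLAB code in section 3.4 computes):

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

The gradient descent update then takes exactly the same form as for regularized linear regression; only the definition of h_\theta differs.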
3.4 Advanced Optimization
In practical applications we usually do not implement gradient descent ourselves to optimize the objective function; instead we use a function from the programming language's library, for example the fminunc function in MATLAB. We then only need to write a function that computes the cost and its gradient. For logistic regression this is shown below (note that in MATLAB vector indices start at 1, so θ0 corresponds to theta(1)).
The MATLAB implementation of the regularized logistic regression cost function is as follows:
function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
%   J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. the parameters.

m = length(y);           % number of training examples
n = size(X, 2);          % number of parameters (intercept column included)
grad = zeros(size(theta));

h = sigmoid(X * theta);  % hypothesis values via the sigmoid function

% regularized cost: theta(1), i.e. theta_0, is not regularized
J = sum(-y .* log(h) - (1 - y) .* log(1 - h)) / m ...
    + lambda * sum(theta(2:n) .^ 2) / (2 * m);

% gradient: no regularization term for theta(1)
grad(1) = sum((h - y) .* X(:, 1)) / m;
for i = 2:n
    grad(i) = sum((h - y) .* X(:, i)) / m + lambda * theta(i) / m;
end

end
The MATLAB snippet that calls it looks like this:
% Initialize fitting parameters
initial_theta = zeros(size(X, 2), 1);

% Set regularization parameter lambda to 1 (you can vary this)
lambda = 1;

% Set options
options = optimset('GradObj', 'on', 'MaxIter', 400);

% Optimize
[theta, J, exit_flag] = ...
    fminunc(@(t)(costFunctionReg(t, X, y, lambda)), initial_theta, options);