Knowledge points in this section:
Bayesian statistics and regularization
Online learning
How to apply machine learning algorithms to concrete problems: setting up diagnostics to quickly identify what is going wrong
Bayesian statistics and regularization (methods for preventing overfitting)
The idea is to find better estimation methods so that overfitting occurs less often.
Looking back at the estimation methods used so far: linear regression uses least squares, logistic regression uses maximum likelihood estimation of the conditional probability, naive Bayes uses maximum likelihood estimation of the joint probability, and the SVM is solved by quadratic programming.
Reposted from: http://52opencourse.com/133/coursera
Study notes for the seventh lecture, "Regularization," of Stanford University's Machine Learning course. This lecture has four main parts:
1) The Problem of Overfitting
2) Cost function
3) Regularized Linear Regression
4) Regularized Logistic Regression
The following is a detailed explanation of each part.
1) The Problem of Overfitting
Example of fitting problems: house price prediction with linear regression:
a) Underfitting (also called high bias)
b) Good fit
c) Overfitting (also called high variance)
What is overfitting?
If we have too many features, the learned hypothesis may fit the training set very well (driving the training cost \(J(\theta)\) close to 0), but it generalizes poorly to new data.
Overfitting example 2: logistic regression.
As in the previous example, in order: underfitting, good fit, and overfitting:
a) Underfitting
b) Good fit
c) Overfitting
How to address overfitting:
Overfitting often arises from having too many features. In the house price problem, for example, suppose we define the following features:
Then the fit to the training set can be made essentially perfect:
There are therefore two ways to address overfitting:
a) Reduce the number of features:
- manually select which features to keep;
- use a model selection algorithm (introduced later in the course).
b) Regularization:
- keep all the features, but reduce the magnitude/values of the parameters;
- regularization works well when there are many features, each of which contributes a small amount to predicting y (a small numerical sketch follows this list).
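As a concrete illustration (not from the original notes), the following minimal NumPy sketch fits a degree-9 polynomial to a handful of noisy points with and without an L2 penalty; the data, the degree, and the value of the regularization parameter are all illustrative assumptions. The regularized solution keeps every feature but shrinks the parameter values sharply.

```python
# A minimal sketch, assuming made-up data: degree-9 polynomial regression
# with and without an L2 penalty. Regularization keeps all features but
# shrinks the parameter values.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)   # noisy samples

degree = 9
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ..., x^9

# Ordinary least squares (tends to produce huge, wiggly coefficients here)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# Regularized (ridge) solution: (X^T X + lambda * M)^{-1} X^T y,
# where M is the identity with a 0 in the top-left so the intercept is not penalized.
lam = 1.0
M = np.eye(degree + 1)
M[0, 0] = 0.0
theta_reg = np.linalg.solve(X.T @ X + lam * M, X.T @ y)

print("largest |theta| without regularization:", np.abs(theta_ols).max())
print("largest |theta| with regularization:   ", np.abs(theta_reg).max())
```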
2) Cost function
Again starting from the house price prediction problem, this time using polynomial regression:
a) Good fit
b) Overfitting
Intuitively, to fix the overfitting in this example we would like to eliminate the influence of the high-order terms, i.e., make \(\theta_3 \approx 0\) and \(\theta_4 \approx 0\).
Suppose we penalize \(\theta_3\) and \(\theta_4\) so that they stay small. A simple way is to add two penalty terms to the original cost function, for example: \(\min_\theta \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2\).
Then, when this cost function is minimized, \(\theta_3 \approx 0\) and \(\theta_4 \approx 0\), so the high-order terms contribute almost nothing.
Regularization:
The advantages of small parameter values:
- a "simpler" hypothesis;
- less prone to overfitting.
For the house price problem:
- features include \(x_1, x_2, \ldots, x_{100}\);
- parameters include \(\theta_0, \theta_1, \ldots, \theta_{100}\).
We penalize all the parameters except \(\theta_0\); this is regularization:
Formal definition: the regularized cost function has the form \(J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]\),
where \(\lambda\) is called the regularization parameter. Our goal remains to minimize \(J(\theta)\).
For example, for the regularized linear regression model, we choose \(\theta\) to minimize the regularized cost function: \(\min_\theta \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]\).
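For readers who prefer code, here is a minimal sketch of the cost function above; the names (X, y, theta, lam) are my own assumptions, and X is assumed to already contain a leading column of ones.

```python
# A minimal sketch of the regularized linear-regression cost J(theta) above.
import numpy as np

def regularized_cost(theta, X, y, lam):
    m = y.size
    residuals = X @ theta - y               # h_theta(x^(i)) - y^(i) for every example
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 (the intercept) is not penalized
    return (np.sum(residuals ** 2) + penalty) / (2 * m)
```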
What happens if \(\lambda\) is set to an extremely large value (for our problem, say \(\lambda = 10^{10}\))? Consider the options:
- the algorithm still works fine, and setting \(\lambda\) very large does not affect it;
- the algorithm fails to eliminate overfitting;
- the algorithm results in underfitting (it fails to fit even the training data well);
- gradient descent fails to converge.
The outcome is underfitting: all parameters except \(\theta_0\) end up approximately 0, the hypothesis degenerates into a roughly flat line, and you get an underfitted curve similar to the following:
Regarding regularization, the following passage from section 1.5 of Dr. Hangyuan Li's "Statistical Learning Methods" describes it well:
The typical method of model selection is regularization. Regularization implements the structural risk minimization strategy: a regularization term (regularizer), also called a penalty term, is added to the empirical risk. The regularization term is generally a monotonically increasing function of model complexity; the more complex the model, the larger the regularization value. For example, the regularization term can be a norm of the model's parameter vector.
Regularization conforms to the principle of Occam's razor. Applied to model selection, Occam's razor leads to the following idea: among all possible models, choose the one that explains the known data well and is as simple as possible. From the viewpoint of Bayesian estimation, the regularization term corresponds to the prior probability of the model; one can assume that a complex model has a larger prior probability and a simple model has a smaller prior probability.
3) Regularized Linear Regression
Linear regression involves the cost function, the gradient descent algorithm, and the normal equation solution; readers who are unclear on these can review the notes for the second and third lessons. This part introduces, in turn, the cost function, gradient descent algorithm, and normal equation for regularized linear regression.
First, consider the cost function of linear regression after regularization: \(J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]\).
Our goal is still to minimize \(J(\theta)\) and thereby obtain the corresponding parameters \(\theta\). Gradient descent is one such optimization algorithm; because the regularized cost function has changed, the gradient descent updates change accordingly:
\(\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}\)
\(\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]\), for \(j = 1, 2, \ldots, n\).
Note that \(\theta_0\) and the other parameters \(\theta_j\) (\(j \ge 1\)) are updated by different rules: only the \(\theta_j\) with \(j \ge 1\) receive the extra \(\frac{\lambda}{m}\theta_j\) term.
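Below is a minimal sketch of one batch gradient descent step under these update rules; the names (X, y, theta, alpha, lam) are illustrative assumptions, and X is assumed to contain a leading column of ones.

```python
# A minimal sketch of one regularized gradient-descent step for linear regression.
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    m = y.size
    error = X @ theta - y               # h_theta(x^(i)) - y^(i) for every example
    grad = (X.T @ error) / m            # unregularized gradient for all parameters
    grad[1:] += (lam / m) * theta[1:]   # add (lambda/m) * theta_j only for j >= 1
    return theta - alpha * grad
```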
The normal equation also changes. With
X an \(m \times (n+1)\) matrix and
y an m-dimensional vector,
the normal equation for regularized linear regression is \(\theta = \left(X^T X + \lambda M\right)^{-1} X^T y\), where \(M\) is the \((n+1)\times(n+1)\) matrix equal to the identity except that its top-left entry (the one corresponding to \(\theta_0\)) is 0.
Suppose the number of examples m is less than or equal to the number of features n. Without regularization, the linear regression normal equation is \(\theta = (X^T X)^{-1} X^T y\), and in this case \(X^T X\) may be non-invertible.
What if it is not invertible? The earlier remedy was to delete some redundant features, but once linear regression is regularized, the formula above remains valid as long as \(\lambda > 0\):
the matrix in parentheses is guaranteed to be invertible.
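A small numerical check of this invertibility claim (with made-up data and an assumed \(\lambda\)): with fewer examples than features, \(X^T X\) is singular, yet the regularized system can still be solved.

```python
# A small illustrative check: m <= n makes X^T X singular, but adding
# lambda * M (identity with a 0 for the intercept) makes the system solvable.
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 10                                   # fewer examples than features
X = np.column_stack([np.ones(m), rng.standard_normal((m, n))])   # m x (n+1)
y = rng.standard_normal(m)

lam = 0.1
M = np.eye(n + 1)
M[0, 0] = 0.0                                  # do not regularize theta_0

print("rank of X^T X:", np.linalg.matrix_rank(X.T @ X), "out of", n + 1)  # rank-deficient
theta = np.linalg.solve(X.T @ X + lam * M, X.T @ y)                       # works for lambda > 0
print("theta:", theta)
```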
4) Regularized Logistic Regression
Similar to linear regression, the cost function of logistic regression also needs a regularization term (penalty term), and the gradient descent algorithm likewise needs to treat \(\theta_0\) differently from the other parameters.
First, recall what overfitting looks like for logistic regression, as in the following example:
One such hypothesis is a high-order polynomial of the input features passed through the sigmoid, e.g. \(h_\theta(x) = g\left(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \ldots\right)\).
The regularized logistic regression cost function is: \(J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2\).
The gradient descent algorithm is:
\(\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}\)
\(\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right]\), for \(j = 1, 2, \ldots, n\),
where \(h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}\).
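As with linear regression, a minimal code sketch of this cost and gradient may help; the names are illustrative assumptions, X is assumed to contain a leading column of ones, and \(\theta_0\) is not penalized.

```python
# A minimal sketch of the regularized logistic-regression cost and gradient above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(theta, X, y, lam):
    m = y.size
    h = sigmoid(X @ theta)                                  # h_theta(x) for every example
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    cost += (lam / (2 * m)) * np.sum(theta[1:] ** 2)        # penalty excludes theta_0
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]
    return cost, grad
```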
References:
The courseware for the seventh lesson, "Regularization," can be downloaded, and the videos viewed or downloaded, on the Coursera machine learning course page: https://class.coursera.org/ml
PPT PDF
Dr. Hangyuan Li, "Statistical Learning Methods"
Http://en.wikipedia.org/wiki/Regularization_%28mathematics%29
Http://en.wikipedia.org/wiki/Overfitting
Online learning
The algorithms discussed so far are batch algorithms: a model is first learned on the training set, and only afterwards is it evaluated on the test set (or on the training set itself) to obtain the training error and generalization error. Online learning works differently. We start with an initial classifier. When the first sample arrives, we predict its label and record the prediction, then use that sample's information to update the classifier (consider, for example, the perceptron update rule; see notes 1-2). When the second sample arrives we do the same, and so on. In this way we obtain a prediction for each of the m samples, but all of these predictions are made during training, and tallying them gives the online training error. This is how online learning differs procedurally from batch learning.
For the perceptron algorithm, the online learning algorithm converges if the positive and negative samples are linearly separable.
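The following is a minimal sketch (my own illustration, with assumed names) of online learning with the perceptron update rule: each arriving sample is first predicted, the prediction is recorded, and the weights are updated only when the prediction is wrong.

```python
# A minimal sketch of online perceptron learning.
import numpy as np

def online_perceptron(X, y):
    """X: (m, n) samples processed in arrival order; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes = 0
    for x_i, y_i in zip(X, y):
        prediction = 1 if (w @ x_i + b) >= 0 else -1   # predict before updating
        if prediction != y_i:                          # mistake-driven update
            mistakes += 1
            w += y_i * x_i
            b += y_i
    return w, b, mistakes      # mistakes / len(y) is the online training error
```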
The following is reposted from: http://blog.csdn.net/stdcoutzyx
CS229 Lecture 11: Bayesian statistics and regularization