Stanford CS229 Machine Learning Course Notes I: Linear Regression and the Gradient Descent Algorithm


Around this time last year, I started getting into machine learning; my introductory book was "Introduction to Data Mining." I devoured the chapters on the well-known classifiers: decision trees, naive Bayes, SVMs, neural networks, random forests, and so on. I also took reviewing statistics seriously, learned linear regression, and did some classification and prediction work with Orange, SPSS, and R. Still, I was not very confident telling people I worked in machine learning: compared with the formally trained, my grasp of these models and algorithms amounted to "knowing the what but not the why," and something always felt off when I used them.

So, early last year, I downloaded the Stanford CS229 course from NetEase Open Course along with the corresponding lecture notes. But every time I wanted to study, the sight of hour-long episodes made me feel I would never find a large enough block of time. Fortunately, going home for the Spring Festival left me no other excuse not to study, and so these notes came about. As of this afternoon I have just finished the first four lectures, through Andrew Ng's coverage of generalized linear models (GLMs). My only regret is not having watched it sooner. I recommend this course to everyone reading this article (even though it dates from 2007).

Three elements of machine learning

The three elements of machine learning are: model, strategy, and algorithm.
This framing comes from "Statistical Learning Methods" by Hang Li. The saying "foreign teachers are good at vivid examples, Chinese teachers are good at summarizing" rings true; a well-made summary helps clarify one's thinking. Perhaps you have had doubts like mine: "Is linear regression a model or an algorithm?" "Is SVM a model or an algorithm?" Below, I organize my study notes from the perspectives of model, strategy, and algorithm, combined with my own thinking.

Linear regression

1. The three elements

Model: the conditional probability distribution or decision function to be learned.
I believe the linear regression model is familiar to everyone (it is high-school material):
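$$ h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x $$

(with the convention $x_0 = 1$, so that $\theta_0$ is the intercept term).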

Strategy: the criterion by which to learn, that is, how to select the optimal model.
Anyone who has studied linear regression will remember minimizing the mean squared error, the so-called least squares (the corresponding linear regression module in SPSS is called OLS, ordinary least squares):
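$$ J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 $$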

Algorithm: the computational method for finding the optimal model from the training data under the chosen strategy.
Computing the value of each θi in the model usually reduces to an optimization problem. For linear regression, we know there is an analytic solution, namely the normal equations:
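$$ \theta = (X^T X)^{-1} X^T \vec{y} $$

where $X$ is the design matrix whose rows are the training inputs and $\vec{y}$ is the vector of targets.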

Since I do not do research, I did not follow the derivation of the analytic solution closely. (I suspect many people see that complicated derivation in the middle of the second episode and give up on the course.) Before deriving the analytic solution, Ng also introduces a very important algorithm:

2. The gradient descent algorithm

The metaphor in the course is vivid: minimizing the loss function as fast as possible is like finding the fastest way down a hill; every step should follow the steepest downhill direction, and that direction is given by the partial derivatives of the loss function. So the iteration rule of the algorithm is:
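$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) $$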

where α is the learning rate, a parameter of the algorithm: the larger α is, the bigger each step and the faster the descent, but too large a step can overshoot and keep the algorithm from converging accurately.
In addition, for the linear regression problem, the J function (the sum of squared errors) is convex, shaped like a bowl, so there is always a single global optimum and we need not worry much about the algorithm converging to a local optimum.

When the training set contains more than one sample, there are two variants of the algorithm:
Batch gradient descent
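Each update sweeps over every training example:

$$ \theta_j := \theta_j + \alpha \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \qquad \text{(for every } j\text{)} $$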

Stochastic gradient descent (also called incremental gradient descent)
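Each update uses a single training example i, cycling through the training set:

$$ \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} \qquad \text{(for every } j\text{)} $$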

When the training set is very large, each step of batch gradient descent traverses the entire training set, which is very costly, while stochastic gradient descent uses only a single sample per step, so the latter is faster than the former. Moreover, although stochastic gradient descent may never converge exactly, in practice the result is in most cases a good enough approximation to the true minimum.
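The contrast is easy to see in code. Here is a minimal NumPy sketch of both variants (the function names, toy data, and step sizes are my own illustrative choices, not from the course):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n) design matrix (include a column of ones for the intercept).
    y: (m,) target vector.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient of J(theta) = 0.5 * sum((X @ theta - y)^2), over the whole set
        grad = X.T @ (X @ theta - y)
        theta -= alpha * grad / m  # average so alpha's scale is independent of m
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10):
    """Stochastic (incremental) gradient descent: one example per update."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):
            err = X[i] @ theta - y[i]
            theta -= alpha * err * X[i]  # update from this single example
    return theta

# Usage: recover theta ~ [1, 2] from noisy toy data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])
y = X @ np.array([1.0, 2.0]) + 0.01 * rng.standard_normal(100)
print(batch_gradient_descent(X, y, alpha=0.5, n_iters=5000))       # near-exact
print(stochastic_gradient_descent(X, y, alpha=0.1, n_epochs=50))   # approximate
```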

3. Why do we use the sum of squared errors as the strategy, rather than absolute error or some other loss function?

First, let us review the model and assumptions of linear regression:
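The target is modeled as a linear function of the input plus a random error term:

$$ y^{(i)} = \theta^T x^{(i)} + \varepsilon^{(i)} $$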

ε^(i) ∼ N(0, σ²): the random error ε follows a normal (Gaussian) distribution with mean 0 and variance σ²
The ε^(i) are IID: the random errors are independently and identically distributed
The conditional probability distribution of the target variable then follows:
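$$ p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right) $$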

The likelihood function of the whole training set and the log-likelihood function are as follows:
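$$ L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) $$

$$ \ell(\theta) = \log L(\theta) = m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 $$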

Therefore, maximizing the log-likelihood function is equivalent to minimizing
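$$ \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 $$

which is exactly the least-squares cost J(θ). In other words, under the assumption of IID Gaussian noise, least squares is maximum likelihood estimation.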

4. Locally weighted linear regression (loess)

The difference between loess and linear regression
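Where linear regression fits θ once by minimizing the plain sum of squared errors, loess fits θ anew around each query point x by minimizing a weighted sum:

$$ \sum_{i=1}^{m} w^{(i)} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 $$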

One choice for the weight function w is:
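$$ w^{(i)} = \exp\left( -\frac{\left( x^{(i)} - x \right)^2}{2\tau^2} \right) $$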

The smaller |x^(i) − x| is, the closer the weight w^(i) is to 1; the larger it is, the closer the weight is to 0.
τ is called the bandwidth parameter; it controls how quickly the weight falls off with distance: the larger τ is, the slower the decay.

    • Loess is a non-parametric algorithm: for each new input, the parameters must be re-fitted on the training set at prediction time.
    • Linear regression is a parametric algorithm: the number of parameters is fixed, and once they are fitted, prediction no longer needs the training set.
    • Loess can reduce the need for feature selection (e.g., whether to add a higher-order term for a feature); see the sketch after this list.
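To make the non-parametric point concrete, here is a minimal NumPy sketch of a loess prediction at a single query point (the function name, bandwidth value, and toy data are my own illustrative choices); note how the weighted fit is redone for every query:

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression prediction at one query point.

    X: (m, n) design matrix (include a column of ones for the intercept).
    y: (m,) targets.  x_query: (n,) query point.  tau: bandwidth.
    Solves the weighted normal equations (X^T W X) theta = X^T W y.
    """
    # Gaussian weights centered at the query point
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Usage: locally linear fits track a nonlinear curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 200))
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * rng.standard_normal(200)
print(lwr_predict(X, y, np.array([1.0, 0.5]), tau=0.3))  # ~ sin(0.5)
```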
Summary

I originally wanted to write up the first four lectures in one breath, but the material turned out to be too much, so I will split the notes. To summarize the takeaways:

      1. Memorized the three elements of machine learning, so I will no longer confuse models with algorithms.
      2. Understood the gradient descent algorithm, including batch gradient descent and stochastic gradient descent.
      3. Understood, from the perspective of maximizing the likelihood function, why linear regression uses squared error as its loss function.
      4. Understood a non-parametric algorithm: locally weighted linear regression.
