Machine Learning Lesson 1

I recently watched the machine learning lecture videos by Andrew Ng at Stanford University, so I want to summarize the methods I have learned. The algorithms covered below are all commonly used in the machine learning field: linear regression, gradient descent, the normal equations, locally weighted linear regression, logistic regression, and Newton's method. Since I have started studying machine learning, my earlier Java, C++, and ACM practice is on hold for now. Also, for convenience when writing papers later, some technical terms in this article are kept in English. I hope to learn and exchange ideas with you all, and I wish you a happy National Day!

Let's start with linear regression. We all touched on the idea of linear regression in high school: given a series of points (as shown in the figure), we fit a straight line to predict the positions of all the points. This model is called linear regression, and because the fitted line is continuous, it gives an output even for inputs we did not provide. Now we have a problem: how can we accurately fit this line from a set of data? There are two common methods: 1. gradient descent and 2. the normal equations. In fact, the two methods share the same idea; the normal equations are a closed-form solution that, under the right conditions, directly yields the answer gradient descent converges to. We will discuss both in detail below.

Let's set up the mathematical model. Suppose we have an input (x₁, x₂) representing two features, and θ = (θ₁, θ₂) holds the corresponding weight of each feature. Then our predicted output is:

h_θ(x) = θ₁x₁ + θ₂x₂ = θᵀx

The cost function is defined as:

J(θ) = (1/2) Σᵢ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

We can see from the formula that J(θ) is quadratic, and gradient descent iterates on this quadratic function to reach its minimum. Next we explain how to use gradient descent to minimize the cost function, starting from the following update rule:

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ

Here θ is the parameter vector after each iteration's update, and α is the learning rate (the step size of each iteration). The cost function measures how far the parameters deviate from the data. From basic calculus we know that the partial derivatives form the gradient, the direction in which the function changes fastest, so by stepping against the gradient and iterating until the function converges we reach the minimum value. As shown in the contour-line figure, the minimum is found by descending continually along the arrows. It is worth noting that if the cost function is not a convex function and has multiple extrema, we may fall into a local optimum. Furthermore, according to the mathematical derivation, the flatter the gradient, the slower the iteration. (I didn't fully follow the derivation; that's what the experts say, anyway.)

At this point we have made the gradient descent method reasonably clear. Next, let's talk about the normal equations. Their form is:

θ = (XᵀX)⁻¹ Xᵀ y

That is, we can find the parameter vector directly, with no iteration. The derivation in the video handout involves a fair amount of matrix calculus; in short, writing the cost as J(θ) = (1/2)‖Xθ − y‖² and setting its gradient ∇_θ J(θ) = XᵀXθ − Xᵀy to zero gives XᵀXθ = Xᵀy, and solving for θ yields the formula above.
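To make the comparison concrete, here is a minimal sketch (my own illustration, not code from the lecture) that fits one noisy line both ways; the toy data, the learning rate alpha, and the iteration count are assumptions made up for the example:

```python
import numpy as np

# Toy data: a noisy line y = 2 + 3x. The data, alpha, and iteration
# count are made-up assumptions for illustration.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 10, 50)])  # intercept column + feature
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 1, 50)

# Batch gradient descent: theta_j := theta_j - alpha * dJ/dtheta_j
theta = np.zeros(2)
alpha = 0.01                        # learning rate (step size per iteration)
for _ in range(5000):               # iterate until (approximately) converged
    grad = X.T @ (X @ theta - y)    # gradient of J(theta) = 1/2 ||X theta - y||^2
    theta -= alpha * grad / len(y)  # average over examples, then step downhill

# Normal equation: solve X^T X theta = X^T y directly, no iteration
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

print("gradient descent:", theta)         # both near [2, 3]
print("normal equation :", theta_closed)
```

Both printouts should land near the true parameters (2, 3): the normal equation jumps straight to the minimum that gradient descent approaches step by step.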

So far we have covered least squares via the normal equations. Next, we briefly explain locally weighted linear regression, a non-linear regression method built on top of linear regression. The idea is as follows: given a set of inputs, for each query point we select its local region and run linear regression within that region to obtain a local regression line; combining the regression lines from all the local regions yields the locally weighted fit.
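Here is a minimal sketch of the idea, assuming the Gaussian weighting w⁽ⁱ⁾ = exp(−(x⁽ⁱ⁾ − x)² / 2τ²) commonly used with this method; the bandwidth tau and the sine-shaped toy data are my own choices for illustration:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Fit a weighted least-squares line around x_query and evaluate it there."""
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))  # local Gaussian weights
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

# Non-linear toy data that a single global line cannot fit
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
X = np.column_stack([np.ones_like(x), x])  # intercept column + feature
y = np.sin(x) + rng.normal(0, 0.1, x.size)

print(lwr_predict(np.pi / 2, X, y))  # close to sin(pi/2) = 1
print(lwr_predict(np.pi, X, y))      # close to sin(pi) = 0
```

Note the trade-off: each prediction solves its own small weighted regression, so the method can follow curves a single global line cannot, at the cost of redoing the fit for every query point.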

Next, let's learn about logistic regression. Logistic regression differs from linear regression in that it is non-linear. Let's walk through the construction. In linear regression the prediction function is h_θ(x) = θᵀx; in logistic regression we instead define the prediction function:

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

[Figure: the S-shaped sigmoid curve g(z), rising from 0 to 1 with g(0) = 0.5.]

As the figure shows, the value range of g(z) is 0 to 1. This lets us treat g(z) as a probability and predict the output by maximizing that probability, which is where the maximum likelihood function from probability theory comes in:

L(θ) = ∏ᵢ p(y⁽ⁱ⁾ | x⁽ⁱ⁾; θ)

According to the above analysis, given θ, the probability of observing output y for input x is:

p(y | x; θ) = (h_θ(x))^y · (1 − h_θ(x))^(1−y)

This time we want the maximum of the objective function rather than the minimum, so the method is called gradient ascent: we take the log-likelihood ℓ(θ) = log L(θ) and move θ along its gradient, finding the parameters under which the observed outputs are most probable for the given inputs. In practice we often face binary classification problems; since h_θ(x) is a value in (0, 1), we can map it to the classes 0 and 1, for example by thresholding at 0.5.
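Concretely, the gradient that gradient ascent climbs is the standard result obtained by differentiating log L(θ) and using the identity g′(z) = g(z)(1 − g(z)):

∂ℓ(θ)/∂θⱼ = Σᵢ (y⁽ⁱ⁾ − h_θ(x⁽ⁱ⁾)) xⱼ⁽ⁱ⁾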

This is the basic principle of logistic regression: use gradient ascent to maximize the likelihood function and obtain the optimal parameters. Because the non-linear sigmoid function is introduced, the final regression curve is non-linear.
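Here is a minimal sketch (mine, not from the lecture) of logistic regression trained by gradient ascent, using the update θ := θ + α Xᵀ(y − h) that follows from the gradient above; the toy data, alpha, and iteration count are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary data: label is 1 when the (noisy) feature is positive
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), x1])               # intercept column + feature
y = (x1 + rng.normal(0, 0.5, n) > 0).astype(float)

# Gradient ascent on the log-likelihood: theta := theta + alpha * X^T (y - h)
theta = np.zeros(2)
alpha = 0.1
for _ in range(1000):
    h = sigmoid(X @ theta)              # predicted probabilities in (0, 1)
    theta += alpha * X.T @ (y - h) / n  # step uphill on l(theta)

# Map probabilities to the classes 0 and 1 by thresholding at 0.5
pred = (sigmoid(X @ theta) >= 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```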

 

Finally, let's talk about Newton's method, which converges faster than gradient ascent and is a common method for solving unconstrained optimization problems. First, look at the following set of images:

[Figure: panels showing Newton's method on a function f(θ), starting from θ = 4.5 and repeatedly following tangent lines down to the x-axis.]

Given the function, we can see from the second figure that θ is initialized to 4.5. How do we reach the point where the function crosses zero in only a few iterations? The method is to take the tangent to the function at the current θ and use the intersection of that tangent with the x-axis as the updated value of θ. Written out, the update is:

θ := θ − f(θ) / f′(θ)

If we apply the same update to the derivative of the likelihood function, we are solving ℓ′(θ) = 0, which locates the maximum point. The update can then be written as:

θ := θ − ℓ′(θ) / ℓ″(θ)

In the multi-dimensional case θ is a vector, and the update uses the matrix H, called the Hessian matrix, with entries H_ij = ∂²ℓ(θ)/∂θᵢ∂θⱼ:

θ := θ − H⁻¹ ∇_θ ℓ(θ)

In this way, by updating repeatedly, we can find the θ that maximizes the likelihood function. That is the idea of Newton's method.
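Here is a minimal sketch of the update; the example functions are made up for illustration and are not from the handout:

```python
# Newton's method: jump to where the tangent line crosses the x-axis.
def newton(f, f_prime, theta, steps=10):
    for _ in range(steps):
        theta = theta - f(theta) / f_prime(theta)  # theta := theta - f(theta)/f'(theta)
    return theta

# Root finding on a made-up f(theta) = theta^2 - 2, starting from theta = 4.5
print(newton(lambda t: t ** 2 - 2, lambda t: 2 * t, 4.5))  # ~1.41421 = sqrt(2)

# To maximize a log-likelihood l(theta), apply the same update to its
# derivative: theta := theta - l'(theta)/l''(theta). Made-up example:
# l(theta) = log(theta) - theta, whose maximum is at theta = 1.
l_prime = lambda t: 1.0 / t - 1.0      # l'(theta)
l_second = lambda t: -1.0 / t ** 2     # l''(theta)
print(newton(l_prime, l_second, 0.5))  # converges to 1.0
```

Near the solution, each tangent jump roughly doubles the number of correct digits (quadratic convergence), which is why Newton's method typically needs far fewer iterations than gradient ascent.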
