Mathematics in Machine Learning (1): Regression and Gradient Descent



Copyright Notice:

This article belongs to leftnoteasy and was published at http://leftnoteasy.cnblogs.com. If you reproduce it, please cite the source. Using this article for commercial purposes without the author's consent may result in legal liability.

Objective:

I previously wrote a post about the mathematics of Bayesian probability. Time has been tight recently and the coding workload is heavy, but I still find time to read machine learning books and watch videos. Two that I recommend: one is the Stanford Machine Learning open course, which can be downloaded from VeryCD; unfortunately there are no subtitles, but it is still watchable. The other is PRML (Pattern Recognition and Machine Learning) by Bishop, a very well received book published in 2006, so still fairly recent.

A few days ago I also started a series on distributed computing, but only wrote the opening post before turning to this series. I will see which topic I gain more experience with and continue that one. The things I have been working on recently are a mix of machine learning, mathematics, and distributed computing.

This series aims to describe machine learning in mathematical terms. To learn machine learning, you first need to understand the underlying mathematics. You do not have to be able to derive every intermediate formula with ease, but you should at least recognize these formulas; otherwise you will not be able to read related papers. The series will focus on the mathematical description of machine learning and will cover, but is not limited to, regression, clustering, classification, and other algorithms.

Regression and gradient descent:

In mathematics, regression means that, given a set of points, we fit a curve to them. If the curve is a straight line, it is called linear regression; if it is a quadratic curve, it is called quadratic regression. Regression has many variants, such as locally weighted regression, logistic regression, and so on, which will come up later.

Let's illustrate regression with a very simple example, one that appears in many places and in many open-source packages such as Weka. Suppose we want to build a housing price estimation system. The value of a house depends on many factors, such as its size, the number of rooms, its location, its orientation, and so on. These variables that affect the value of a house are called features. Features are a very important concept in machine learning, and many papers are devoted to them. For simplicity, let's assume here that the value of our house is affected by only one variable: its area (size).

Suppose we have the following home sales data:

Area (m^2)    Sales price (10,000 yuan)
123           250
150           320
87            160
102           220
...           ...

This table resembles housing prices around Beijing's 5th Ring Road. We can plot it with the x-axis as the area of the house and the y-axis as its price, as follows:

If a new house area comes along that is not in our sales price records, what do we do?

We can fit a curve to the data as accurately as possible; then, given a new input, we return the value of the corresponding point on the curve. If we fit with a straight line, it might look like this:

The green dots are the points we want to predict.
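As a rough sketch of what "fit a straight line and read the prediction off it" can look like in code, here is a least-squares line fit over the sample table above; the new 120 m^2 house is a made-up query, not part of the original data:

```python
import numpy as np

# Sample data from the table: area (m^2) vs. price (10,000 yuan).
areas = np.array([123.0, 150.0, 87.0, 102.0])
prices = np.array([250.0, 320.0, 160.0, 220.0])

# Fit a straight line (degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(areas, prices, deg=1)

# For a new area not in the records, return the value of the point on the line.
new_area = 120.0  # hypothetical query
print(intercept + slope * new_area)
```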

First, some concepts and commonly used notation; these may differ slightly between machine learning books.

The house sales records: the training set or training data, which is the input data of our process, commonly denoted x.

The house sales prices: the output data, commonly denoted y.

The fitted function (also called the hypothesis or model), generally written as y = h(x).

The number of training examples (#training set); each training example consists of a pair of input data and output data.

The dimension of the input data (the number of features, #features), n.

The following is a typical machine learning process: first we provide input data, our algorithm goes through a series of steps to obtain an estimation function, and this function is then able to produce estimates for new data it has not seen before. This is also known as building a model, just like the linear regression function above.

We use x1, x2, ..., xn to describe the components of the feature, for example x1 = the area of the house, x2 = the orientation of the house, and so on. We can then write an estimation function:
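Written in the standard notation from the Stanford notes (with two features for this housing example), the linear hypothesis is:

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2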

Here θ is called a parameter; it adjusts the influence of each component of the feature, that is, whether the area of the house or the location of the house matters more. If we set x0 = 1, we can write this in vector form:
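With x_0 = 1, the hypothesis becomes a dot product between the parameter vector and the feature vector:

h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x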

Our program also needs a mechanism to evaluate whether a given θ is good, so we need a way to evaluate our h function. This is done with a loss function (also called an error function), which describes how bad the h function is. In the following we call this function the J function.

Here we can define the following error function:
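With m training examples, written in the usual notation this is:

J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2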

This error function sums, over all training examples, the squared difference between the estimate for x(i) and the true value y(i). The factor of 1/2 in front is there for convenience when differentiating: it cancels the coefficient of 2 that the squared term produces.

There are many methods for adjusting θ so that J(θ) attains its minimum, including least squares, a purely analytical method. The final part of the relevant Stanford Machine Learning lecture derives the origin of the least-squares formula, and it can also be found in many machine learning and mathematics books. Here we will not discuss least squares, but instead talk about the gradient descent method.

The gradient descent method proceeds as follows:

1) First assign an initial value to θ; this can be random, or θ can be an all-zero vector.

2) Change the value of θ so that J(θ) decreases along the direction of steepest descent (the negative gradient).

To make this clearer, consider the following figure:

This is a plot of the error function J(θ) over the parameters θ. The red region is where J(θ) is high; we want to drive J(θ) to as low a value as possible, i.e. the dark blue region. θ0 and θ1 represent the two dimensions of the θ vector.

The first step of the gradient descent method described above is to give θ an initial value; assume the random initial value is at the cross marked on the figure.

We then adjust θ along the direction of steepest descent, which moves J(θ) toward lower values. The algorithm ends when θ descends to a point from which it cannot descend any further.

Of course, the point where gradient descent finally stops may not be the global minimum; it may be a local minimum, as in the following situation:

The figure above shows descent into a local minimum after re-selecting a different initial point. It appears that our algorithm can, to a large extent, be influenced by the choice of initial point and end up at a local minimum.

Here I will use an example to describe the gradient descent process, taking the partial derivative of our function J(θ): (if you do not follow the derivation, you can review your calculus)
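Working it out for a single training example (x, y) first (the full J simply sums these terms over all examples), the chain rule gives:

\frac{\partial}{\partial \theta_j} J(\theta)
  = \frac{\partial}{\partial \theta_j} \, \frac{1}{2} \left( h_\theta(x) - y \right)^2
  = \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right)
  = \left( h_\theta(x) - y \right) x_j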

The following is the update rule: θi is moved in the direction that decreases the gradient the most. The θi on the right represents the value before the update, the subtracted term represents the amount of decrease along the gradient direction, and α is the step size, i.e. how far to move in the direction of steepest descent at each step.
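In symbols, the update for each component is:

\theta_i := \theta_i - \alpha \, \frac{\partial}{\partial \theta_i} J(\theta)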

A very important point to note is that the gradient has a direction. For a vector θ, each component θi has its own partial-derivative direction; together they form an overall direction, and at each update we move along the direction of steepest descent until we reach a minimum, whether local or global.

Described in simpler mathematical language, step 2) is this:
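That is, the entire parameter vector is updated at once along the negative gradient:

\theta := \theta - \alpha \, \nabla_\theta J(\theta)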

The inverted triangle (∇) denotes the gradient. Expressed this way, the individual θi disappear; used well, vectors and matrices really do greatly simplify mathematical descriptions.
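To make the procedure concrete, here is a minimal sketch (not from the original article) of batch gradient descent for the one-feature housing data above; the learning rate, iteration count, and feature rescaling are illustrative choices, not prescribed values:

```python
import numpy as np

# Training data from the table above: area (m^2) and price (10,000 yuan).
areas = np.array([123.0, 150.0, 87.0, 102.0])
prices = np.array([250.0, 320.0, 160.0, 220.0])

# Design matrix with x0 = 1, so that h_theta(x) = theta^T x.
# The area is rescaled (divided by 100) purely to keep gradient descent stable.
X = np.column_stack([np.ones_like(areas), areas / 100.0])
y = prices

theta = np.zeros(2)            # step 1) initialize theta (all zeros here)
alpha = 0.1                    # step size alpha, chosen by hand
for _ in range(10000):         # step 2) repeatedly step against the gradient
    error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every example
    grad = X.T @ error / len(y)        # gradient of J(theta), averaged over examples
    theta -= alpha * grad

print("theta:", theta)
# Estimate the price of a new 120 m^2 house (hypothetical query).
print("estimate:", theta @ np.array([1.0, 120.0 / 100.0]))
```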

Summary and preview:

The content of this article is mainly taken from the second lecture of the Stanford course; I hope I have made it clear. The next article in this series will draw on the third lecture of the Stanford course and will go deeper into regression, logistic regression, and Newton's method. However, I do not want this series to become a copy of the Stanford course, so later articles will not necessarily follow it exactly.
