The previous chapter gave a brief overview of machine learning. From this chapter on, we work through the algorithms of machine learning step by step. Before turning to the classical algorithms, we first look at the "gradient descent" algorithm.

First, the algorithm background

As background, we reuse the example from the previous chapter, the relationship between house price and house size: by analyzing existing data samples that pair house sizes with prices, we want to predict the price of a new sample. This is a simple application of regression in supervised learning; when the output depends on each input variable only to the first degree, it is called a linear regression problem. Here is our basic approach to solving the problem:

[Figure: the basic approach to solving the problem]

First, a learning algorithm is run on the training set to obtain a hypothesis about the problem (essentially a mapping h(x) = y). Then, for a new house, the hypothesis takes its size as input and predicts its price. To make the later explanations easier, we also fix some notation:

1. m: the number of samples in the training set;

2. x: the input variable, also called the feature; in this example, the size of the house;

3. y: the output variable, also called the target variable; in this example, the house price;

4. (x, y): one training sample;

5. The i-th training sample: (x^(i), y^(i));

Second, gradient descent

For the house price example above, if we only consider linear relationships, then y is a linear function of x. For the sake of demonstration, suppose the price depends linearly on two input variables, the house size and the number of rooms. The hypothesis h(x) can then be written as a function of x; for convenience, the formulas used in the rest of this section are collected below:

[Figure: formulas (1) to (6)]
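Written out in standard notation, formulas (1) to (6) take roughly the following form. This is a reconstruction from the descriptions below, not a copy of the original figure. It assumes parameters named θ, a learning rate named α, the sample index j and the parameter index i (matching the pseudo-code later in the text).

```latex
\begin{align}
h_\theta(x) &= \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2
             = \sum_{i=0}^{2} \theta_i x_i, \qquad x_0 = 1 \tag{1} \\
J(\theta) &= \frac{1}{2} \sum_{j=1}^{m} \bigl( h_\theta(x^{(j)}) - y^{(j)} \bigr)^2 \tag{2} \\
\theta_i &:= \theta_i - \alpha \frac{\partial}{\partial \theta_i} J(\theta) \tag{3} \\
\theta_i &:= \theta_i - \alpha \bigl( h_\theta(x) - y \bigr) x_i \tag{4} \\
\theta_i &:= \theta_i - \alpha \sum_{j=1}^{m} \bigl( h_\theta(x^{(j)}) - y^{(j)} \bigr) x_i^{(j)} \tag{5} \\
\theta_i &:= \theta_i - \alpha \bigl( h_\theta(x^{(j)}) - y^{(j)} \bigr) x_i^{(j)} \tag{6}
\end{align}
```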

(1) is the hypothesis as a linear function of the two input variables, house size and number of rooms, where x0 is fixed at 1, x1 is the size of the house, and x2 is the number of rooms;

(2) defines J, our objective function: the function that best fits the training set is the one whose sum of squared errors over all m samples is smallest. The factor 1/2 in front only simplifies the later derivative and has no deeper meaning;

(3) is the gradient descent update rule. With all samples of the training set fixed, the unknowns are now the coefficients θ, so we keep adjusting θ in search of the value at which the J function reaches its minimum. Here α is a constant, the step size of each descent step (the learning rate): if it is too small, the algorithm takes too long to converge; if it is too large, an update may overshoot the minimum;

(4) is the gradient descent update when only a single training sample is considered;

(5) is the gradient descent update when all m samples are considered;

(6) is the stochastic gradient descent update;
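As a rough illustration of (1), (2) and (5), here is a minimal Python sketch of batch gradient descent. The function names, the toy data and the chosen step sizes are assumptions made for this example; a single input variable is used instead of two to keep it short.

```python
def h(theta, x):
    """Hypothesis (1): sum of theta_i * x_i, where x[0] is fixed at 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, xs, ys):
    """Objective (2): half the sum of squared errors over all samples."""
    return 0.5 * sum((h(theta, x) - y) ** 2 for x, y in zip(xs, ys))


def batch_gradient_descent(xs, ys, alpha, iterations):
    """Apply update (5) `iterations` times, starting from theta = 0."""
    theta = [0.0] * len(xs[0])
    for _ in range(iterations):
        errors = [h(theta, x) - y for x, y in zip(xs, ys)]
        # Each partial derivative sums over all m samples. All of them are
        # computed from the old theta before any coefficient is changed.
        grads = [sum(e * x[i] for e, x in zip(errors, xs))
                 for i in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta


# Made-up toy data, not real house prices: x = (x0, x1) with x0 = 1,
# and y = 1 + 2 * x1 exactly, so the best fit is theta = [1, 2].
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ys = [1.0, 3.0, 5.0, 7.0]

theta = batch_gradient_descent(xs, ys, alpha=0.05, iterations=2000)
print("theta:", theta)

# Effect of the step size after the same small number of iterations.
for alpha in (0.001, 0.05, 0.15):
    t = batch_gradient_descent(xs, ys, alpha=alpha, iterations=50)
    print("alpha =", alpha, "cost =", cost(t, xs, ys))
```

On this toy data, the smallest step size has not yet reached the minimum after 50 iterations, while the largest one overshoots so badly that the cost grows instead of shrinking.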

Note: the idea behind gradient descent is actually very simple. Picture the objective function as terrain, for example drawn as a contour map, with a person standing at some point on it. Before every step, the person looks for the direction that leads downhill fastest. That direction is the negative of the gradient, the vector of partial derivatives, at the current point. For example:

[Figure: illustration of gradient descent]

Similarly, we can also look at contour plots:

[Figure: contour plot view of gradient descent]
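The step from the general update (3) to the single-sample update (4) is a short chain-rule calculation. As a sketch, assuming the squared-error objective for one sample (x, y) and a linear hypothesis with coefficients written θ:

```latex
\begin{align*}
\frac{\partial}{\partial \theta_i} J(\theta)
  &= \frac{\partial}{\partial \theta_i} \, \frac{1}{2} \bigl( h_\theta(x) - y \bigr)^2 \\
  &= \bigl( h_\theta(x) - y \bigr) \cdot
     \frac{\partial}{\partial \theta_i} \Bigl( \sum_{k} \theta_k x_k - y \Bigr) \\
  &= \bigl( h_\theta(x) - y \bigr) \, x_i
\end{align*}
```

Substituting this into (3) gives (4); summing the same expression over all m samples gives (5).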

From (5) we can see that every update of a coefficient has to traverse the entire sample set, which is why this is called "batch gradient descent". For a very large sample set this is very inefficient. Hence "stochastic gradient descent": the idea is to update all parameter values using only one sample at a time, so that one pass over the data performs m*n updates (m is the number of samples, n is the number of parameters). The pseudo-code is:

Repeat {
    For j = 1 to m {
        apply update (6) with sample j (for all i)
    }
}
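The pseudo-code above might be turned into Python roughly as follows. This is a sketch only: the function name, the fixed number of passes standing in for "Repeat", and the toy data are assumptions made for illustration. Samples are visited in order, as in the pseudo-code; shuffling them on every pass is a common variation.

```python
def stochastic_gradient_descent(xs, ys, alpha, passes):
    """Run the Repeat/For loop: one update (6) per sample, `passes` times."""
    theta = [0.0] * len(xs[0])
    for _ in range(passes):                  # "Repeat"
        for x, y in zip(xs, ys):             # "For j = 1 to m"
            # Error of the current hypothesis on sample j only.
            error = sum(t * xi for t, xi in zip(theta, x)) - y
            # Update (6) for all i, using the same error for every theta_i.
            theta = [t - alpha * error * xi for t, xi in zip(theta, x)]
    return theta


# Made-up, noise-free toy data: y = 1 + 2 * x1, with x0 = 1.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ys = [1.0, 3.0, 5.0, 7.0]

theta = stochastic_gradient_descent(xs, ys, alpha=0.05, passes=2000)
print(theta)
```

Each inner step touches a single sample, so the coefficients already start moving after the first sample instead of after a full traversal of the data.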

Stochastic gradient descent usually makes progress faster than batch gradient descent on large sample sets, but its final result may be less accurate: it can keep swinging back and forth around the minimum instead of settling on it. Because each update subtracts the partial derivative, the steps shrink automatically as the gradient approaches zero near a minimum, so with a suitably small step size batch gradient descent converges. For linear regression with the squared-error objective, J is convex, so a local minimum is also the global minimum.
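The difference in final accuracy can be observed on a small example. The sketch below runs both variants with the same constant step size on made-up data that no straight line fits exactly; all names and numbers are assumptions chosen for illustration.

```python
def predict(theta, x):
    """Linear hypothesis: sum of theta_i * x_i, with x[0] fixed at 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, xs, ys):
    """Half the sum of squared errors over all samples."""
    return 0.5 * sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys))


def batch_step(theta, xs, ys, alpha):
    """One batch update: every coefficient uses the error of all samples."""
    errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
    return [t - alpha * sum(e * x[i] for e, x in zip(errors, xs))
            for i, t in enumerate(theta)]


def stochastic_pass(theta, xs, ys, alpha):
    """One pass of stochastic updates: one sample at a time, in order."""
    for x, y in zip(xs, ys):
        error = predict(theta, x) - y
        theta = [t - alpha * error * xi for t, xi in zip(theta, x)]
    return theta


# Made-up toy data with small deviations from a straight line.
xs = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
ys = [1.1, 2.9, 5.2, 6.8]

theta_batch = [0.0, 0.0]
theta_sgd = [0.0, 0.0]
for _ in range(2000):
    theta_batch = batch_step(theta_batch, xs, ys, alpha=0.05)
    theta_sgd = stochastic_pass(theta_sgd, xs, ys, alpha=0.05)

print("batch:     ", theta_batch, cost(theta_batch, xs, ys))
print("stochastic:", theta_sgd, cost(theta_sgd, xs, ys))
```

On this data the batch version ends at the least-squares fit, while the stochastic version keeps being pulled by individual samples and ends each pass near, but not at, that fit, with a slightly higher cost.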

This article is from the "Windhawk" blog; please keep this source link: http://windhawk.blog.51cto.com/729863/1632860

"Machine Learning" (2): Gradient descent algorithm