**Two. Least squares**

We explain least squares using the simplest case: the unary (single-variable) linear model. What is a unary linear model? In supervised learning, if the predicted variable is discrete, we call the task classification (e.g., decision trees, support vector machines); if the predicted variable is continuous, we call it regression. In regression analysis, if there is only one independent variable and one dependent variable, and the relationship between the two can be approximated by a straight line, the analysis is called unary (simple) linear regression. If the regression includes two or more independent variables, and the dependent variable depends linearly on them, it is called multiple linear regression. In two-dimensional space the linear model is a straight line; in three-dimensional space it is a plane; in higher-dimensional space it is a hyperplane ...

For a unary linear regression model, suppose we obtain n pairs of observations (x1, y1), (x2, y2), ..., (xn, yn) from the population. Infinitely many curves can be drawn through these n points in the plane, but we require the sample regression function to fit this set of values as well as possible; intuitively, the fitted line should sit in the most reasonable, central position among the sample data. The criterion for selecting the best-fitting line can be stated as: minimize the total fitting error (i.e., the total residual). Three criteria are available:

(1) Determine the line's position by minimizing the "sum of residuals". However, it quickly becomes apparent that positive and negative residuals cancel each other out in the sum.

(2) Determine the line's position by minimizing the "sum of absolute residuals". However, absolute values are cumbersome to work with analytically.

(3) The principle of least squares: determine the line's position by minimizing the "sum of squared residuals". Besides being easy to compute, the resulting estimators have good statistical properties. This method is, however, very sensitive to outliers.
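To make the cancellation problem in criterion (1) concrete, here is a tiny illustrative comparison of the three criteria (the residual values are made up for illustration):

```python
# Residuals +2 and -2 cancel in a plain sum, so criterion (1) cannot
# distinguish this poor fit from a perfect one; absolute values (2) and
# squares (3) both expose the error, and squares are easy to differentiate.
residuals = [2.0, -2.0, 1.0, -1.0]

plain_sum = sum(residuals)                    # criterion (1)
abs_sum = sum(abs(r) for r in residuals)      # criterion (2)
squared_sum = sum(r * r for r in residuals)   # criterion (3)

print(plain_sum, abs_sum, squared_sum)  # 0.0 6.0 10.0
```

The plain sum reports zero error for a line that clearly misses every point, which is exactly why criteria (2) and (3) were introduced.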

Ordinary least squares (OLS) is the most commonly used: choose the regression model that minimizes the sum of squared residuals over all observations (Q denotes the sum of squared residuals); that is, the squared loss function is used.

Sample regression model:

$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i$$

where $e_i$ is the error of the sample $(x_i, y_i)$.

Squared loss function:

$$Q = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$

We then determine the line by minimizing Q: treating Q as a function of the two parameters to be estimated, $\hat{\beta}_0$ and $\hat{\beta}_1$, this becomes an extremum problem that can be solved by differentiation. Take the partial derivatives of Q with respect to the two parameters:

$$\frac{\partial Q}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)$$

$$\frac{\partial Q}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) x_i$$

From calculus, we know that an extremum of a function occurs at a point where its partial derivatives are zero.

Setting both partial derivatives to zero and solving gives:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

This is the method of least squares: obtain the extremum point (here, the minimum) of the squared loss function.
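The closed-form solution above can be sketched in a few lines of Python. This is a minimal illustration, not from the original post; the data points are made up, roughly following y = 1 + 2x:

```python
def ols_fit(xs, ys):
    """Return (beta0, beta1) minimizing the sum of squared residuals Q."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Closed-form solution from setting the partial derivatives of Q to zero.
    beta1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / \
            (sum(x * x for x in xs) - n * x_bar * x_bar)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]  # noisy samples of roughly y = 1 + 2x
beta0, beta1 = ols_fit(xs, ys)
print(beta0, beta1)  # close to the true intercept 1 and slope 2
```

Note that the answer is computed in one pass with no iteration, which is the key contrast with gradient descent discussed below.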

**Four. Least squares and gradient descent method**

Both least squares and gradient descent find the minimum of the loss function by differentiation; so what is the difference between them?

Similarities

1. Essentially the same: both methods use the given known data (independent and dependent variables) to compute a general estimating function for the dependent variable, which is then used to estimate the dependent variable for new data.

2. The same goal: both work within the framework of the known data to make the total squared difference between the estimated values and the actual values as small as possible (**in fact, the square is not strictly necessary; a later post on gradient ascent covers logistic regression**). The formula for the total squared difference between the estimated and actual values is:

$$\sum_{i=1}^{n}\left(y_i - f(x_i, \theta)\right)^2$$

Here $x_i$ is the independent variable of the i-th data point, $y_i$ is the dependent variable of the i-th data point, and $\theta$ is the coefficient vector.

Differences

1. The implementation and the result differ: least squares finds the **global minimum** directly by differentiation; it is a non-iterative method. Gradient descent is an iterative method: it starts from an initial guess and repeatedly adjusts the parameters in the direction of steepest descent, reaching a **local minimum** after a number of iterations. The disadvantages of gradient descent are that convergence slows near the minimum point and that the result is sensitive to the choice of the initial point; most improvements to the method target these two aspects.
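The iterative alternative can be sketched as follows, on the same kind of made-up data; the learning rate, step count, and initial guess are illustrative choices, not from the original post:

```python
def gd_fit(xs, ys, lr=0.05, steps=5000):
    """Minimize Q = sum((y - b0 - b1*x)^2) by iterative gradient descent."""
    b0, b1 = 0.0, 0.0  # initial guess; gradient descent is sensitive to this choice
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of Q with respect to b0 and b1.
        g0 = -2 * sum(y - b0 - b1 * x for x, y in zip(xs, ys))
        g1 = -2 * sum((y - b0 - b1 * x) * x for x, y in zip(xs, ys))
        # Step in the direction of steepest descent (gradient averaged over n).
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]  # noisy samples of roughly y = 1 + 2x
print(gd_fit(xs, ys))  # converges toward the closed-form OLS estimates
```

For this convex quadratic loss the iterates approach the same minimum that the closed-form derivation yields in one step, but only after many iterations, which illustrates the trade-off described above.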

Further reading on least squares (shared from other bloggers)