As a machine learning beginner, my understanding is limited and this write-up surely contains mistakes; criticism and corrections are welcome.
1 Summary
This report is a summary of my understanding of the first four lectures of the Stanford machine learning course, together with the accompanying handouts. The first four lectures mainly cover the regression problem; regression is a supervised learning method. Its core idea is to fit a mathematical model to continuous statistical data and then use that model for prediction or classification. The data handled by this method can be multidimensional.
The handout first introduces a basic problem, then develops the linear regression solution, and then gives a probabilistic explanation of the squared-error criterion. Logistic regression is introduced next. Finally, the discussion rises to the theoretical level and proposes the general linear model.
2 Problem Introduction
This example comes from http://www.cnblogs.com/LeftNotEasy/archive/2010/12/05/mathmatic_in_machine_learning_1_regression_and_gradient_descent.html
Suppose we have house sales data as follows:
Area (m^2) | Sales price (10,000 yuan)
123        | 250
150        | 320
87         | 160
102        | 220
...        | ...
This table is similar to housing prices around Beijing's Fifth Ring Road. We can plot the data, with the x-axis being the area of the house and the y-axis being the sales price, as follows:
Now suppose we are given a new area for which we have no sales record. What do we do?
We can fit a curve to the data as accurately as possible; then, for a new input, we return the value of the corresponding point on the curve. If we fit with a straight line, it might look like this:
The green dots are the points we want to predict.
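As a minimal sketch of such a straight-line fit (assuming numpy is available; the data values come from the table above, and the query area of 130 m^2 is just an illustration):

    import numpy as np

    # Sample data from the table: area (m^2) and sales price (10,000 yuan)
    area = np.array([123.0, 150.0, 87.0, 102.0])
    price = np.array([250.0, 320.0, 160.0, 220.0])

    # Fit a straight line price = slope * area + intercept by least squares
    slope, intercept = np.polyfit(area, price, deg=1)

    # Predict the price for a new, unseen area
    new_area = 130.0
    print(slope * new_area + intercept)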
First, some concepts and commonly used symbols are given.
House sales record table: the training set or training data, the input data of our process, commonly denoted X
House sales price: the output data, commonly denoted y
Fitted function (also called the hypothesis or model): generally written y = h(x)
Number of entries in the training data (#training set): one training example consists of a pair of input and output data
Dimension of the input data: n (the number of features, #features)
In this example the feature is two-dimensional and the result is one-dimensional. However, regression methods can handle problems where the features are multidimensional and the result is a one-dimensional discrete value or a one-dimensional continuous value.
3 Learning process
Here is a typical machine learning process: first we are given input data; our algorithm then goes through a series of steps to obtain an estimated function; this function is able to produce estimates for new data it has not seen, which is also called building a model. The linear regression function above is an example.
4 Linear regression
Linear regression assumes that the features and the result satisfy a linear relationship. The linear relationship is actually quite expressive: the effect of each feature on the result is captured by the parameter in front of it, and each feature variable can first be mapped through a function before taking part in the linear combination, so nonlinear relationships between features and the result can also be expressed.
We use x1, x2, ..., xn to describe the components of the feature, for example x1 = the area of the house, x2 = the orientation of the house, and so on. We can then write an estimate function:

h(x) = h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
Here θ is called the parameter; it adjusts the influence of each component of the feature, that is, whether the area of the house or the location of the house matters more. If we let x_0 = 1, we can write this in vector form:

h_θ(x) = θ^T x
Our program also needs a mechanism to evaluate whether a given θ is good, so we need a way to evaluate our h function. This function is called the loss function or error function; it describes how bad the h function is. Below we call this function the J function.
Here we can define the error function as follows:

J(θ) = (1/2) Σ_{i=1..m} ( h_θ(x^(i)) − y^(i) )^2

This error function takes the sum of the squared differences between the estimated value h_θ(x^(i)) and the true value y^(i); the 1/2 in front is there so that the coefficient cancels when we take the derivative.
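As a small sketch of these two definitions (assuming numpy; the variable names and the toy data, taken from the table above, are illustrative):

    import numpy as np

    def h(theta, x):
        # Hypothesis: h_theta(x) = theta^T x, where x[0] == 1
        return theta @ x

    def J(theta, X, y):
        # Squared-error cost: J(theta) = 1/2 * sum_i (h_theta(x_i) - y_i)^2
        residuals = X @ theta - y
        return 0.5 * residuals @ residuals

    # Each row of X is [1, area]; y is the price in 10,000 yuan
    X = np.array([[1.0, 123.0], [1.0, 150.0], [1.0, 87.0], [1.0, 102.0]])
    y = np.array([250.0, 320.0, 160.0, 220.0])
    print(J(np.zeros(2), X, y))  # cost of the all-zero parameter vector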
As for why the squared error is chosen as the error function, the handout later explains the source of this formula from the perspective of probability distributions.
There are many ways to adjust θ so that J(θ) reaches its minimum, among them least squares, which is a purely mathematical method, and the gradient descent method.
5 Gradient descent method
After the linear regression model is chosen, the model can be used for prediction only once the parameter θ is determined, and θ should be determined so that J(θ) is smallest. So the problem boils down to finding a minimum, for which the gradient descent method is used. The biggest problem with gradient descent is that it may only find a local minimum, which depends on the choice of the initial point.
The gradient descent method is carried out according to the following process:
1) First assign a value to θ; this can be random, or θ can be initialized to an all-zero vector.
2) Change the value of θ so that J(θ) decreases along the direction of gradient descent.
The gradient direction is given by the partial derivatives of J(θ) with respect to θ; since we want a minimum, we move in the direction opposite to the gradient. For a single training example the resulting update rule is

θ_j := θ_j + α ( y^(i) − h_θ(x^(i)) ) x_j^(i)

where α is the learning rate.
There are two ways to perform the iterative update. One is batch gradient descent, which updates θ only after computing the error over the entire training set; the other is incremental (stochastic) gradient descent, which updates θ after scanning each training example. The former converges steadily, while the latter may keep hovering around the point of convergence.
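A minimal sketch of both update styles (assuming numpy and the X, y from the earlier snippet; the learning rate and iteration counts are hand-picked illustrations, not tuned values):

    import numpy as np

    def batch_gradient_descent(X, y, alpha=1e-5, iters=1000):
        # Update theta once per pass over the entire training set
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            gradient = X.T @ (X @ theta - y)  # dJ/dtheta summed over all examples
            theta -= alpha * gradient
        return theta

    def stochastic_gradient_descent(X, y, alpha=1e-5, epochs=1000):
        # Update theta after scanning each individual training example
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                theta += alpha * (y_i - x_i @ theta) * x_i
        return theta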
In general, the convergence rate of the gradient descent method is relatively slow.
Another way to compute the result directly is the least squares method.
6 Least squares
If the training features are arranged as a matrix X (one example per row) and the results as a vector y, with the linear regression model and the error function unchanged, then θ can be derived directly from the following formula:

θ = (X^T X)^{−1} X^T y
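A sketch of this formula in numpy (solving the linear system instead of forming the inverse explicitly, which is the usual numerically safer choice; X and y as in the earlier snippets):

    import numpy as np

    def normal_equation(X, y):
        # Solve (X^T X) theta = X^T y rather than inverting X^T X
        return np.linalg.solve(X.T @ X, X.T @ y)

    X = np.array([[1.0, 123.0], [1.0, 150.0], [1.0, 87.0], [1.0, 102.0]])
    y = np.array([250.0, 320.0, 160.0, 220.0])
    print(normal_equation(X, y))  # [intercept, slope]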
However, this method requires X to have full column rank (so that X^T X is invertible), and computing the matrix inverse is relatively slow.
7 Probability interpretation of using the sum of squares as the error function
Assume that the prediction computed from the features and the actual result differ by an error term, so that the predicted and real results satisfy:

y^(i) = θ^T x^(i) + ε^(i)

In general, the error ε^(i) is assumed to satisfy a Gaussian distribution with mean 0, i.e. a normal distribution. The conditional probability of y given x is then

p( y^(i) | x^(i); θ ) = (1 / (√(2π) σ)) exp( −( y^(i) − θ^T x^(i) )^2 / (2σ^2) )
This gives the probability of the result for a single sample, but we want the model to be as accurate as possible over all the samples, i.e. to maximize the product of these probabilities. Note that this product is a product of probability density values; the probability density function of a continuous variable is different from the probability mass function of a discrete one. This product, viewed as a function of θ, is the likelihood, and choosing θ to maximize it is maximum likelihood estimation. Taking the logarithm and maximizing (sketched below) shows that this is the same as minimizing the sum of squared errors.
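A sketch of the standard derivation (assuming the errors ε^(i) are independent and Gaussian with mean 0 and variance σ^2, as above):

    \ell(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
                       \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
                 = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
                   - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2

Maximizing \ell(\theta) over θ therefore amounts to minimizing (1/2) Σ_i ( y^(i) − θ^T x^(i) )^2, which is exactly the J(θ) defined earlier.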
This explains why the error function uses the sum of squares.
Of course, some assumptions are made in the derivation, but these assumptions accord with reality.
8 Linear regression with weights
In the error function of the linear regression above, every coefficient is 1; there are no weights. Weighted linear regression adds weight information.
The basic assumption is that θ is chosen to minimize the weighted error

Σ_i w^(i) ( y^(i) − θ^T x^(i) )^2

where the weights satisfy the formula

w^(i) = exp( −( x^(i) − x )^2 / (2τ^2) )

Here x is the point whose value is to be predicted: the closer a sample is to x, the larger its weight, and the farther away, the smaller. This formula looks similar to a Gaussian distribution, but it is not one, because w^(i) is not the density of a random variable.
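A minimal sketch of a prediction with this method (numpy assumed; tau is the bandwidth parameter, and the weighted normal equation X^T W X θ = X^T W y used here follows from minimizing the weighted error above):

    import numpy as np

    def lwr_predict(X, y, x_query, tau=10.0):
        # Weight each sample by its closeness to the query point
        # (x[0] is the intercept term 1, x[1] is the raw feature)
        diffs = X[:, 1] - x_query[1]
        w = np.exp(-diffs**2 / (2.0 * tau**2))
        # Solve the weighted normal equation: X^T W X theta = X^T W y
        XtW = X.T * w
        theta = np.linalg.solve(XtW @ X, XtW @ y)
        return theta @ x_query

    X = np.array([[1.0, 123.0], [1.0, 150.0], [1.0, 87.0], [1.0, 102.0]])
    y = np.array([250.0, 320.0, 160.0, 220.0])
    print(lwr_predict(X, y, np.array([1.0, 130.0])))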
Because the error function varies with the value to be predicted, θ cannot be determined beforehand; it must be computed anew for each prediction. This makes the method a non-parametric learning algorithm, and it feels similar to KNN.
9 Classification and logistic regression
In general, regression is not applied to classification problems, because regression is a continuous model and is rather heavily affected by noise. If you insist on applying it to classification, logistic regression can be used.
Logistic regression is essentially linear regression with a function mapping added between the linear combination of the features and the result: the features are summed linearly, and then the function g(z) is applied as the hypothesis to make the prediction. g(z) can map a continuous value into the interval (0, 1).
The hypothesis function of logistic regression is as follows:

h_θ(x) = g(θ^T x) = 1 / (1 + e^{−θ^T x})

whereas the linear regression hypothesis is simply h_θ(x) = θ^T x.
Logistic regression is used for the 0/1 classification problem, i.e. binary classification where the predicted result is 0 or 1. It assumes that the binary outcome satisfies a Bernoulli distribution, i.e.

P( y = 1 | x; θ ) = h_θ(x)
P( y = 0 | x; θ ) = 1 − h_θ(x)
Of course, assuming that it satisfies a Poisson distribution, an exponential distribution, and so on would also work, just more complicated; the general form of such models is discussed below under the general linear model.
As in section 7, we form the maximum likelihood estimate and take the derivative; the resulting iterative update formula is

θ_j := θ_j + α ( y^(i) − h_θ(x^(i)) ) x_j^(i)
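A small sketch of this update as per-example gradient ascent on the log-likelihood (numpy assumed; the labels must be 0/1, and alpha, the epoch count, and the toy data are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_sgd(X, y, alpha=0.1, epochs=100):
        # theta_j := theta_j + alpha * (y_i - h_theta(x_i)) * x_ij
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                theta += alpha * (y_i - sigmoid(x_i @ theta)) * x_i
        return theta

    # Toy binary data: each row of X is [1, feature]; labels are 0 or 1
    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 1.0])
    print(sigmoid(X @ logistic_sgd(X, y)))  # predicted probabilities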
You can see that this looks just like the linear regression update; the difference is that here the hypothesis is mapped through g(z).
10 Newton's method for maximum likelihood estimation
Sections 7 and 9 solved the maximum likelihood estimate by derivative-based iteration. Newton's method is introduced here to make the result converge quickly.
When we need to solve f(θ) = 0 and f is differentiable, we can use the iteration formula

θ := θ − f(θ) / f′(θ)

to iteratively approach the solution.
Applied to maximum likelihood estimation, the problem becomes solving for a zero of the derivative of the log-likelihood. So the iteration formula is written

θ := θ − ℓ′(θ) / ℓ″(θ)
When θ is a vector, the Newton update can be expressed as

θ := θ − H^{−1} ∇_θ ℓ(θ)

where H is the n×n Hessian matrix with entries H_ij = ∂²ℓ(θ) / ∂θ_i ∂θ_j.
Although Newton's method converges quickly, computing the inverse of the Hessian matrix at every step is time-consuming.
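A sketch of the vector Newton update applied to the logistic regression log-likelihood (numpy assumed; the gradient X^T(y − h) and Hessian −X^T diag(h(1−h)) X are the standard expressions for this model, and the toy data is chosen to be non-separable so the iteration stays well-behaved):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_newton(X, y, iters=10):
        theta = np.zeros(X.shape[1])
        for _ in range(iters):
            p = sigmoid(X @ theta)
            grad = X.T @ (y - p)              # gradient of the log-likelihood
            H = -(X.T * (p * (1 - p))) @ X    # Hessian of the log-likelihood
            theta = theta - np.linalg.solve(H, grad)  # theta := theta - H^{-1} grad
        return theta

    X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 0.5],
                  [1.0, 1.0], [1.0, 2.0]])
    y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
    print(logistic_newton(X, y))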
Newton's method converges fastest when the initial point x0 is near the minimum point x*. But when x0 is far from the minimum, Newton's method may fail to converge, and even a decrease cannot be guaranteed; the reason is that the iterate x_{k+1} is not necessarily a minimum point of the objective function f along the Newton direction.
11 General linear model
The reason logistic regression uses the formula g(z) = 1 / (1 + e^{−z}) is supported by a body of theory: the general linear model.
First, if a probability distribution can be expressed in the form

p(y; η) = b(y) exp( η^T T(y) − a(η) )

then this probability distribution is called an exponential family distribution.
The Bernoulli distribution, Gaussian distribution, Poisson distribution, beta distribution, and Dirichlet distribution all belong to the exponential family.
Logistic regression uses the Bernoulli distribution, whose probability can be expressed as

p(y; φ) = φ^y (1 − φ)^{1−y} = exp( y log(φ / (1 − φ)) + log(1 − φ) )

in which the natural parameter is

η = log( φ / (1 − φ) )

Solving for φ gives

φ = 1 / (1 + e^{−η})

This explains why logistic regression uses that sigmoid function.
The main points of the general linear model are:
1) y, given x and parameterized by θ, satisfies an exponential family distribution with natural parameter η; from this its probability expression can be obtained.
2) Given x, our goal is to predict the expected value of T(y); in most cases T(y) = y, so what we actually want is h(x) = E[y|x]. (In logistic regression the expected value is φ, so h is φ; in linear regression the expected value is μ, and in the Gaussian distribution η = μ, so in linear regression h = θ^T x.)
3) The natural parameter and the input are linearly related: η = θ^T x.
Softmax regression
Finally, an example of using a general linear model is given.
Assume the predicted value y takes k possible values, i.e. y ∈ {1, 2, ..., k}.
For example, with k = 3, an incoming message can be classified into the three categories spam, personal mail, or work mail.
Define

φ_i = p(y = i; φ) for i = 1, ..., k, with φ_k = 1 − Σ_{i=1..k−1} φ_i

Then, writing the distribution in exponential family form as in the Bernoulli case, the natural parameters come out as η_i = log( φ_i / φ_k ), and solving back for the probabilities gives the softmax function

φ_i = e^{η_i} / Σ_{j=1..k} e^{η_j}

Such that, with the linearity assumption η_i = θ_i^T x, the hypothesis outputs the probability of each class:

h_θ(x)_i = e^{θ_i^T x} / Σ_{j=1..k} e^{θ_j^T x}
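A minimal sketch of this hypothesis (numpy assumed; Theta stores one parameter vector per class as its rows, and subtracting the maximum before exponentiating is the usual numerical-stability trick):

    import numpy as np

    def softmax(eta):
        # Subtract the max for numerical stability; the result is unchanged
        e = np.exp(eta - np.max(eta))
        return e / e.sum()

    def h(Theta, x):
        # Class probabilities: phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x)
        return softmax(Theta @ x)

    # Toy example: k = 3 classes (spam / personal / work), x = [1, feature]
    Theta = np.array([[0.5, -1.0], [0.0, 0.2], [-0.5, 0.8]])
    x = np.array([1.0, 2.0])
    print(h(Theta, x))  # three probabilities summing to 1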