Thanks to the original blog author, whose notes are excellent; I am reposting them here directly with some additions. http://www.cnblogs.com/fanyabo/p/4060498.html
First, Introduction
This material references Andrew Ng's machine learning course (http://cs229.stanford.edu) and the Stanford unsupervised learning UFLDL Tutorial (http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial).
The regression problem in machine learning belongs to supervised learning. The goal of a regression problem is: given a d-dimensional input variable x, where each input vector x has a corresponding target value y, predict the continuous target value for new data. For example, suppose we have a data set of the living area and price of 47 houses:
Training set: the set of examples with known answers, i.e., the data set
m: number of training examples = 47
x: input variable (feature vector)
y: output variable (target value)
(x, y): one training example; (x^(i), y^(i)): the i-th training example; training set → learning algorithm → hypothesis function h
We can plot this data set in MATLAB as a scatter plot of area against price:
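As a rough sketch of the same plot in Python (the data values below are illustrative placeholders, not the original 47-house data set):

```python
import matplotlib.pyplot as plt

# Illustrative living area (ft^2) and price (in 1000's) values, not the real data set
area  = [2104, 1600, 2400, 1416, 3000]
price = [400, 330, 369, 232, 540]

plt.scatter(area, price, marker='x')
plt.xlabel('living area')
plt.ylabel('price')
plt.show()
```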
We can then fit a curve to these points and use it to predict the value at new points. If the value to be predicted is continuous, such as the price above, it is a regression problem; if the value to be predicted is discrete, i.e., a label such as 0/1, it is a classification problem. The learning process is: training set → learning algorithm → hypothesis h, which maps a new input x to a predicted value y.
Second, linear regression model
The linear regression model assumes that the input features and the corresponding result satisfy a linear relationship. Adding one more dimension to the data set above, the number of bedrooms, the data set becomes:
Thus the input feature x is a two-dimensional vector; for example, x_1^(i) is the living area of the i-th house in the data set, and x_2^(i) is its number of bedrooms. We can then assume that the input feature x and the house price y satisfy a linear function, for example:

h_θ(x) = θ_0 + θ_1 x_1 + θ_2 x_2

In all such problems, what we ultimately have to find are the parameters θ_i. The θ_i are the parameters of the hypothesis, the linear function h that maps the input features x to the result y. To simplify the notation, we add x_0 = 1 to the input features, so that:

h_θ(x) = Σ_{i=0}^{n} θ_i x_i = θ^T x
Now, given a training set, how do we learn the parameters θ, and how do we judge whether the linear function fits well? An intuitive idea is to make the predicted value h(x) as close to y as possible. For this purpose we define a cost function of θ that measures how close h(x^(i)) is to the corresponding y^(i):

J(θ) = (1/2) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))^2

The smaller the cost function, the better the linear regression fits the training set; the minimum value 0 means a perfect fit. Here x^(i) is the i-th input vector, y^(i) the i-th target value, h_θ the hypothesis function, and m the number of training examples. J(θ) is also called the squared error function.
The 1/2 in front is there so that the constant coefficient cancels when taking the derivative. Our goal is therefore to adjust θ so that the cost function J(θ) is minimized; available methods include gradient descent, least squares, and so on.
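As a minimal sketch (assuming NumPy arrays: X of shape (m, n+1) with a leading column of ones for x_0, and y of length m), the cost function can be computed as:

```python
import numpy as np

def compute_cost(X, theta, y):
    """J(theta) = 1/2 * sum_i (h_theta(x(i)) - y(i))^2, with h_theta(x) = theta^T x."""
    residual = X @ theta - y
    return 0.5 * np.dot(residual, residual)
```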
2.1 Gradient descent method (→ minimum value)
The idea of gradient descent: picture the function as a mountain. We stand on a hillside, look around, and ask from which direction a small step downward will descend fastest. Of course there are many ways to solve the problem; gradient descent is only one of them, and another method is called the normal equation.
Method: (1) first choose the step size for each step, which we call the learning rate α; (2) pick an arbitrary initial value for θ; (3) determine a downhill direction, step down by the chosen step size, and update θ; (4) stop when the decrease per step is smaller than some predefined value.
Batch gradient descent: all of the data is used at each step, moving toward the optimum every time, so the hypothesis function h keeps changing until J(θ) stops decreasing, indicating that a local optimum has been reached.
Properties: (1) different initial points can lead to different minima, so gradient descent only finds a local minimum; (2) the closer to the minimum, the slower the descent, because the slope gets smaller and smaller, so α times the derivative produces smaller and smaller "baby steps". The cost function is often drawn as a contour plot, and the local optimum is the central point.
Q: What happens if the initial value is already at a local minimum? A: Since it is already at a local minimum, the derivative is 0, so θ will not change. Q: How should α be chosen? A: Keep watching the value of J(θ): if the cost function keeps getting smaller, the value is fine; otherwise take a smaller α. Too large an α causes the "overshoot the minimum" phenomenon, where the updates repeatedly jump past the minimum.
The gradient direction is given by the partial derivative of J(θ) with respect to θ; since we want a minimum, we move along the negative of the partial derivative. Substituting J(θ) and differentiating gives the update formula for a single training example:

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)
This update rule is called the LMS update rule (least mean squares), also known as the Widrow-Hoff learning rule.
The algorithm that updates the parameters using all training examples is:

repeat until convergence: θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i)   (for every j)
Since all samples of the training set are examined at each iteration, this is called batch gradient descent. Running this algorithm on the house-price data set from the introduction gives θ_0 = 71.27 and θ_1 = 1.1345, and plotting h_θ(x) gives the fitted line through the data points.
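A minimal batch gradient descent sketch (again assuming NumPy arrays X with a leading column of ones and y, and a hand-picked learning rate alpha):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression: every update uses all m examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        error = y - X @ theta                  # y(i) - h_theta(x(i)) for all i
        theta = theta + alpha * (X.T @ error)  # theta_j += alpha * sum_i error_i * x_j(i)
    return theta
```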
If instead the parameters are updated from one training example at a time, the algorithm is:

loop over i = 1, …, m: θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)   (for every j)
Here the value of θ is updated from a single training example at a time, which is called stochastic gradient descent. Comparing the two gradient descent algorithms: batch gradient descent considers the whole data set at every step, so each step is relatively expensive, while stochastic gradient descent converges faster; in practice, the J(θ) reached by either version is usually close to the true minimum. Therefore, for larger data sets the more efficient stochastic gradient descent is generally used.
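A matching stochastic gradient descent sketch (same assumed X and y as above):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=10):
    """Stochastic gradient descent: theta is updated after each individual example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        for i in range(m):
            error = y[i] - X[i] @ theta        # y(i) - h_theta(x(i))
            theta = theta + alpha * error * X[i]
    return theta
```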
2.2 Least squares (LMS)
The gradient descent algorithm gives one way to compute θ, but the iterative process is relatively time-consuming and not very intuitive. The least squares method described below is a direct algorithm that uses matrix operations to obtain the value of θ. To understand least squares, first look at some matrix operations:
Suppose the function f maps an m×n matrix to a real number, that is, f : R^{m×n} → R. For a matrix A, the gradient of f(A) with respect to A is defined as the matrix of partial derivatives:

∇_A f(A) = [ ∂f/∂A_11 … ∂f/∂A_1n ; … ; ∂f/∂A_m1 … ∂f/∂A_mn ]
So the gradient is itself an m×n matrix. For example, for the 2×2 matrix A = [ A_11 A_12 ; A_21 A_22 ], with the mapping function f(A) = 1.5 A_11 + 5 A_12^2 + A_21 A_22, the gradient is:

∇_A f(A) = [ 1.5  10 A_12 ; A_22  A_21 ]
In addition, for the gradient of the trace of a matrix (tr A denotes the sum of the diagonal entries), the following rules hold:

∇_A tr(AB) = B^T
∇_{A^T} f(A) = (∇_A f(A))^T
∇_A tr(A B A^T C) = C A B + C^T A B^T
∇_A |A| = |A| (A^{-1})^T
Below we express the training inputs x and the corresponding results y in matrix/vector form:

X = [ (x^(1))^T ; (x^(2))^T ; … ; (x^(m))^T ],   y = [ y^(1) ; y^(2) ; … ; y^(m) ]

Since the prediction model gives h_θ(x^(i)) = (x^(i))^T θ, it is easy to get:

Xθ − y = [ h_θ(x^(1)) − y^(1) ; … ; h_θ(x^(m)) − y^(m) ]

Therefore the cost function can be written in matrix form as J(θ) = (1/2)(Xθ − y)^T (Xθ − y).
Using this matrix form of the cost function J(θ) and the matrix-gradient rules above, the gradient is:

∇_θ J(θ) = X^T X θ − X^T y

Setting this gradient to 0 gives the normal equation X^T X θ = X^T y, and hence the value of θ:

θ = (X^T X)^{-1} X^T y

This is the value of the parameters of the hypothesis model obtained by the least squares method.
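A minimal sketch of this closed-form solution (same assumed X and y; np.linalg.solve is used rather than an explicit matrix inverse for numerical stability):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: solves X^T X theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```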
2.3 Weighted linear regression
First consider a few curve-fitting cases:
The leftmost graph uses a linear fit, but the data points do not lie exactly on a straight line, so the fit is not very good. If we add an x^2 term, we get the middle graph, where the quadratic curve fits the points better. If we keep adding higher-order terms, we get the fit shown in the rightmost graph, which passes through every data point; that curve is a 5th-order polynomial, but we all know that such a curve is "too perfect" and is unlikely to predict new data well. The leftmost curve is what we call underfitting: the feature set is too small and the model is too simple to express the structure of the data. The rightmost curve is what we call overfitting: the feature set is too large and the model is overly complex.
As the above example shows, the choice of features has a great impact on the performance of the final learned model; deciding which features to use and how important each one is leads to weighted linear regression. In traditional linear regression, the learning procedure is:

fit θ to minimize Σ_i (y^(i) − θ^T x^(i))^2, then output θ^T x
In weighted linear regression, the learning procedure is:

fit θ to minimize Σ_i w^(i) (y^(i) − θ^T x^(i))^2, then output θ^T x
The difference between the two is that each training example is given a different non-negative weight w^(i); the larger the weight, the larger that example's contribution to the cost function. The weights are usually chosen as:

w^(i) = exp( −(x^(i) − x)^2 / (2τ^2) )
where x is the point at which we want to predict: training examples closer to x get larger weights, while examples farther away have less influence (the bandwidth τ controls how fast the weight falls off).
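A minimal locally weighted prediction sketch (assuming the same X and y, a query point x_query that also starts with a leading 1, and a hand-picked bandwidth tau):

```python
import numpy as np

def weighted_linear_regression_predict(X, y, x_query, tau=1.0):
    """Predict at x_query with weighted linear regression.
    Weights: w(i) = exp(-||x(i) - x_query||^2 / (2 tau^2))."""
    diffs = X - x_query
    w = np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: theta = (X^T W X)^(-1) X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```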
Third, Logistic regression and Softmax regression
3.1 Logistic regression
Logistic regression is described next. Although it is called regression, logistic regression is actually used for classification problems. It is essentially a linear regression model with one extra layer of function mapping applied to the continuous regression value: the features are first combined linearly, and then g(z) maps the continuous value to a discrete label (with the sigmoid function the two classes are 0/1; with the hyperbolic tangent tanh they are 1/−1). The hypothesis model used is:

h_θ(x) = g(θ^T x) = 1 / (1 + e^{−θ^T x})
where the sigmoid function g(z) is:

g(z) = 1 / (1 + e^{−z})
As z → −∞, g(z) → 0, and as z → ∞, g(z) → 1, which achieves the goal of classification. Its graph is an S-shaped curve running from 0 to 1.
So how do we fit the parameters θ of such a logistic model? We assume that

P(y = 1 | x; θ) = h_θ(x),   P(y = 0 | x; θ) = 1 − h_θ(x)

Since this is a two-class problem, this can be written compactly as p(y | x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y}, so the likelihood is:

L(θ) = Π_{i=1}^{m} p(y^(i) | x^(i); θ) = Π_{i=1}^{m} (h_θ(x^(i)))^{y^(i)} (1 − h_θ(x^(i)))^{1−y^(i)}
Taking the logarithm of the likelihood makes it easier to work with:

ℓ(θ) = log L(θ) = Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]
The next step is to maximize this log-likelihood over θ. Proceeding as with the gradient method described above (here, gradient ascent), the derivative for a single example is ∂ℓ(θ)/∂θ_j = (y − h_θ(x)) x_j, which yields the similar update formula:

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

Although this update rule looks like the LMS formula, the two are different algorithms, because here h_θ(x^(i)) is a nonlinear function of θ^T x^(i).
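A minimal logistic regression sketch using batch gradient ascent on the log-likelihood (same assumed X, with labels y in {0, 1}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient ascent: theta_j += alpha * sum_i (y(i) - h_theta(x(i))) * x_j(i)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        error = y - sigmoid(X @ theta)
        theta = theta + alpha * (X.T @ error)
    return theta
```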
3.2 Softmax Regression
Logistic regression is an algorithm for two-class problems; what if the target takes multiple discrete values? The Softmax regression model solves this problem: it is the generalization of the logistic model to multi-class classification. In Softmax regression, the class label y can take k different values (k > 2), so y^(i) ∈ {1, 2, …, k}.
For a given input x, we use the hypothesis model to estimate the probability p(y = j | x) for each class j. The hypothesis function h_θ(x^(i)) takes the form:

h_θ(x^(i)) = [ p(y^(i) = 1 | x^(i); θ), …, p(y^(i) = k | x^(i); θ) ]^T = (1 / Σ_{j=1}^{k} e^{θ_j^T x^(i)}) [ e^{θ_1^T x^(i)}, …, e^{θ_k^T x^(i)} ]^T
where θ_1, θ_2, …, θ_k are the parameters of the model, and the factor in front normalizes the entries so that they form a probability distribution summing to 1. Generalizing the logistic cost function in the same way, the new cost function is:

J(θ) = −(1/m) [ Σ_{i=1}^{m} Σ_{j=1}^{k} 1{y^(i) = j} log( e^{θ_j^T x^(i)} / Σ_{l=1}^{k} e^{θ_l^T x^(i)} ) ]
You can see that the Softmax cost function is very similar to the logistic cost function, except that the Softmax version sums over all k possible classes; in Softmax the probability of assigning x to class j is:

p(y^(i) = j | x^(i); θ) = e^{θ_j^T x^(i)} / Σ_{l=1}^{k} e^{θ_l^T x^(i)}
Thus, to minimize the Softmax cost function J(θ) with gradient descent, the gradient formula is:

∇_{θ_j} J(θ) = −(1/m) Σ_{i=1}^{m} [ x^(i) ( 1{y^(i) = j} − p(y^(i) = j | x^(i); θ) ) ]

which is the partial derivative of J(θ) with respect to the j-th parameter vector θ_j; each iteration then updates θ_j := θ_j − α ∇_{θ_j} J(θ).
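A minimal Softmax regression sketch using this gradient (assuming X as before and integer labels y in {0, …, k−1} rather than {1, …, k}, which only shifts the indexing):

```python
import numpy as np

def softmax_probs(Theta, X):
    """Row i holds p(y = j | x(i)) for every class j; Theta is (k, n+1), X is (m, n+1)."""
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def softmax_regression(X, y, k, alpha=0.01, num_iters=1000):
    """Gradient descent on the Softmax cost J(theta)."""
    m, n = X.shape
    Theta = np.zeros((k, n))
    Y = np.eye(k)[y]                               # one-hot encoding of the indicator 1{y(i) = j}
    for _ in range(num_iters):
        P = softmax_probs(Theta, X)
        grad = -(1.0 / m) * (Y - P).T @ X          # gradient with respect to each theta_j
        Theta = Theta - alpha * grad
    return Theta
```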
3.3 Softmax regression vs Logistic regression
In particular, when k = 2, Softmax regression degenerates into logistic regression. With k = 2, the Softmax hypothesis is:

h_θ(x) = (1 / (e^{θ_1^T x} + e^{θ_2^T x})) [ e^{θ_1^T x}, e^{θ_2^T x} ]^T
Using the fact that the Softmax parameters are over-determined, we set ψ = θ_1 and subtract it from both parameter vectors, obtaining:

h_θ(x) = [ 1 / (1 + e^{(θ_2 − θ_1)^T x}), e^{(θ_2 − θ_1)^T x} / (1 + e^{(θ_2 − θ_1)^T x}) ]^T

that is, with θ' = θ_2 − θ_1, one class has probability 1 / (1 + e^{θ'^T x}) and the other is 1 minus that.
Therefore, the probability form of the Softmax regression prediction is the same as that of logistic regression in two categories.
Now, if we have a k-class task, we can choose Softmax regression, or we can choose k independent logistic regression classifiers; how should we choose?
The choice depends on whether the k classes are mutually exclusive. For example, if the four movie categories are Hollywood films, Hong Kong and Taiwan films, Japanese and Korean films, and mainland films, and each training film gets exactly one label, then Softmax regression with k = 4 should be chosen. However, if the four movie categories are action, comedy, romance, and European/American, these categories are not mutually exclusive, so in this case it is more reasonable to use 4 logistic regression classifiers.
Fourth, The general linear model
First, a general exponential family of probability distributions is defined:

p(y; η) = b(y) exp( η^T T(y) − a(η) )
Considering the Bernoulli distribution, we have:

p(y; φ) = φ^y (1 − φ)^{1−y} = exp( y log(φ / (1 − φ)) + log(1 − φ) )

so η = log(φ / (1 − φ)) (hence φ = 1 / (1 + e^{−η})), T(y) = y, a(η) = −log(1 − φ) = log(1 + e^η), and b(y) = 1.
Consider the Gaussian distribution again (with unit variance):

p(y; μ) = (1/√(2π)) exp( −(y − μ)^2 / 2 ) = (1/√(2π)) exp( −y^2 / 2 ) · exp( μ y − μ^2 / 2 )

so η = μ, T(y) = y, a(η) = μ^2 / 2 = η^2 / 2, and b(y) = (1/√(2π)) exp( −y^2 / 2 ).
The general linear model is built on three assumptions: 1. y | x; θ follows an exponential family distribution ExponentialFamily(η); 2. given features x, the prediction is h(x) = E[T(y) | x] (usually T(y) = y, so h(x) = E[y | x]); 3. the natural parameter is linear in the input: η = θ^T x.
For the linear regression model of Part Two, we assume the result y follows a Gaussian distribution N(μ, σ^2), whose expectation is μ = η, so:

h_θ(x) = E[y | x; θ] = μ = η = θ^T x

Clearly, the hypothesis model of Part Two follows directly from the general linear model point of view.
For the logistic model, since the result falls into one of two classes, it is natural to think of the Bernoulli distribution; then y | x; θ follows Bernoulli(φ) with E[y | x; θ] = φ, and from the exponential family form above φ = 1 / (1 + e^{−η}), so:

h_θ(x) = E[y | x; θ] = φ = 1 / (1 + e^{−η}) = 1 / (1 + e^{−θ^T x})

This recovers the formula of the logistic hypothesis model, which explains why logistic regression uses the sigmoid function.