1. Linear regression:
2. Multivariate linear regression:
The hypothesis takes the following form:

    h_θ(x) = θ_0 x_0 + θ_1 x_1 + ... + θ_n x_n = θ^T x   (with x_0 = 1)

so the parameters are θ_0, θ_1, ..., θ_n. The cost function is then:

    J(θ) = (1 / 2m) * Σ_{i=1..m} (h_θ(x^(i)) − y^(i))^2

Notation:
- n: number of features
- m: number of training examples
- x^(i): the ith training example
- x_j^(i): feature j in the ith training example

To minimize the cost function, we can still use the previously mentioned
gradient descent algorithm to solve for the parameters. Gradient descent:

    repeat {
        θ_j := θ_j − α * ∂J(θ)/∂θ_j
    } (simultaneously update for every j = 0, ..., n)

Expanding the partial derivative term gives:

    repeat {
        θ_j := θ_j − α * (1/m) * Σ_{i=1..m} (h_θ(x^(i)) − y^(i)) * x_j^(i)
    } (simultaneously update for every j = 0, ..., n)
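The update rule above can be sketched in NumPy as follows. This is a minimal sketch: the function names, the toy data, and the chosen `alpha` are illustrative, not from the original notes.

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum((h(x_i) - y_i)^2), vectorized."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def gradient_descent(X, y, theta, alpha, num_iters):
    """Simultaneous update of every theta_j:
    theta_j := theta_j - alpha * (1/m) * sum((h(x_i) - y_i) * x_ij)."""
    m = len(y)
    for _ in range(num_iters):
        gradient = (X.T @ (X @ theta - y)) / m  # all partial derivatives at once
        theta = theta - alpha * gradient        # one simultaneous update
    return theta

# Toy data satisfying y = 1 + 2*x1; first column of X is x0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=2000)
```

Note that `X.T @ (X @ theta - y)` computes the partial derivatives for all j in one matrix product, which is why the update is simultaneous by construction.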
At this point, gradient descent is slightly more involved, but using the
feature scaling method (ensuring that the values of different features lie in a similar range) can make gradient descent converge faster:
Because features can differ greatly in scale, the contours of the cost function may be stretched along one axis, as in Figure 1; gradient descent then converges noticeably more slowly than in Figure 2. Rescaling removes this effect: bring all feature values into a similar range, such as [−1, 1] or [0, 1] (the range should be neither too small nor too large). Mean normalization is one way to achieve feature scaling:

    x := (x − average) / (max − min)
Learning rate: how do we determine whether the algorithm is working properly, and how do we choose the learning rate? Plot the value of J against the number of iterations:
If gradient descent is working, the value of the cost function J(θ) decreases as the number of iterations increases (as shown in Figure 3); when the curve flattens out, the algorithm has converged. (This plot can therefore be used both to check that the algorithm works correctly and to judge whether the iteration has converged.)
If gradient descent is not working correctly, as shown in Figure 4, choose a smaller learning rate. (If the code itself is not at fault, the learning rate is probably too large: each step overshoots, so the algorithm fails to move toward smaller values of J, as illustrated in Figures 1 and 2.) Figure 5 shows another way the algorithm can fail:

Figure 5 likewise indicates a learning rate that is too large, so the iterates keep crossing back and forth over the minimum; again, a smaller learning rate should be chosen.
Figure 6 shows a learning rate that is too small: convergence is then very slow, and a larger learning rate should be chosen.
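The feature-scaling and learning-rate points above can be illustrated together. This is a sketch with made-up data; `mean_normalize` and `cost_history` are hypothetical helper names, not functions from the original notes.

```python
import numpy as np

def mean_normalize(X):
    """Mean normalization: scale each feature column by (x - average) / (max - min)."""
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def cost_history(X, y, alpha, num_iters):
    """Run gradient descent and record J(theta) after every iteration."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(num_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        history.append(((X @ theta - y) ** 2).sum() / (2 * m))
    return history

# One feature with a large raw range (e.g. a size in square feet),
# normalized, plus the x0 = 1 column.
raw = np.array([[500.0], [1000.0], [1500.0], [2000.0]])
X = np.hstack([np.ones((4, 1)), mean_normalize(raw)])
y = np.array([1.0, 2.0, 3.0, 4.0])

good = cost_history(X, y, alpha=0.1, num_iters=500)       # J shrinks steadily
too_small = cost_history(X, y, alpha=0.001, num_iters=500)  # converges very slowly
```

Plotting each history against the iteration count reproduces the diagnostic described above: a healthy curve falls and flattens, while a learning rate that is too small leaves J still high after many iterations.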
In addition to the gradient descent algorithm, you can also use the
normal equation method, which solves for the optimal θ in a single step. Given m examples (x^(1), y^(1)), ..., (x^(m), y^(m)) with n features, define:

- X: the m × (n+1) design matrix whose ith row is (x^(i))^T (with x_0^(i) = 1)
- y: the m-dimensional vector of targets
- θ: the (n+1)-dimensional parameter vector

Then:

    θ = (X^T X)^(−1) X^T y

that is, the inverse of X-transpose-times-X, times X-transpose, times y, gives the optimal parameters. In Octave this is pinv(X' * X) * X' * y. The gradient descent algorithm and the normal equation method compare as follows:
Suppose there are m training examples and n features.

| Gradient Descent | Normal Equation |
| --- | --- |
| Needs to choose the learning rate | No need to choose a learning rate |
| Feature scaling matters (very important) | No feature scaling required |
| Requires many iterations | No iterations |
| Still works well when n is very large | Needs to compute (X^T X)^(−1), which is very slow when n is very large (for example, n = 10000) |
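The normal equation from the comparison above can be sketched in NumPy. The toy design matrix is illustrative; `pinv` (the pseudoinverse) is used instead of a plain inverse so the computation also tolerates a singular X^T X.

```python
import numpy as np

# Design matrix X (m x (n+1), first column is x0 = 1) and target vector y,
# chosen so that y = 1 + 2*x1 holds exactly.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# theta = (X^T X)^(-1) X^T y: the optimal parameters in one step, no iterations.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
```

On this data the solution is θ = (1, 2), matching the line the points were drawn from; no learning rate or feature scaling was involved.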