Posing the regression problem
First, it should be made clear that the fundamental purpose of regression is prediction. For a given problem it is generally impossible to measure every case (too much work), so we measure a set of data and use it to predict the values that were not measured.
For example, the course gives a table of corresponding house areas, numbers of rooms, and prices. Measuring every possible case is not feasible. With this set of measurements, we want to estimate the price of an unmeasured house, say one of 2800 square feet with 5 rooms; this is where a regression algorithm comes in.
Derivation of the regression algorithm
Given the above problem, how do we estimate the house price? First a model must be built, and the simplest model is the linear one, written as a function:
\(h_\theta(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2\)
where \(x_1\) is the house area, \(x_2\) is the number of rooms, \(h\) is the corresponding house price, and \(\theta_j\) are the coefficients we need to solve for.
For each specific problem, it is necessary to judge from the measured data whether a linear model is appropriate. Assuming linearity limits the model's scope of application: if the house area is not linearly related to the price, the estimated price may be quite skewed. One could in practice assume some other relationship (exponential, logarithmic, etc.), but that would no longer be linear regression, so it is not discussed here.
The above formula can be written in vector form:
\(h_\theta(x) = \sum_{i=0}^n \theta_i x_i = \theta^T x\)
where
\(\theta = (\theta_0, \theta_1, \ldots, \theta_n)^T\)
\(x = (1, x_1, \ldots, x_n)^T\)
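As a concrete illustration of the vector form, the hypothesis is just an inner product. The numbers below are made up for illustration, not taken from the course data:

theta = [50; 0.1; 20];    % [theta_0; theta_1; theta_2], made-up values
x     = [1; 2800; 5];     % [1; area in square feet; number of rooms]
h     = theta' * x;       % predicted price: 50 + 0.1*2800 + 20*5 = 430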
The measurement data above can be represented as \((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})\), where \(y\) is the measured house price. How to solve for the parameters \(\theta\) from these \(m\) measurements is the problem we need to solve.
We can constrain the solution by requiring that the prediction error on this set of measurements be minimized. The cost function is
\(J(\theta) = \frac{1}{2} \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2\)
The cost function expresses the sum of squared prediction errors over the measured data. By minimizing it we can estimate the parameters \(\theta\). The leading 1/2 has no special meaning; it is added only to simplify the derivative. In fact 1/m would be more meaningful, since it would make \(J\) the mean squared error.
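With the predictions stacked into a vector, \(J(\theta)\) is simply half the squared norm of the residuals. A minimal MATLAB sketch with made-up numbers (H and Y are illustrative names for the prediction and label vectors, not from the code later in this post):

% Illustrative only: three predictions H and labels Y (made-up numbers)
H = [430; 350; 270];
Y = [400; 330; 300];
J = 0.5 * sum((H - Y).^2);   % J = 0.5 * (30^2 + 20^2 + 30^2) = 1100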
Solving the regression problem
How do we solve the above minimization? Possible methods include gradient descent, Newton's method, and least squares. Here we mainly discuss gradient descent, because it is used heavily later: neural networks, reinforcement learning, and other methods are also solved with gradient descent.
A function increases fastest along the direction of its gradient, so to find its minimum we can iterate in the direction opposite to the gradient. That is, starting from an initial position, find the direction in which the function decreases fastest at the current position, take a step of a certain size to reach the next position, then find the next steepest-descent direction, and so on until convergence. Written as a formula, the above procedure is:
\(\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)
According to the above expression, the partial derivative of the cost function can be obtained:
\(\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) = \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}\)
In this way, the iteration rule is
\(\theta_j = \theta_j - \alpha \sum_{i=1}^m \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad (j = 0, 1, \ldots, n)\)
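A vectorized MATLAB sketch of one such update on a tiny made-up data set (the variable names and numbers here are illustrative, not the code listed later; X holds one sample per column with the constant term appended):

% Illustrative example: n = 2 (one feature plus constant term), m = 3 samples
X     = [1 2 3; 1 1 1];          % one sample per column, constant term in last row
Y     = [5; 10; 15];             % labels
Theta = zeros(2, 1);
Alpha = 0.01;

H     = X' * Theta;              % predictions h_theta(x^(i)) for all m samples
Grad  = X * (H - Y);             % sum over i of (h_theta(x^(i)) - y^(i)) * x^(i)
Theta = Theta - Alpha * Grad;    % one batch update: step against the gradient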
This update rule is called batch gradient descent. Looking at it carefully, each iteration has to go through all m samples; if m is large, each iteration becomes very slow. Stochastic gradient descent (also called incremental gradient descent), which uses only one sample per iteration, can greatly reduce the computation. Its update rule is:
\(\theta_j = \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad (i = 1, 2, \ldots, m; \; j = 0, 1, \ldots, n)\)
If convergence has not been reached after going through all the samples, iteration continues again from the first sample.
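A sketch of the corresponding stochastic gradient descent loop on the same tiny illustrative data (MaxPasses is an assumed cap on passes over the data, not a name from the code below):

% Illustrative stochastic gradient descent loop
X     = [1 2 3; 1 1 1];          % one sample per column, constant term in last row
Y     = [5; 10; 15];
Theta = zeros(2, 1);
Alpha = 0.01;
MaxPasses = 100;                 % assumed cap on passes over the data

for pass = 1 : MaxPasses
    for i = 1 : size(X, 2)
        xi    = X(:, i);                                    % the i-th sample
        Theta = Theta - Alpha * (xi' * Theta - Y(i)) * xi;  % single-sample update
    end
    % a convergence test on Theta would normally go here
end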
Algorithm implementation and results
First, the following code generates a set of data. For convenient display later, the data are points on a straight line with some added noise:
N = 100;
x = rand(N, 1) * 10;
y = 5 * x + 5 * randn(N, 1);
Sample = [x y];
save('Data.mat', 'Sample');
figure, plot(x, y, 'o');
The generated data are plotted below:
The linear regression is solved with gradient descent using the following function:
% Theta: return value, the regression result
% IterInfo: intermediate information from the iterations, for debugging and viewing
% Sample: training samples, one sample per row, the last value of each row is the label
% BatchSize: number of samples used in each iteration
function [Theta, IterInfo] = LinearRegression(Sample, BatchSize)
[m, n] = size(Sample);                      % m samples, each n-dimensional
Y = Sample(:, end);                         % labels
X = [Sample(:, 1:end-1), ones(m, 1)]';      % append constant term 1, one sample per column

BatchSize = min(m, BatchSize);
Theta  = zeros(n, 1);
Theta0 = Theta;
Alpha  = 1e-2 * ones(n, 1);
StartId = 1;

IterInfo.Grad  = [];
IterInfo.Theta = [Theta];

% gradient descent, iterative solution
MaxIterTime = 5000;
for i = 1 : MaxIterTime
    EndId = StartId + BatchSize;
    if (EndId > m)
        TX = [X(:, StartId:m), X(:, 1:EndId-m)];
        TY = [Y(StartId:m); Y(1:EndId-m)];
    else
        TX = X(:, StartId:EndId);
        TY = Y(StartId:EndId);
    end

    Grad  = CalcGrad(TX, TY, Theta);
    Theta = Theta + Alpha .* Grad;

    % record intermediate results
    IterInfo.Grad  = [IterInfo.Grad Grad];
    IterInfo.Theta = [IterInfo.Theta Theta];

    % convergence test
    Delta = Theta - Theta0;
    if (Delta' * Delta < 1e-10)
        break;
    end

    Theta0  = Theta;
    StartId = EndId + 1;
    StartId = mod(StartId, m) + 1;
end

IterInfo.Time = i;
end

% gradient calculation: (y - theta'x) * x averaged over the batch,
% i.e. the descent direction (negative gradient of J)
function Grad = CalcGrad(X, Y, Theta)
D = 0;
for i = 1 : size(X, 2)
    G = (Y(i) - Theta' * X(:, i)) * X(:, i);
    D = D + G;
end
Grad = D / size(X, 2);
end
Test script:
% regression
load('Data.mat');

BatchSize = 100;
[Theta, IterInfo] = LinearRegression(Sample, BatchSize)

% display results; the code below is not general and cannot display when the sample dimension increases
figure, plot(Sample(:, 1), Sample(:, 2), 'o');
t = 0 : 0.1 : 10;
z = Theta(1) * t + Theta(2);
hold on, plot(t, z, 'r');

for i = 1 : size(IterInfo.Theta, 2)
    Err(i) = Error(Sample, IterInfo.Theta(:, i));
end
figure, plot(log(Err), 'b'); pause(0.1);

[T1, T2] = meshgrid(0 : 0.1 : 20);
for i = 1 : size(T1, 1)
    for j = 1 : size(T1, 2)
        E(i, j) = Error(Sample, [T1(i, j); T2(i, j)]);
    end
end
figure, mesh(T1, T2, E); hold on;

[r, c] = find(E == min(min(E)));
plot3(T1(r, c), T2(r, c), min(min(E)), 'rs', 'MarkerEdgeColor', 'b', ...
    'MarkerFaceColor', 'r', ...
    'MarkerSize', 10); hold on;

T1 = IterInfo.Theta(1, :);
T2 = IterInfo.Theta(2, :);
for i = 1 : size(IterInfo.Theta, 2)
    IterErr(i) = Error(Sample, IterInfo.Theta(:, i));
end
plot3(T1, T2, IterErr, '--rs', 'LineWidth', 1, ...
    'MarkerEdgeColor', 'k', ...
    'MarkerFaceColor', 'g', ...
    'MarkerSize', 10); hold on;
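The Error function called above is not listed in the post; it presumably evaluates the cost J(θ) for a given θ. A minimal sketch consistent with that use, saved as Error.m, could be:

% Sketch of the Error helper (assumed, not from the original post):
% sum-of-squared-errors cost for the given Theta, matching the data layout above.
function E = Error(Sample, Theta)
m = size(Sample, 1);
X = [Sample(:, 1:end-1), ones(m, 1)]';   % append constant term, one sample per column
Y = Sample(:, end);
R = X' * Theta - Y;                      % residuals
E = 0.5 * (R' * R);
end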
In fact, not much of the above code is directly involved in solving the regression; the rest saves intermediate results and draws plots for debugging and analysis. In the regression result, the blue points are the data saved above and the red line is the fitted regression line:
After each iteration, the change in the cost function J is as follows (since its range is too large, its logarithm is plotted):
As can be seen, the cost function is essentially unchanged after about 1000 iterations. The gradient descent process is shown below (left): the x and y coordinates are \(\theta_0\) and \(\theta_1\), and the z axis is the cost function value for the corresponding \(\theta\). The center of the small red block is the true optimum, and the green squares are the positions at each iteration; the iterations can be seen steadily approaching the optimal solution. Because the green squares overlap heavily, the middle of the plot appears black; the image on the right is a locally magnified view.
Algorithm analysis
1. In the gradient descent method, BatchSize is the number of samples used in one iteration. When it equals m, the method is batch gradient descent; when it is 1, it is stochastic gradient descent. The experiments show that the larger the BatchSize, the more time each iteration takes but the more stable the convergence; the smaller the BatchSize, the faster each iteration but the more oscillation appears. You can modify BatchSize in the test code and observe the results (see the sketch after this list).
2. On the choice of step size. In gradient descent the step size matters a great deal: too small and convergence is very slow, too large and the iteration may easily fail to converge. The step size in the above program was obtained by repeated manual adjustment, and on a different data set it may not converge; this is a weakness of the program. After the regression algorithms are complete, a separate post will analyze this problem and give a solution.
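As mentioned in point 1, the effect of BatchSize can be checked by sweeping it and recording the iteration count returned in IterInfo (a sketch using the LinearRegression function and the Data.mat file generated above):

% Sketch: compare iteration counts for several batch sizes.
load('Data.mat');
for bs = [1 10 50 100]
    [Theta, IterInfo] = LinearRegression(Sample, bs);
    fprintf('BatchSize = %3d : %d iterations\n', bs, IterInfo.Time);
end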