Logistic Regression and Newton's Method
Exercise link: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=DeepLearning&doc=exercises/ex4/ex4.html
The data consist of two test scores for 40 students who were admitted to a university and 40 students who were not, together with a label indicating admission. Based on these data, we build a binary classification model that predicts university admission from the two test scores.
In a binary classification task, we need a function that fits the feature input x to the predicted output y. The ideal choice would be the step function, but because the step function lacks good mathematical properties (for example, it is not differentiable), we choose the sigmoid function instead.
Introduction to the basic formula of Newton's method
Using the sigmoid (S-shaped) function

g(z) = 1 / (1 + e^(-z)),    h_θ(x) = g(θ' * x)

After substituting the features x, h(x) can be interpreted as the probability that y = 1 — though only interpreted as such.
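As a minimal illustration (a Python/NumPy sketch; the parameter values here are made up, not from the exercise), the sigmoid maps any real input into (0, 1), which is what lets h(x) be read as a probability:

```python
import numpy as np

def sigmoid(z):
    """S-shaped function mapping any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -0.25])   # hypothetical parameters
x = np.array([1.0, 2.0])         # feature vector with an intercept term
p = sigmoid(theta @ x)           # read as P(y = 1 | x)
print(p)                         # a value strictly between 0 and 1
```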
The loss function J(θ) is defined as:

J(θ) = -(1/m) * Σ_i [ y(i) * log h_θ(x(i)) + (1 - y(i)) * log(1 - h_θ(x(i))) ]

This loss function comes from maximum likelihood: minimizing it is equivalent to maximizing the log-likelihood of the regression model.
Maximum likelihood means: the larger the probability the model assigns to each sample's true label, the better.
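This can be checked numerically (a Python/NumPy sketch with toy numbers of my own): the loss is the negative average log-likelihood, so predicted probabilities close to the true labels yield a smaller J.

```python
import numpy as np

def loss(h, y):
    """Negative average log-likelihood (cross-entropy) of predictions h."""
    m = len(y)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

y = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])   # probabilities close to the labels
bad  = np.array([0.4, 0.6, 0.3])   # probabilities far from the labels
print(loss(good, y), loss(bad, y))  # the good predictions give a lower loss
```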
We use Newton's method to iteratively update the parameters so as to minimize the loss function.
Newton's update formula is as follows:

θ := θ - H^(-1) * ∇_θ J(θ)
The corresponding gradient formula is as follows:

∇_θ J(θ) = (1/m) * Σ_i ( h_θ(x(i)) - y(i) ) * x(i)
The Hessian matrix is as follows:

H = (1/m) * Σ_i h_θ(x(i)) * (1 - h_θ(x(i))) * x(i) * x(i)'
where x(i) is an (n+1)-dimensional vector, x(i)*x(i)' is an (n+1)×(n+1) matrix, and h_θ(x(i)) and y(i) are scalars.
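In vectorized form, one Newton update built from the gradient and Hessian formulas above can be sketched in Python/NumPy (the data here is a tiny synthetic example, not the exercise data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, X, y):
    """One Newton update: theta <- theta - H^(-1) * grad J."""
    m = len(y)
    h = sigmoid(X @ theta)
    grad = X.T @ (h - y) / m              # (1/m) * sum_i (h_i - y_i) x_i
    W = np.diag(h * (1 - h))              # m x m diagonal weight matrix
    H = X.T @ W @ X / m                   # (n+1) x (n+1) Hessian
    return theta - np.linalg.solve(H, grad)

# tiny synthetic example: an intercept column plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = newton_step(np.zeros(2), X, y)
```

Using `np.linalg.solve` instead of explicitly inverting H is the standard numerically safer way to apply H^(-1) to the gradient.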
When implementing this formula in code, there is one point to watch. The correct implementation is:

H = 1/m .* (x' * diag(h) * diag(1-h) * x);

where diag(h) * diag(1-h) produces an m×m diagonal matrix whose i-th diagonal element is h_i * (1 - h_i). (When I first typed the formula, I wrote h*(1-h)' directly instead, and the iteration failed to converge.)
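The pitfall can be checked numerically (a Python/NumPy sketch with made-up values): diag(h)*diag(1-h) is an m×m diagonal matrix with entries h_i(1-h_i), whereas h*(1-h)' is a full outer product, so it produces a different (wrong) Hessian.

```python
import numpy as np

h = np.array([0.2, 0.5, 0.9])                       # made-up sigmoid outputs
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
m = len(h)

# correct: m x m diagonal weight matrix with entries h_i * (1 - h_i)
H_correct = X.T @ np.diag(h) @ np.diag(1 - h) @ X / m

# equivalent vectorized form that never builds the m x m matrix
H_vector = X.T @ (X * (h * (1 - h))[:, None]) / m

# the bug: h * (1-h)' in MATLAB is a full outer product, not a diagonal
H_wrong = X.T @ np.outer(h, 1 - h) @ X / m

print(np.allclose(H_correct, H_vector))   # the two correct forms agree
print(np.allclose(H_correct, H_wrong))    # the outer-product version differs
```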
Newton's method needs only a small number of iterations, typically about 5-15. However, each Newton iteration costs O(n^3) because of the Hessian inversion, whereas gradient descent needs many more iterations but each one costs only O(n). So when choosing a numerical optimization algorithm for convex optimization: when n < 1000 we prefer Newton's method; when n > 1000 we choose gradient descent.
Code implementation:
clc; clear all; close all;

x = load('ex4x.dat');   % load data
y = load('ex4y.dat');

%%%% -------------------- Data preprocessing -------------------- %%%%
m = length(y);
x = [ones(m,1), x];
% sigma = std(x);                         % standard deviation
% mu = mean(x);                           % mean
% x(:,2) = (x(:,2) - mu(2)) ./ sigma(2);  % normalization
% x(:,3) = (x(:,3) - mu(3)) ./ sigma(3);  % normalization

g = inline('1.0 ./ (1.0 + exp(-z))');   % sigmoid function
theta = zeros(size(x(1,:)))';           % initialize fitting parameters
J = zeros(8,1);                         % initialize loss function values

for num_iterations = 1:8
    h = g(x*theta);                                               % sigmoid values
    deltaJ = 1/m .* x' * (h - y);                                 % gradient
    H = 1/m .* (x' * diag(h) * diag(1-h) * x);                    % Hessian
    J(num_iterations) = 1/m * sum(-y'*log(h) - (1-y)'*log(1-h));  % loss
    theta = theta - H^(-1) * deltaJ;                              % Newton update
end

x1 = [1, 20, 80];   % test scores 20 and 80, with intercept term
h = g(x1*theta)     % prediction
The final prediction, 0.332, is the probability that a student with these two test scores is admitted.
Incidentally, Newton's method does not require the data to be normalized, although you may still do so.
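For readers working in Python rather than MATLAB, the whole procedure can be sketched with NumPy. The data below is a small hand-made toy set, not the ex4x.dat/ex4y.dat files, so the fitted numbers will differ from the exercise:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iterations=8):
    """Fit logistic regression with Newton's method (mirrors the MATLAB loop)."""
    m, n = X.shape
    theta = np.zeros(n)
    J = []
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        grad = X.T @ (h - y) / m                    # gradient of the loss
        H = X.T @ (X * (h * (1 - h))[:, None]) / m  # Hessian
        J.append(-np.mean(y * np.log(h) + (1 - y) * np.log(1 - h)))
        theta = theta - np.linalg.solve(H, grad)    # Newton update
    return theta, J

# toy, non-separable data: an intercept column plus one feature
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0],
              [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta, J = newton_logistic(X, y)
print(J[0], J[-1])   # the loss shrinks over the Newton iterations
```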
Deep learning practice: completing logistic regression with Newton's method