Logistic regression model

Original source: http://blog.csdn.net/hechenghai/article/details/46817031

These notes draw mainly on the books Statistical Learning Methods and Machine Learning in Action; what follows is intended as a study reference.

First, the difference between logistic regression and linear regression. In linear regression, the predicted value y is a linear superposition of the dimensions x_i of a sample x (the weights w_i of that superposition are the model parameters), and the model is obtained by minimizing the error between the predicted y and the true value y' over all samples. So in linear regression the model output y is a linear superposition of the dimensions of x; the relationship is linear.

With y = wx (assuming w > 0), y increases linearly with each dimension of x; taking x to be one-dimensional for convenience, the graph of y against x is a straight line.

Now take a look at our logistic regression model, whose formula is y = 1/(1 + exp(-wx)). Again assume w > 0; here y depends on the dimensions of x through their linear superposition wx followed by the logistic function (again take x to be one-dimensional for convenience), and the graph is the S-shaped logistic curve.

We see that y does not change linearly with the superposition of the dimensions of x; instead it changes smoothly. It changes quickly when the superposition wx is near 0, and when wx is already very large or very small, making it larger or smaller changes y only very slightly. As the superposition of the dimensions of x goes to positive infinity, y approaches 1; as it goes to negative infinity, y approaches 0.
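To see the two shapes side by side, here is a small MATLAB sketch (the weight w = 1 is just an assumed value for illustration) that plots the linear and the logistic relationship for a one-dimensional x:

   % Sketch: compare the linear and the logistic relationship for one-dimensional x (assumed w = 1)
   w = 1;                                    % assumed positive weight
   x = linspace(-10, 10, 200);               % one-dimensional input
   y_linear   = w .* x;                      % linear regression output: grows without bound
   y_logistic = 1 ./ (1 + exp(-w .* x));     % logistic output: smooth and bounded in (0, 1)
   figure;
   subplot(1, 2, 1); plot(x, y_linear);   title('y = wx');
   subplot(1, 2, 2); plot(x, y_logistic); title('y = 1/(1+exp(-wx))');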

This relationship between the independent variable and the dependent variable is called the logistic relationship. (Note that it is not true in general that y approaches 1 when the superposition of the dimensions of x goes to infinity; that holds under the assumption w > 0. If w < 0, y approaches 0 instead. The weights w are learned from the training samples: they may be greater than 0, less than 0, or mixed, for example w1 > 0 and w2 < 0. So it is wrong to assume that just because every dimension x1, x2, x3, ... is large, y approaches 1. Intuition of the form "when x is big, y is big" is a strong assumption built into the model before even looking at the samples, and it may well be false: the samples may show that y is small when x is very large.)

So we see that in logistic regression the superposition of the dimensions of x (or each dimension of x) is related to y not linearly but through the logistic relationship, whereas in linear regression the superposition of the dimensions of x, and hence each dimension of x, is related to y linearly.

This is not the only possible relationship between the dimensions of x and y; other mappings could be chosen instead.

Why choose the logistic relationship between the independent variable and the dependent variable? (1) We need y to represent a probability, so we need y ∈ (0, 1). (2) We need y to change quickly, and nonlinearly, when the superposition of the dimensions of x is near 0, and to be almost flat when that superposition is very large or very small; this matches our intuition about probabilities. As an intuitive example, the effort it takes to raise an exam score from 60 to 80 and from 80 to 100 is not related linearly. (3) The cost function formed from this relationship is a convex function (illustrated by the sketch below).

So we chose the logistic.
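To illustrate point (3), here is a small MATLAB sketch on an assumed toy one-dimensional dataset: sweeping a single weight w and computing the averaged, negated log-likelihood (the cost function introduced below) traces out a convex curve.

   % Sketch: convexity of the cost as a function of a single weight w (toy data, assumed)
   x = [-2; -1; 0.5; 1; 2];        % toy inputs (assumed)
   y = [ 0;  0; 1;   1; 1];        % toy labels (assumed)
   ws = linspace(-5, 5, 200);      % candidate weights to sweep
   J  = zeros(size(ws));
   for k = 1:length(ws)
       h    = 1 ./ (1 + exp(-ws(k) .* x));                  % predicted P(y = 1) for every sample
       J(k) = mean(-y .* log(h) - (1 - y) .* log(1 - h));   % averaged negated log-likelihood
   end
   figure; plot(ws, J); xlabel('w'); ylabel('J(w)');        % the curve is convex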

As we have already said, we use logistic regression for the two-class problem (y takes only two values, A and B, or 1 and 0; the naming does not matter). The regression model does not output the y value of a predicted sample x. (Note: in logistic regression our lowercase y_i denotes the class of a sample x_i, while uppercase Y, or Y(x_i), denotes the probability, predicted by the model, that sample x_i belongs to class 1. It would be better to use a different letter to avoid confusion, but it is written this way here, so keep that in mind.) Instead, the model outputs the probability that y = 1 or y = 0. We take the probability of y = 1 to be Y(x) = 1/(1 + exp(-wx)), so the probability of y = 0 is 1 - Y(x) = exp(-wx)/(1 + exp(-wx)). (We could equally well have assigned the first formula to y = 0; the choice is arbitrary. The only difference is that the learned parameters w would come out differently, because w is trained from the data and in either case the model ends up describing the same samples. We are simply used to mapping Y(x) to the probability of y = 1 in the logistic model.)

Also note that for a given sample x_i we do not separately predict both the probability of y = 1 and the probability of y = 0. For a sample x_i with y_i = 1 we use Y(x_i) as its probability, and for a sample x_i with y_i = 0 we use 1 - Y(x_i); that is, the probability assigned to x_i depends on the value of y_i. Both cases can be written compactly as P(y_i | x_i) = Y(x_i)^(y_i) * (1 - Y(x_i))^(1 - y_i).

Because our output y is a probability, we cannot simply minimize the prediction error; instead we use the (averaged, negated) log-likelihood of all samples as the objective: L(w) = (1/m) * Σ_{i=1..m} [ -y_i * log Y(x_i) - (1 - y_i) * log(1 - Y(x_i)) ].

Here y_i denotes the true class of x_i (1 or 0), and L(w) is the cost function. Its value depends on w, the weights of the linear superposition of the dimensions of x.

So what is the physical meaning of these weights? They determine the direction of the relationship between the dimensions of x and the value of y. That may sound abstract, so look at some concrete examples. (Here x is two-dimensional, but after folding the constant term in as a homogeneous coordinate x becomes three-dimensional with its first dimension equal to 1; w is therefore three-dimensional, and its first element is the constant term, whose size does not affect the direction, so we mainly consider the other two elements.) For example (a plotting sketch follows these four cases):

When w = [-15 0.2 0.2], the graph shows a logistic relationship in the 45° direction.

When w = [-15 -0.2 -0.2], the graph shows a logistic relationship in the -45° direction.

When w = [-15 0.2 0.01], the graph shows a logistic relationship in the 0° direction (a logistic relationship formed along the x1 axis).

When w = [-15 0.01 0.2], the graph shows a logistic relationship in the 90° direction (a logistic relationship formed along the x2 axis).
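These four cases can be reproduced with a small MATLAB sketch (the plotting range 0..100 is an assumption, roughly matching exam-style scores); swap in the other weight vectors to see the other directions:

   % Sketch: the logistic surface P(y=1) = 1/(1+exp(-(w(1) + w(2)*x1 + w(3)*x2))) for w = [-15 0.2 0.2]
   w = [-15, 0.2, 0.2];                     % also try the other weight vectors listed above
   [x1, x2] = meshgrid(0:1:100, 0:1:100);   % assumed input range
   p = 1 ./ (1 + exp(-(w(1) + w(2).*x1 + w(3).*x2)));
   figure; surf(x1, x2, p); shading interp;
   xlabel('x1'); ylabel('x2'); zlabel('P(y = 1)');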

Below we find the extremum of L(w).

L(w) is the negated log-likelihood. Only because it is negated can we use gradient descent to find the minimum; if it were not negated we would have to use gradient ascent to find the maximum, which we generally do not do. Also note that whatever the cost function is (least squares, mean squared error, the log-likelihood, and so on), we take the average, i.e. we put a 1/m in front. The iterative formula obtained with gradient descent is w_j := w_j - α * (1/m) * Σ_{i=1..m} (Y(x_i) - y_i) * x_ij, where w_j denotes the j-th element of the model parameter vector w.

Here α is the learning rate, y_i is the real label of the i-th sample vector (its class, 0 or 1), Y(x_i) is the probability, predicted by the regression model, that the i-th sample belongs to class 1, and x_ij is the j-th element of the i-th sample vector x_i. Be careful not to forget the Σ (i = 1..m). (Note that the gradient used in gradient descent is the partial derivative of the objective function with respect to each element of the model parameter vector w, so each element of w is updated one by one. Of course, in MATLAB the computation is done with vectors, but in essence each value of w is computed individually rather than the whole vector being solved for directly.)
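To make the element-by-element view concrete, here is a small MATLAB sketch on assumed toy data that computes the gradient both with explicit loops over j and i and in the vectorized form x'*(h - y) used in the program below; the two results agree:

   % Sketch: per-element vs. vectorized gradient of the cost (toy data, assumed)
   x = [1 34 78; 1 60 86; 1 79 75; 1 45 56];   % m x (n+1) samples, first column is the constant dimension
   y = [0; 1; 1; 0];                           % assumed labels
   w = zeros(3, 1);                            % current parameter vector
   [m, n1] = size(x);
   h = 1 ./ (1 + exp(-x * w));                 % Y(x_i) for every sample
   grad_loop = zeros(n1, 1);                   % element by element: grad_j = (1/m) * sum_i (Y(x_i) - y_i) * x_ij
   for j = 1:n1
       for i = 1:m
           grad_loop(j) = grad_loop(j) + (h(i) - y(i)) * x(i, j);
       end
   end
   grad_loop = grad_loop / m;
   grad_vec = (1/m) .* x' * (h - y);           % vectorized form; identical result
   disp(max(abs(grad_loop - grad_vec)))        % prints (approximately) 0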

Notice that this iterative formula is the same as the one for linear regression; the difference is that in linear regression Y(x_i) is simply the linear superposition w*x_i of the dimensions of x_i, whereas here the dimensions of x_i are first linearly superimposed as w*x_i and then passed through a nonlinear mapping (the logistic map) into (0, 1) to give Y(x_i). So logistic regression can also be seen as handling a linear problem: it learns the weights of the linear superposition of the dimensions, only the superposition is not compared with the class of x_i directly but is first mapped nonlinearly to the probability of belonging to the class. The maximum likelihood method then finds the model parameters (the weights of the linear superposition; the nonlinear logistic mapping itself has no parameters).

Here is an example. The training samples are the scores of 80 students on two exams, and the label y_i says whether the corresponding student is admitted to university. After training, the model is used to predict whether a student will be admitted. In the data, admitted is coded as 1 and not admitted as 0.

Program code (Gradient descent method):

   clear all; close all; clc
   x = load('ex4x.dat');   % each row is a sample
   y = load('ex4y.dat');
   [m, n] = size(x);
   sample_num = m;
   x = [ones(m, 1), x];    % add a constant first dimension to x, as explained in the earlier sections
   % Plot the training data, using different markers for positives and negatives
   figure;
   pos = find(y == 1); neg = find(y == 0);  % pos and neg hold the indices of the samples with y = 1 and y = 0
   plot(x(pos, 2), x(pos, 3), '+')          % '+' marks the samples with y_i = 1
   hold on
   plot(x(neg, 2), x(neg, 3), 'o')
   hold on
   xlabel('Exam 1 score')
   ylabel('Exam 2 score')

   itera_num = 500;                        % number of iterations
   g = inline('1.0 ./ (1.0 + exp(-z))');   % this creates the function g(z) = 1.0 ./ (1.0 + exp(-z))
   plotstyle = {'b', 'r', 'g', 'k', 'b--', 'r--'};

   figure;   % create a new window
   alpha = [0.0009, 0.001, 0.0011, 0.0012, 0.0013, 0.0014];   % try these learning rates to see which works best
   for alpha_i = 1:length(alpha)   % alpha_i = 1, 2, ..., 6 indexes both alpha and plotstyle: alpha(alpha_i), plotstyle(alpha_i)
       theta = zeros(n+1, 1);      % weights of the dimensions of a sample x_i, as a vector; three-dimensional, initialized to 0
       J = zeros(itera_num, 1);    % J(i) holds the value of the cost function at the i-th iteration
                                   % (the negated log-likelihood; because it is negated, we minimize it)
       for i = 1:itera_num
           z = x * theta;   % a column vector: because x holds all the samples, this is the linear superposition w*x_i
                            % of every sample at once; the formula is written for a single sample, but MATLAB
                            % processes all the samples together
           h = g(z);        % h is the mapped probability for a sample x_i with y_i = 1;
                            % if a sample x_i has y_i = 0, the corresponding probability is 1 - h
           J(i) = (1/sample_num) .* sum(-y.*log(h) - (1-y).*log(1-h));   % vectorized form of the cost; J(i) is a scalar
           grad = (1/sample_num) .* x' * (h - y);   % vectorized form of grad_j = (1/m) * sum_i (Y(x_i) - y_i) * x_ij;
                                                    % in the formula (Y(x_i) - y_i) and x_ij are scalars, so the code
                                                    % cannot copy the formula literally; compare the two carefully
           theta = theta - alpha(alpha_i) .* grad;
       end
       plot(0:itera_num-1, J(1:itera_num), char(plotstyle(alpha_i)), 'LineWidth', 2)
       % plotstyle is a cell array, so after indexing it with () the result must be converted with char;
       % indexing with {} would make the conversion unnecessary
       hold on
       if (0.0013 == alpha(alpha_i))   % the result is best when alpha is 0.0013, so the theta after those iterations is the one we want
           theta_best = theta;
       end
   end
   legend('0.0009', '0.001', '0.0011', '0.0012', '0.0013', '0.0014');
   xlabel('Number of iterations')
   ylabel('Cost function')

   prob = g([1, 20, 80]*theta);
   % feeding [1, 20, 80]*theta into g(z) gives the probability of being admitted (i.e. y = 1)
   % when exam1 = 20 and exam2 = 80

   % Draw the decision boundary: only 2 points are needed to define a line, so choose the endpoints
   plot_x = [min(x(:,2))-2, max(x(:,2))+2];   % the x1 range is widened a little so the plot comfortably contains the sample points
   plot_y = (-1./theta(3)) .* (theta(2).*plot_x + theta(1));
   % How is the boundary drawn? Setting 1/(1+exp(-wx)) = 0.5 gives the linear equation
   % w(2)*x1 + w(3)*x2 + w(1) = 0, separating the region where the admission probability is > 0.5
   % from the region where it is < 0.5; see the discussion after the code.
   figure;
   plot(x(pos, 2), x(pos, 3), '+')   % the boundary is shown together with the sample points, so plot them again
   hold on
   plot(x(neg, 2), x(neg, 3), 'o')
   hold on
   plot(plot_x, plot_y)
   legend('Admitted', 'Not admitted', 'Decision boundary')
   hold off
  

The cost function as a function of the number of iterations, for the different values of the learning rate:

When the learning rate is 0.0014, the curve starts to oscillate, indicating that the learning rate is too large.

When we use logistic regression as a classifier, we assume that a sample with P(y = 1 | x) > 0.5 belongs to one class and a sample with P(y = 1 | x) < 0.5 belongs to the other. The classifier's decision boundary then looks like this:

Looking at this graph, you may feel that logistic regression is more complicated than linear regression, and yet the result looks poor. Why?

First of all, do not use this boundary to judge whether the logistic regression model is good or bad! Let us discuss this, starting with how the boundary formula is produced.

How is the boundary drawn? The problem is to find, in the (x1, x2) plane, the region of points (x1, x2) for which 1/(1 + exp(-wx)) > 0.5, because 1/(1 + exp(-wx)) > 0.5 means that in that region the probability of admission is greater than 0.5, while in the other region admission has probability less than 0.5. The curve 1/(1 + exp(-wx)) = 0.5 therefore gives an equation in x1 and x2, and that equation is the boundary. Solving it, we find that it is the equation of a straight line: w(2)*x1 + w(3)*x2 + w(1) = 0.

Note that we cannot conclude that logistic regression is a linear classifier just because this boundary is a straight line. Logistic regression is not a classifier at all; it has no classification function of its own. It is used to predict probabilities, and it acquires a classification function here only because we rigidly impose a classification rule: a sample with probability > 0.5 is assigned to one class, and one with probability < 0.5 to the other. That is a strong assumption: we might predict that a sample belongs to a class with probability 0.6, which is not a very high probability, and yet we still assign it to that class simply because 0.6 > 0.5. So the boundary produced by logistic regression plus this added rule may not separate the samples very well, but that is not the fault of logistic regression, because in essence logistic regression is not for classification but for estimating probabilities.
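For completeness, here is the short derivation of the boundary line that plot_y in the program implements, using the same notation as above:

   1/(1 + exp(-wx)) = 0.5
   <=>  exp(-wx) = 1
   <=>  wx = 0
   <=>  w(1) + w(2)*x1 + w(3)*x2 = 0
   <=>  x2 = -(w(2)*x1 + w(1)) / w(3)

The last line is exactly the expression assigned to plot_y in the code.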
