Theoretical knowledge Section:
1 The hypothesis function of logistic regression
In linear regression, if the variable y to be predicted is a discrete value, then we have a classification problem. If y can only take the values 0 or 1, it is a binary classification problem. We can still consider using a regression method to solve it, but since we now know that y \in {0, 1} rather than the whole real line R, we should modify the form of the hypothesis function h_\theta(x) and use the logistic function to map any real number into the range [0, 1]. That is,

h_\theta(x) = g(\theta^T x) = 1 / (1 + e^{-\theta^T x}), where g(z) = 1 / (1 + e^{-z}).
Here we first form a linear combination of all the features, i.e. \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + ..., and then feed this value into the logistic function (also called the sigmoid function) g(z) = 1 / (1 + e^{-z}), which maps it to a value within [0, 1]. The logistic function has an S-shaped curve: as z -> +infinity the function value -> 1, and as z -> -infinity the function value -> 0. Therefore the new hypothesis function h_\theta(x) always lies within the interval [0, 1]. We also add a feature x_0 = 1 to simplify the vector representation. The derivative of the logistic function can be expressed in terms of the function itself, i.e.

g'(z) = g(z)(1 - g(z)).
This conclusion will be used when learning the parameter \theta later.
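As a quick sanity check of this identity, here is a minimal NumPy sketch (my own illustration, not part of the book's listing below) that compares g(z)(1 - g(z)) with a finite-difference estimate of g'(z):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(z) * (1.0 - sigmoid(z))                    # g(z)(1 - g(z))
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # central-difference estimate of g'(z)
print(np.allclose(analytic, numeric))                         # True: g'(z) = g(z)(1 - g(z))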
2 Learning the model parameters \theta of logistic regression with maximum likelihood estimation and gradient ascent
Given the new hypothesis function h_\theta(x), how do we learn the parameter \theta from the training samples? From a probabilistic point of view, we can use maximum likelihood estimation (MLE) to fit the data (MLE plays the same role here that minimizing the cost function plays in the LMS algorithm). We assume that

P(y = 1 | x; \theta) = h_\theta(x)
P(y = 0 | x; \theta) = 1 - h_\theta(x)
That is, the hypothesis function h_\theta(x) gives the probability that y = 1, and 1 - h_\theta(x) gives the probability that y = 0. This probabilistic assumption can be written more compactly as

p(y | x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1 - y}
Suppose we observe m training samples whose generation processes are independent and identically distributed; then we can write the likelihood function

L(\theta) = \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}; \theta) = \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}
Taking the logarithm turns it into the log-likelihood

\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} [ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) ]
We now want to find the parameter \theta that maximizes the log-likelihood \ell(\theta). Equivalently, if we define the cost function J(\theta) = -\ell(\theta), we need to minimize J(\theta).
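To make the objective concrete, here is a minimal sketch (the toy data and parameter values are my own, not taken from the text) that evaluates the log-likelihood \ell(\theta) and the cost J(\theta) = -\ell(\theta) for a small data set:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ]
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# toy data: the first column is the added feature x0 = 1
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -1.0,  0.3],
              [1.0,  2.0, -0.7]])
y = np.array([1, 0, 1])
theta = np.zeros(3)

print(log_likelihood(theta, X, y))    # 3 * log(0.5), since h = 0.5 everywhere when theta = 0
print(-log_likelihood(theta, X, y))   # the cost J(theta) that we would minimize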
Similar to the gradient descent method we used to learn the linear regression parameters, we can use gradient ascent to maximize the log-likelihood. Assuming we have only one training sample (x, y), differentiating gives the stochastic gradient ascent (SGA, also called incremental gradient ascent) rule

\partial \ell(\theta) / \partial \theta_j = (y - h_\theta(x)) x_j

The derivation uses the property of the logistic function's derivative, g' = g(1 - g). So we obtain the parameter update rule

\theta_j := \theta_j + \alpha (y - h_\theta(x)) x_j
Here a quantity is added to \theta_j at each step, which is exactly what gradient ascent does; \alpha is the learning rate. This update rule looks formally identical to the LMS update rule of linear regression, but it is different in substance because the hypothesis function h_\theta(x) is different: in linear regression h_\theta(x) is just a linear combination of all the features, whereas in logistic regression the linear combination of all the features is first formed and then mapped into the interval [0, 1] by the logistic function, so h_\theta(x) is no longer a linear function. In fact, both algorithms are special cases of generalized linear models.
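A minimal sketch of this per-sample update (my own toy example; the book's batch implementation follows in the next section) could look like this:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_epoch(theta, X, y, alpha=0.01):
    # one pass over the data, applying theta_j := theta_j + alpha * (y - h_theta(x)) * x_j per sample
    for x_i, y_i in zip(X, y):
        h = sigmoid(np.dot(theta, x_i))
        theta = theta + alpha * (y_i - h) * x_i
    return theta

X = np.array([[1.0,  0.5,  1.2],
              [1.0, -1.0,  0.3],
              [1.0,  2.0, -0.7]])
y = np.array([1, 0, 1])
theta = np.zeros(3)
for _ in range(200):                  # repeat a few epochs over the toy data
    theta = sga_epoch(theta, X, y)
print(theta)                          # weights that roughly separate the two classes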
Python implementation section (from Machine Learning in Action, Chapter 5):
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        # x0 = 1.0 is the added intercept feature; x1 and x2 come from the file
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)                   # m x n matrix of samples
    labelMat = mat(classLabels).transpose()       # m x 1 column vector of labels
    m, n = shape(dataMatrix)
    alpha = 0.001                                 # learning rate
    maxIterations = 500                           # number of gradient-ascent steps (500 is the value used in the book)
    weights = ones((n, 1))
    for k in range(maxIterations):
        h = sigmoid(dataMatrix * weights)         # h_theta(x) for all samples at once
        error = labelMat - h                      # (y - h_theta(x))
        weights = weights + alpha * dataMatrix.transpose() * error   # batch gradient-ascent step
    return weights

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        # collect the (x1, x2) coordinates of each class separately
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    # decision boundary: w0 + w1*x1 + w2*x2 = 0, solved for x2
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

if __name__ == "__main__":
    dataArr, labelMat = loadDataSet()
    print(dataArr)
    print(labelMat)
    weights = gradAscent(dataArr, labelMat)
    print(weights)
    plotBestFit(weights.getA())
The most important step in gradAscent is the \theta update derived in the preceding formula, weights = weights + alpha * dataMatrix.transpose() * error, applied to all samples at once. plotBestFit draws the resulting decision boundary over the scattered data points.
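Once gradAscent has been run, the learned weights can also be used to classify a new point. The classify helper below is my own addition (it is not part of the book's listing) and assumes the functions above are defined in the same session:

def classify(point, weights):
    # point = [1.0, x1, x2]; weights is the (n, 1) matrix returned by gradAscent;
    # predict 1 if sigmoid(w0 + w1*x1 + w2*x2) > 0.5, i.e. the point lies above the boundary
    prob = sigmoid(sum(array(point) * weights.getA().flatten()))
    return 1 if prob > 0.5 else 0

dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)
print(classify([1.0, 0.5, 1.2], weights))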