Python Machine Learning Theory and Practice (4): Logistic Regression


From this section onward we move into "regular" machine learning. I call it "regular" because it follows the standard pattern: establish a cost function, optimize that cost function to obtain the weights, and then test and verify. That whole process is an essential part of machine learning. Today's topic is logistic regression, which is also a supervised learning method. Logistic regression is generally used for prediction or classification (prediction here is really just a kind of classification). Linear regression should be familiar to everyone: y = kx + b. Given a set of data points, fit the values of k and b; the next time an x is given, we can compute y. That is regression. Logistic regression is a little different: it uses a nonlinear function with strong fitting power, and, importantly, that function is continuous and differentiable. If a function cannot be differentiated, it is very troublesome to use in machine learning. The sigmoid function replaced the early Heaviside step function precisely for this reason: we can use the derivative to obtain the gradient and then use gradient descent to update the parameters, which is one of the key ideas of optimization methods.

The sigmoid function used by logistic regression is shown in (Figure 1):

(Figure 1)

(Figure 1) shows the shape of the sigmoid function on the interval [-5, 5]. On a wider interval such as [-60, 60] the curve is severely polarized toward 0 and 1, which is exactly what makes it suitable for binary classification. The sigmoid function itself is given in (Formula 1):

(Formula 1)    sigmoid(z) = 1 / (1 + e^(-z))

Now that we have the model for binary classification, we can map features onto it. The sigmoid function has only one independent variable, z. Suppose our feature vector is X = [x0, x1, x2, ..., xn]. Given a large number of training samples with features X, we only need to find an appropriate weight vector W = [w0, w1, w2, ..., wn] such that each sample's features are mapped through the sigmoid function to the correct class; in other words, the binary classification comes out right. Later, when a test sample arrives, we multiply its features by the weights, feed the result into the sigmoid function, and its output is the predicted value. So how do we compute the weights W?
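Written out, the mapping described above takes the following standard form (with x0 fixed to 1 so that w0 acts as the intercept, which is exactly how the data-loading code later in this article builds its feature vectors):

z = W·X = w0*x0 + w1*x1 + ... + wn*xn
h(X) = sigmoid(z) = 1 / (1 + e^(-W·X))

The training goal is to choose W so that h(X) comes out close to 1 for samples of one class and close to 0 for the other.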

To compute W we enter the optimization stage, using the gradient descent method or the stochastic gradient descent method. Speaking of gradient descent, what is the gradient? The gradient is the direction in which a function rises fastest, so by following it (or its negative) we can quickly reach an extreme point. Which extremum are we looking for? Naturally, the minimum of the training model's error: when the error between the model's predictions and the correct values given by the training samples is smallest, the model parameters are the ones we need. (Of course, driving the training error too low may lead to overfitting.) We first establish a cost function for training the model, as shown in (Formula 2):


(Formula 2)

In (Formula 2), y indicates the actual value of the training sample. When J(theta) reaches its minimum, theta is the model weight we require. Since J(theta) is a convex function, its minimum is also the global minimum. Taking the derivative gives the gradient, as shown in (Formula 3):


(Formula 3)
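As a concrete reference, the standard (cross-entropy) cost for logistic regression and the gradient obtained by differentiating it (the derivation uses the handy identity sigmoid'(z) = sigmoid(z)*(1 - sigmoid(z))) are:

J(W) = -Σ_i [ y(i)*log h(x(i)) + (1 - y(i))*log(1 - h(x(i))) ]
∇J(W) = Σ_i ( h(x(i)) - y(i) )*x(i)

Here x(i) is the feature vector of the i-th training sample, y(i) its 0/1 label, and h(x(i)) the sigmoid output; this is the form whose parameter update matches the last line of the gradAscent loop further below.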

Because we are looking for a minimum while the gradient points in the direction of steepest increase, we take its negative and update the parameters along the negative gradient direction, as shown in (Formula 4):


(Formula 4)
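In the same notation, the standard negative-gradient update (the one the gradAscent code below performs, with step size alpha = 0.001) is:

W := W - α*∇J(W) = W + α*Σ_i ( y(i) - h(x(i)) )*x(i)

where α is the learning rate (step size).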

Following the update rule in (Formula 4), when the weights stop changing we claim to have found the extreme point, and those weights are the ones we require. The whole parameter-update procedure is shown in (Figure 2):


(Figure 2)

That covers the theory of logistic regression; now for the actual code:

from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0/(1 + exp(-inX))
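loadDataSet implies a whitespace-separated testSet.txt with two feature columns and an integer class label per line; assuming such a file is in the working directory, a quick sanity check might look like this:

dataMat, labelMat = loadDataSet()
print(len(dataMat), len(dataMat[0]))           # number of samples, 3 columns: [1.0, x1, x2]
print(labelMat[:5])                            # first few class labels (0 or 1)
print(sigmoid(0), sigmoid(60), sigmoid(-60))   # 0.5, ~1.0, ~0.0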

The above two functions load the training set and define the sigmoid function; both are relatively simple. The following code implements the gradient update (gradient ascent on the log-likelihood, which is equivalent to gradient descent on the cost):

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              #convert to NumPy matrix
    labelMat = mat(classLabels).transpose()  #convert to NumPy matrix
    m, n = shape(dataMatrix)
    alpha = 0.001
    maxCycles = 500
    weights = ones((n, 1))
    for k in range(maxCycles):               #heavy on matrix operations
        h = sigmoid(dataMatrix*weights)      #matrix mult
        error = (labelMat - h)               #vector subtraction
        weights = weights + alpha * dataMatrix.transpose() * error  #matrix mult
    return weights

gradAscent takes the training set and the corresponding labels as input, then iterates: on each pass it computes the gradient and updates the parameters. Note that the last line inside the loop updates the parameters exactly according to (Formula 3) and (Formula 4).
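A minimal way to run it (500 iterations with alpha = 0.001, as hard-coded in gradAscent):

dataMat, labelMat = loadDataSet()
weights = gradAscent(dataMat, labelMat)
print(weights)    # a 3x1 matrix holding [w0, w1, w2]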

To see intuitively whether the weights we get are reasonable, we plot the samples together with the decision boundary defined by the weights. Below is the plotting code:

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1]*x)/weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
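One detail worth noting: gradAscent returns a NumPy matrix, while the plotting code indexes weights like a flat array, so it helps to convert it first. A minimal driver, again assuming testSet.txt is present:

dataMat, labelMat = loadDataSet()
weights = gradAscent(dataMat, labelMat)
plotBestFit(weights.getA())   # .getA() turns the 3x1 matrix into a plain ndarray for indexing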

The resulting fit is shown in (Figure 3):


(Figure 3)

It can be seen that the result is quite good; small errors are unavoidable, and in fact a training set with zero error would itself be a warning sign. This method is fine for a small number of samples (a few hundred), but in reality, when we encounter data on the order of billions of samples with thousands of feature dimensions, it becomes terrible: computing the gradient over all samples at every step takes far too long. Therefore we need the stochastic gradient descent method. Its theory is the same as that of ordinary gradient descent, except that the gradient is computed not from all samples but from a single sample (or a small batch). This greatly reduces the computing cost; although the weight-update path becomes tortuous, it eventually converges, as shown in (Figure 4).


(Figure 4)

The following code implements stochastic gradient descent:

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)                        #initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))           #indices of samples not yet used in this pass
        for i in range(m):
            #alpha decreases with iteration but never reaches 0 because of the constant
            alpha = 4/(1.0 + j + i) + 0.0001
            randIndex = int(random.uniform(0, len(dataIndex)))
            sample = dataIndex[randIndex]    #pick a not-yet-used sample at random
            h = sigmoid(sum(dataMatrix[sample]*weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * dataMatrix[sample]
            del(dataIndex[randIndex])        #do not reuse this sample in the current pass
    return weights
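Because stocGradAscent1 uses element-wise multiplication on rows, it expects a NumPy array rather than a plain Python list. A minimal run (names as above) might be:

dataMat, labelMat = loadDataSet()
weights = stocGradAscent1(array(dataMat), labelMat, numIter=150)
plotBestFit(weights)          # weights is already a 1-D ndarray here, no conversion needed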

Finally, here is the classification function. Simply set the threshold to 0.5: samples whose sigmoid output is greater than 0.5 go into one class, and those below 0.5 into the other. The code is as follows:

def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: return 1.0
    else: return 0.0
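For example, classifying a hypothetical new point (x1, x2) = (1.0, 2.0) with the trained weights (remember to prepend the bias term x0 = 1.0, just as loadDataSet does):

dataMat, labelMat = loadDataSet()
weights = stocGradAscent1(array(dataMat), labelMat)
print(classifyVector(array([1.0, 1.0, 2.0]), weights))   # prints 1.0 or 0.0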

Summary:

Advantages: computationally cheap, easy to implement, and the resulting model is easy to interpret.

Disadvantages: prone to underfitting, and the classification accuracy may not be high.

References:

[1] Peter Harrington. Machine Learning in Action.

[2] Andrew Ng. Machine Learning.

That is all for this article. I hope it is helpful for your learning.
