Machine learning (VI): Logistic regression


I have recently been looking at machine-learning algorithms. Today the topic is logistic regression: after a brief analysis of the algorithm, I implement it in code and verify it on an example.

1. Logistic overview

As I understand it, regression means finding the relationship between variables, that is, solving for regression coefficients; regression is often used to predict a target value. Both regression and classification belong to supervised learning; the difference is that the target variable of regression must be a continuous numerical type.

The main idea of logistic regression, our subject today, is to build a regression formula from existing data and use it as a classification boundary. It is widely used in epidemiology: a common application is to explore the risk factors of a disease and to predict the probability that the disease occurs given those factors. The dependent variable of logistic regression can be binary or multi-class, but the binary case is more common and easier to interpret, so binary logistic regression is used most often.

Today we analyze the binary case. We need a function that accepts all the inputs and predicts the category; let 0 and 1 denote the two classes. The curve we need is S-shaped, which brings to mind the sigmoid function: σ(z) = 1/(1 + e^(-z)). To implement a logistic regression classifier, we multiply each feature by a regression coefficient, sum all the products, and substitute the sum into the sigmoid function to get a number between 0 and 1; if the value is greater than 0.5 the sample is classified as class 1, otherwise as class 0.
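Before the full implementation below, the sigmoid-plus-threshold idea can be sketched in a few lines (a minimal illustration; the function names here are my own, not from the code that follows):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Class 1 if sigmoid(z) exceeds the threshold, else class 0."""
    return 1 if sigmoid(z) > threshold else 0
```

Note that sigmoid(0) = 0.5, so the 0.5 threshold on the probability is equivalent to a sign test on z itself.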

Based on the previous analysis, we need to find the regression coefficients. The input to the sigmoid function can be written as z = w0x0 + w1x1 + ... + wnxn, where the x's are the input data and the w's are the coefficients we seek. To obtain the best coefficients we can draw on optimization theory and choose the gradient ascent method. Its basic idea: to find the maximum of a function, the best way is to search along the direction of the function's gradient. To learn more about this approach, I recommend Andrew Ng's machine learning course; there the emphasis is on gradient descent, which seeks a function's minimum rather than its maximum, but the underlying idea is the same.
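To see gradient ascent in isolation before applying it to logistic regression, here is a one-variable toy example of my own (not from the original code): maximizing f(w) = -(w - 3)^2, whose gradient is -2(w - 3), should drive w toward 3.

```python
def grad_ascent_1d(grad, w0, alpha=0.1, steps=200):
    """Repeatedly step in the direction of the gradient to climb toward a maximum."""
    w = w0
    for _ in range(steps):
        w = w + alpha * grad(w)
    return w

# maximize f(w) = -(w - 3)**2; its gradient is -2 * (w - 3)
w_best = grad_ascent_1d(lambda w: -2.0 * (w - 3.0), w0=0.0)
```

Each update here is w ← w + 0.1·(-2(w - 3)) = 0.8w + 0.6, so the distance to the maximizer shrinks by a factor of 0.8 per step; gradient descent is the same iteration with the sign of the step flipped.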

2. Python implementation

Based on the previous analysis, in this section we implement logistic regression in Python (I am using Python 2.7; the code below also runs under Python 3). The code is as follows:

# coding: utf-8
from numpy import *
import matplotlib.pyplot as plt

# load the data
def loadDataSet():
    dataMat = []
    labelMat = []
    fr = open('TestSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()   # split each line of text into a list
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

# define the sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

# find the regression coefficients by gradient ascent
def gradAscent(data, label):
    dataMat = mat(data)
    labelMat = mat(label).transpose()    # note: the elements of label should be ints
    m, n = shape(dataMat)
    alpha = 0.001
    maxCycles = 500
    weights = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMat * weights)
        error = labelMat - h
        weights = weights + alpha * dataMat.transpose() * error
    return weights

# test
# data, label = loadDataSet()
# print(gradAscent(data, label))

# once the regression coefficients are found, the boundary between the data
# categories is determined; plot the line for ease of understanding
def plotBestFit(weights):
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]   # points where w0 + w1*x + w2*y = 0
    ax.plot(x, y)
    plt.xlabel('X1')
    plt.ylabel('Y1')
    plt.show()

# test
# data, label = loadDataSet()
# weights = gradAscent(data, label)
# plotBestFit(weights.getA())

# improved (stochastic) gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)                                # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.0001       # step size shrinks with each iteration
            randIndex = int(random.uniform(0, len(dataIndex)))
            h = sigmoid(sum(dataMatrix[dataIndex[randIndex]] * weights))
            error = classLabels[dataIndex[randIndex]] - h
            weights = weights + alpha * error * dataMatrix[dataIndex[randIndex]]
            del(dataIndex[randIndex])                # sample each example once per pass
    return weights

# test
# data, label = loadDataSet()
# weights = stocGradAscent1(array(data), label)
# plotBestFit(weights)
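The functions above read the external file TestSet.txt. If that file is not at hand, the batch gradient ascent step can still be sanity-checked on synthetic data; the sketch below is my own (vectorized NumPy variant, made-up data), not part of the original code. Two well-separated clusters should be classified almost perfectly by the learned weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ascent(data, labels, alpha=0.001, max_cycles=500):
    """Batch gradient ascent for logistic regression, as in the text."""
    X = np.asarray(data, dtype=float)               # m x n design matrix
    y = np.asarray(labels, dtype=float).reshape(-1, 1)
    m, n = X.shape
    w = np.ones((n, 1))                             # initialize to all ones
    for _ in range(max_cycles):
        h = sigmoid(X @ w)                          # predicted probabilities
        w = w + alpha * (X.T @ (y - h))             # step along the gradient
    return w

rng = np.random.default_rng(0)
# two separated clusters; prepend the constant 1.0 for the intercept term x0
cls0 = np.column_stack([np.ones(50), rng.normal(-2, 0.5, 50), rng.normal(-2, 0.5, 50)])
cls1 = np.column_stack([np.ones(50), rng.normal(2, 0.5, 50), rng.normal(2, 0.5, 50)])
X = np.vstack([cls0, cls1])
y = np.array([0] * 50 + [1] * 50)

w = grad_ascent(X, y)
preds = (sigmoid(X @ w).ravel() > 0.5).astype(int)
accuracy = (preds == y).mean()
```

On such cleanly separable data the learned boundary should classify nearly every point correctly, which makes this a quick regression test for the update rule.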

3. Example analysis

Based on the previous analysis, this section uses logistic regression to predict whether a horse suffering from colic will survive. The code is as follows:

def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('HorseColicTraining.txt')
    frTest = open('HorseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))

multiTest()
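The evaluation pattern above (classify each test vector, count mismatches against the labels) can be exercised without the horse-colic data files. The weights and the tiny test set below are made up for illustration only, not from the real dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_vector(x, weights):
    """Return 1.0 if the predicted probability exceeds 0.5, else 0.0."""
    return 1.0 if sigmoid(np.dot(x, weights)) > 0.5 else 0.0

def error_rate(test_X, test_y, weights):
    """Fraction of test vectors whose predicted class differs from the label."""
    errors = sum(1 for x, y in zip(test_X, test_y)
                 if int(classify_vector(x, weights)) != int(y))
    return errors / float(len(test_y))

# hypothetical weights and a hand-made test set (intercept term 1.0 comes first)
w = np.array([0.0, 1.0, 1.0])
X = np.array([[1.0,  2.0,  2.0],    # w.x = 4  -> class 1
              [1.0, -2.0, -2.0],    # w.x = -4 -> class 0
              [1.0,  3.0,  1.0],    # w.x = 4  -> class 1
              [1.0, -1.0, -3.0]])   # w.x = -4 -> class 0
y = [1, 0, 1, 0]
rate = error_rate(X, y, w)
```

With these weights all four vectors land on the correct side of the boundary, so the error rate comes out 0; swapping in badly chosen weights raises it accordingly.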

Finally, we see that the error rate is around 35%; it can be reduced further by adjusting the step size and the number of iterations.

The purpose of logistic regression is to find the best-fit parameters of a nonlinear sigmoid function, which can be done with gradient ascent. To reduce the time complexity, gradient ascent can be simplified into stochastic gradient ascent.
