Machine Learning in Action notes, part 5 (Logistic Regression)


1: Simple concept description

Suppose we have some data points and we fit a straight line to them (this line is called the best-fit line); the fitting process is called regression. Training the classifier amounts to finding the optimal fitting parameters.

Sigmoid-based classification: logistic regression accepts any input and then predicts a category. The function that makes this possible is the sigmoid function, which behaves much like a smooth step function. The formula is as follows:

sigmoid(z) = 1 / (1 + e^(-z))

where z = w0*x0 + w1*x1 + ... + wn*xn, the w's are the parameters, and the x's are the features.

To implement a logistic regression classifier, we multiply each feature by its regression coefficient, add up all the products, and feed the sum into the sigmoid function, which yields a value between 0 and 1. Any value greater than 0.5 is classified as class 1, and any value less than 0.5 as class 0. Logistic regression can therefore also be viewed as a probability estimate.
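
A minimal sketch of this rule in NumPy (the weights and the sample here are hypothetical, chosen only to show the computation):

from numpy import array, exp

def sigmoid(z):
    return 1.0 / (1 + exp(-z))

w = array([1.0, -0.5, 0.3])   # hypothetical regression coefficients w0..w2
x = array([1.0, 2.0, 4.0])    # one sample; x0 = 1.0 is the constant term
z = sum(w * x)                # z = w0*x0 + w1*x1 + w2*x2
label = 1 if sigmoid(z) > 0.5 else 0
print(z, sigmoid(z), label)   # z = 1.2, sigmoid(z) is about 0.77, so class 1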

Gradient ascent method: the idea is to find the maximum of a function, and the best direction to search is along the function's gradient.

The weights are updated iteratively as

w := w + alpha * grad f(w)

where alpha is the step size. The update is repeated until a stopping condition is met, for example the number of iterations reaches a specified value or the error falls within an allowable range.
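
A tiny sketch of the idea on a one-variable function, f(x) = -x**2 + 4x, whose maximum is at x = 2 (the function and step size are hypothetical, chosen only for illustration):

def grad(x):
    return -2 * x + 4          # derivative f'(x)

x, alpha = 0.0, 0.1            # starting point and step size
for _ in range(100):           # stop condition: a fixed number of iterations
    x = x + alpha * grad(x)    # step along the gradient
print(x)                       # converges toward 2.0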

2: Python code implementation

(1) Use gradient ascent to find the optimal parameters

from numpy import *

# Load the data set; each sample gets a constant 1.0 for the x0 term
def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

# Compute the sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

# Gradient ascent: compute the regression coefficients
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              # convert to a NumPy matrix
    labelMat = mat(classLabels).transpose()  # column vector of labels
    m, n = shape(dataMatrix)
    alpha = 0.001                            # step size
    maxCycles = 500                          # number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)    # predictions for all samples
        error = labelMat - h                 # error vector
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
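
A minimal way to exercise these functions (a sketch assuming testSet.txt, the book's sample data file, sits in the working directory):

dataMat, labelMat = loadDataSet()
weights = gradAscent(dataMat, labelMat)
print(weights)    # a 3x1 matrix of fitted coefficients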


(2) Draw the decision boundary

# Draw the decision boundary
def plotBestFit(wei):
    import matplotlib.pyplot as plt
    weights = wei.getA()                     # convert the weight matrix to an array
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    # the boundary is where w0 + w1*x1 + w2*x2 = 0, i.e. where sigmoid = 0.5
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
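
Continuing the earlier snippet, the trained weights can be passed in directly (gradAscent returns a NumPy matrix, which plotBestFit converts with getA()):

plotBestFit(gradAscent(dataMat, labelMat))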



(3) Stochastic gradient ascent

Gradient ascent works well on datasets of around 100 samples. However, with billions of samples and thousands of features its computational cost is too high, because every update processes the whole dataset. The improved method is stochastic gradient ascent, which updates the regression coefficients using only one sample at a time and therefore consumes far fewer computing resources. It is an online algorithm: it can update the parameters as new data arrives, without re-reading the entire dataset. Processing all the data at once, by contrast, is called batch processing.

# Stochastic gradient ascent
def stocGradAscent0(dataMatrix, classLabels):
    dataMatrix = array(dataMatrix)
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)
    for i in range(m):                              # one update per sample
        h = sigmoid(sum(dataMatrix[i] * weights))   # scalar prediction
        error = classLabels[i] - h                  # scalar error
        weights = weights + alpha * error * dataMatrix[i]
    return weights
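
A sketch of calling it, continuing the earlier snippet (note it returns a plain NumPy array rather than a matrix):

weights = stocGradAscent0(array(dataMat), labelMat)
print(weights)    # coefficients after a single pass over the data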



(4) Improved stochastic gradient ascent

# Improved stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    dataMatrix = array(dataMatrix)
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01    # adjust alpha at every step
            randIndex = int(random.uniform(0, len(dataIndex)))  # randomly select a sample
            h = sigmoid(sum(dataMatrix[dataIndex[randIndex]] * weights))
            error = classLabels[dataIndex[randIndex]] - h
            weights = weights + alpha * error * dataMatrix[dataIndex[randIndex]]
            del(dataIndex[randIndex])           # don't reuse this sample in this pass
    return weights



Note the three major improvements: <1> alpha is adjusted at every iteration, which damps the high-frequency fluctuations in the coefficients; <2> samples are selected at random for each update, which removes periodic fluctuations; <3> an iteration-count parameter (numIter) has been added.
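
A quick, hypothetical check of improvement <1>: the schedule 4/(1.0 + j + i) + 0.01 decays as j and i grow but never drops below the constant 0.01, so samples seen late in training still influence the coefficients:

for j in range(2):
    for i in range(3):
        print(j, i, 4 / (1.0 + j + i) + 0.01)
# alpha: 4.01, 2.01, 1.343..., then 2.01, 1.343..., 1.01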

3: Case study: predicting horse fatalities from colic

(1) Handling missing values in the data:

Missing feature values are replaced with 0. This works well with logistic regression, because a 0-valued feature contributes nothing to the weight update (weights = weights + alpha * error * dataMatrix[i]), so the corresponding coefficient simply keeps its previous value. A record whose class label is missing, however, can only be discarded.
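
A minimal sketch of why a 0-valued filler is safe here (the weights, sample, step size, and error below are hypothetical):

from numpy import array
weights = array([1.0, 2.0, 3.0])          # hypothetical current coefficients
sample = array([0.5, 0.0, 1.5])           # second feature missing, filled with 0
alpha, error = 0.01, 0.8                  # hypothetical step size and error
print(weights + alpha * error * sample)   # the middle weight stays exactly 2.0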

(2) Case code

# Case study: predict whether a horse with colic will survive
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt')
    frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):                 # 21 features per record
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(trainingSet, trainingLabels, 500)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print('the error rate of this test is: %f' % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print('after %d iterations the average error rate is: %f'
          % (numTests, errorSum / float(numTests)))
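
To run the whole experiment (a sketch assuming the book's horseColicTraining.txt and horseColicTest.txt are in the working directory):

multiTest()

Because training samples are chosen at random, the reported error rate varies slightly from run to run.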


4: Conclusion

The objective of logistic regression is to find the optimal fitting parameters of the non-linear sigmoid function, and the search can be carried out by an optimization algorithm. The most common choice is gradient ascent, which can in turn be simplified to stochastic gradient ascent.

Stochastic gradient ascent achieves results comparable to full gradient ascent while consuming far fewer computing resources. It is also an online algorithm: it can update the parameters as new data arrives, without re-reading the entire dataset for batch processing.

Note: 1: These notes come from the book Machine Learning in Action.

2: The logRegres.py file and the data for these notes can be downloaded here (http://download.csdn.net/detail/lu597203933/7735821).

Source: Small Village Chief, http://blog.csdn.net/lu597203933. You are welcome to reprint or share, but please credit the source of the article. (Sina Weibo: Small Village Chief Zack. Thank you!)
