Practical notes for machine learning 5 (Logistic regression)


1: Simple concept description

Given some data points, we can fit a straight line to them (the line is called the best-fit line), and this fitting process is called regression. Training the classifier amounts to finding the optimal fitting parameters, that is, the regression coefficients.

Classification with the sigmoid function: for logistic regression we want a function that can accept any input and predict the class. The sigmoid function does this, and it behaves like a smooth step function. The formula is as follows:

σ(z) = 1 / (1 + e^(-z))

where z = w0x0 + w1x1 + ... + wnxn; the w terms are the regression coefficients (one per feature) and the x terms are the input features.

To implement the logistic regression classifier, we multiply each feature by its regression coefficient, add all the products, and substitute the sum into the sigmoid function; this yields a value between 0 and 1. Any value above 0.5 is classified as class 1, and any value below 0.5 as class 0. Logistic regression can therefore also be seen as estimating a probability.
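As a minimal sketch of that classification step (not the book's code; the weights and the sample are made-up numbers, just for illustration):

import numpy as np

def sigmoid(z):
    # smooth, step-like function mapping any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# made-up regression coefficients and one made-up sample (x0 = 1.0 is the constant term)
weights = np.array([4.0, 0.5, -0.6])
sample = np.array([1.0, 7.0, 9.0])

prob = sigmoid(np.sum(sample * weights))   # z = w0*x0 + w1*x1 + w2*x2 = 2.1
label = 1 if prob > 0.5 else 0             # threshold the probability at 0.5
print(prob, label)                         # about 0.89, so class 1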

Gradient ascent method: the idea is that the best way to find the maximum of a function is to search along the direction of its gradient. Each step moves the weights a distance alpha along the gradient:

w := w + alpha * ∇f(w)

This update is iterated until a stop condition is reached, for example the iteration count hits a specified value or the algorithm comes within an acceptable error range.
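As a toy illustration of that update rule (this example is mine, not the book's: the function f(x) = -x^2 + 4x and the step size 0.1 are made up), gradient ascent on a one-variable function looks like this:

# gradient ascent on f(x) = -x^2 + 4x, whose maximum is at x = 2
# the gradient is f'(x) = -2x + 4
alpha = 0.1           # step size
x = 0.0               # starting point
for _ in range(100):  # fixed iteration count as the stop condition
    x = x + alpha * (-2 * x + 4)   # x := x + alpha * f'(x)
print(x)              # converges to 2.0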

2: Python code implementation

(1) Finding the optimal regression coefficients with gradient ascent

from numpy import *

# Load the data set: each sample gets x0 = 1.0 plus two features, and a class label
def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

# Compute the sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

# Gradient ascent algorithm: compute the regression coefficients
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)                  # convert to a numpy matrix
    labelMat = mat(classLabels).transpose()      # column vector of class labels
    m, n = shape(dataMatrix)
    alpha = 0.001                                # step size
    maxCycles = 500                              # number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)        # m x 1 vector of predictions
        error = labelMat - h                     # prediction error
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
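A quick usage sketch (assuming the book's testSet.txt is in the working directory):

dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)
print(weights)    # a 3x1 matrix of regression coefficients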


(2) Plotting the decision boundary

# Plot the data points and the decision boundary
def plotBestFit(wei):
    import matplotlib.pyplot as plt
    weights = wei.getA()                         # matrix -> plain numpy array
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):                           # split samples by class
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]   # the line where w0 + w1*x1 + w2*x2 = 0
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
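A minimal usage sketch (again assuming testSet.txt is present; gradAscent returns a numpy matrix, which plotBestFit converts internally with getA()):

dataArr, labelMat = loadDataSet()
plotBestFit(gradAscent(dataArr, labelMat))   # scatter the two classes and draw the boundary line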



(3) Stochastic gradient ascent

Gradient ascent is fine for a data set of around 100 samples, but with billions of samples and thousands of features its computational cost is too high. The improved method is stochastic gradient ascent, which updates the regression coefficients using only one sample point at a time and therefore consumes far fewer computing resources. It is also an online algorithm: it can update the coefficients incrementally as each new sample arrives, without re-reading the entire data set. Processing all the data at once, by contrast, is called batch processing.

# Stochastic gradient ascent: one sample per update
def stocGradAscent0(dataMatrix, classLabels):
    dataMatrix = array(dataMatrix)
    m, n = shape(dataMatrix)
    alpha = 0.1
    weights = ones(n)
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))   # scalar prediction for sample i
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights



(4) Improved stochastic gradient ascent

# Improved stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    dataMatrix = array(dataMatrix)
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))               # samples not yet used in this pass
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01     # adjust: step size shrinks each iteration
            randIndex = int(random.uniform(0, len(dataIndex)))   # randomly select a sample
            h = sigmoid(sum(dataMatrix[dataIndex[randIndex]] * weights))
            error = classLabels[dataIndex[randIndex]] - h
            weights = weights + alpha * error * dataMatrix[dataIndex[randIndex]]
            del dataIndex[randIndex]             # don't reuse the sample in this pass
    return weights



Note: three major improvements have been made: <1> alpha is adjusted on every iteration, which damps the fluctuations (in particular the high-frequency ones) in the coefficients; <2> the sample used for each update is chosen at random, which reduces periodic fluctuations; <3> the number of iterations is now a parameter (numIter).
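A tiny sketch of improvement <1> (the loop bounds are just illustrative): printing the first few values of alpha shows the step size shrinking as j and i grow, while the constant 0.01 keeps it from ever reaching zero:

for j in range(2):          # pass over the data set
    for i in range(3):      # sample counter within the pass
        alpha = 4 / (1.0 + j + i) + 0.01
        print(j, i, round(alpha, 3))
# prints 4.01, 2.01, 1.343 on the first pass, then 2.01, 1.343, 1.01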

3: Case study: predicting the mortality of horses with colic (hernia)

(1) Handling missing values in the data:

Missing feature values here are replaced with 0. This choice suits logistic regression: a 0-valued feature contributes nothing to the weight update, and sigmoid(0) = 0.5 predicts neither class, so the replacement introduces no bias (other common options include using the feature's mean value or predicting the missing value with another algorithm). However, if the class label itself is missing, the record can only be discarded.
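A minimal sketch of that update property (the numbers are made up):

import numpy as np

sample = np.array([1.0, 0.0, 9.0])   # second feature was missing, replaced with 0
error = 0.3                          # some made-up prediction error
update = 0.01 * error * sample       # alpha * error * sample, as in stocGradAscent1
print(update)                        # [0.003 0. 0.027] -- the missing feature's weight is untouched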

(2) Case code

# Classify a single input vector with the trained weights
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

# Train on the horse colic training set and measure the error rate on the test set
def colicTest():
    frTrain = open('horseColicTraining.txt')
    frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):                      # 21 features per sample
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(trainingSet, trainingLabels, 500)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print('the error rate of this test is: %f' % errorRate)
    return errorRate

# Repeat the test and report the average error rate
def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print('after %d iterations the average error rate is: %f' % (numTests, errorSum / float(numTests)))
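Usage sketch (assuming horseColicTraining.txt and horseColicTest.txt from the book's download are in the working directory; the book reports an average error rate of roughly 35%, which is reasonable given that about 30% of the feature values are missing):

multiTest()   # runs colicTest() ten times and prints the average error rate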


4: Conclusion

The objective of logistic regression is to find the best-fit parameters, the regression coefficients, of the nonlinear sigmoid function. This search can be carried out by an optimization algorithm; the most commonly used one is gradient ascent, which in turn can be simplified to stochastic gradient ascent.

Stochastic gradient ascent achieves results comparable to full gradient ascent while consuming far fewer computing resources. In addition, it is an online algorithm: it updates the coefficients incrementally as new data arrives, without re-reading the entire data set for batch processing.

Note: 1: These notes come from the book Machine Learning in Action.

2: The logRegres.py file and the data for these notes can be downloaded here (http://download.csdn.net/detail/lu597203933/7735821).

Source: Small village chief, http://blog.csdn.net/lu597203933. You are welcome to reprint or share, but please be sure to credit the source. (Sina Weibo: small village chief Zack. Thank you!)
