Machine Learning Algorithms and Python Practice (7): Logistic Regression
zouxy09@qq.com
http://blog.csdn.net/zouxy09
This series, Machine Learning Algorithms and Python Practice, is mainly based on the book "Machine Learning in Action". Because I wanted to learn Python and get a better grasp of some common machine learning algorithms, I decided to implement several of them in Python. This book happens to have exactly that goal, so I studied it and followed along with its material.
This section covers Logistic Regression, which is another "orthodox" machine learning algorithm. What do I mean by orthodox? In my mind, a machine learning algorithm generally goes like this:
1) For a given problem, we describe it in mathematical language and build a model, for example a regression model or a classification model, to represent the problem;
2) Using maximum likelihood, maximum a posteriori probability, or minimum classification error, we define a cost function for the model. This turns the task into an optimization problem, whose solution is the set of model parameters that best fits our data;
3) Then we need to solve this cost function and find the optimal solution. Solving it falls into several cases:
A) The optimization problem has a closed-form (analytical) solution. For example, an extremum occurs where the derivative of the cost function is zero, so we look for the point where the derivative equals 0, which gives the maximum or minimum. If the cost function is easy to differentiate and setting the derivative to 0 yields an equation that can be solved analytically, we obtain the optimal parameters directly. For instance, minimizing J(θ) = (θ − 3)² by setting dJ/dθ = 2(θ − 3) = 0 gives θ = 3 in one step.
B) The equation is hard to solve, for example because the function contains hidden variables or the variables are coupled and depend on each other, or because setting the derivative to 0 gives no analytical solution (say, there are more unknown parameters than known equations). In that case we need an iterative algorithm that approaches the optimal solution step by step. Iteration is a magical thing. It keeps the grand goal in mind (finding the optimal solution, like climbing to the top of a mountain), then sets itself a short-term goal for each step (getting a little closer to the grand goal), and plods along like a snail, with only one belief behind each crawl: as long as I climb a bit higher with every step, the accumulation of steps will eventually carry me to the peak, where I can enjoy the view from the summit.
In addition, if the cost function is convex, there is a single global optimal solution: there is only one mountain in sight, it is destined for you, and it is the only one you need to find. But if it is not convex, there are many local optima: mountain after mountain, and with our limited vision we cannot tell which one is the highest. We may be fooled by fate, get stuck in a local optimum, sit down and believe what we found is the best, never suspecting that there are mountains beyond mountains and people beyond people, and that the light is quietly blooming somewhere far away and unknown. Then again, fate may favor you and lead you to the best destination anyway. There are also stubborn souls who refuse to bow to fate, swear to find the very best, and never stop or compromise, unless one day they are too tired and fall, keeping the last breath for the rest of the journey. How tragic... Haha.
Er, I'm not sure whether I have said anything wrong above; if so, please point it out. Now let's get to the topic. As mentioned above, logistic regression follows exactly this process: facing a regression or classification problem, build a cost function, obtain the optimal model parameters iteratively with an optimization method, and then test and verify whether the model we learned is any good.
I. Logistic Regression
Logistic regression is a machine learning method commonly used in industry to estimate the likelihood of something. I first saw it in the classic book "The Beauty of Mathematics", applied to ad click prediction: estimate how likely each ad is to be clicked by a user, place the ad most likely to be clicked where the user will see it, and then wait for the "click me" moment. Once the user clicks, the money comes in. That is why our screens are flooded with ads.
Similar examples include the likelihood that a user buys a certain item, or that a patient has a certain disease. This world is random (apart from man-made deterministic systems, and even those may have noise or errors, but the chance of such errors is so small it can be ignored), so the occurrence of anything can be expressed as a possibility, a probability, or odds. Here "odds" means the ratio of the probability that something happens to the probability that it does not.
Logistic regression can be used for regression or for classification, mainly binary classification. Remember the SVM from the previous sections? It is a binary classifier: given samples of two different classes, its idea is to find the separating hyperplane that best distinguishes them. But when you give it a new sample, it gives you only one answer: your sample is positive or it is negative. For example, if you ask the SVM whether a girl likes you, it will only answer yes or no. That is too blunt for us; an answer with neither hope nor despair attached is not good for one's physical and mental health. If instead it could tell me that she likes you very much, likes you a little, doesn't really like you, or you shouldn't even think about it, or tell you she has a 49% chance of liking you, that is always gentler than a flat "she doesn't like you". It also gives you extra information: how much hope there is of winning her over, and how much harder you still need to work. Logistic regression is this gentle: it gives us the probability that your sample belongs to the positive class.
Now for a little math. (For more depth, see the references.) Suppose our sample is {x, y}, where y is 0 or 1, indicating the negative or positive class, and x is our m-dimensional sample feature vector. Then the probability that this sample x belongs to the positive class, i.e. that y = 1, can be expressed by the following logistic function:

P(y = 1 | x; θ) = σ(θᵀx) = 1 / (1 + exp(−θᵀx))
Here θ is the model parameter, i.e. the regression coefficients, and σ is the sigmoid function. In fact, this function comes from assuming that the log-odds of x belonging to the positive class versus the negative class is a linear function of x:

log [ P(y = 1 | x) / P(y = 0 | x) ] = θᵀx
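If you solve that log-odds equation for P(y = 1 | x), you get back exactly the logistic function above; it is just a bit of algebra:

P(y = 1 | x) / (1 − P(y = 1 | x)) = exp(θᵀx)
⟹ P(y = 1 | x) = exp(θᵀx) / (1 + exp(θᵀx)) = 1 / (1 + exp(−θᵀx)) = σ(θᵀx)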
In other words, y is the variable we care about, for example whether she likes you or not, and it depends on multiple independent variables (factors): what your personality is like, whether your ride has two wheels or four, whether you look better than Pan An, whether you could out-fight Brother Sharp, whether you own a thousand-foot mansion or a three-foot hut, and so on. We denote these factors as x1, x2, ..., xm. How does the girl weigh all these factors? The quickest way is to add up the scores of the factors; the larger the sum, the more she likes you. But everyone has their own standards, and what each person values is different. For example, this girl cares more about your personality, so the weight on personality is 0.6; she doesn't care whether you are rich or poor and is happy to struggle alongside you, so the weight on money is 0.001. The weights we attach to x1, x2, ..., xm are called the regression coefficients, written θ1, θ2, ..., θm. Their weighted sum is your total score. Pick your favorite, If You Are the One! Haha.
So the logistic regression above is a linear classification model. It differs from linear regression in that it compresses the linear regression output, which can range from negative infinity to positive infinity, into the interval between 0 and 1, so that the output can be interpreted as a "possibility" and is easier for people to accept. Squashing the output into this range can also dampen the influence of extreme values or outliers (I'm not sure I understand this point correctly). To gain this nice property, you only need to do one trivial thing: apply a logistic function to the output. Also, for binary classification we can simply decide that if the probability of sample x being positive is greater than 0.5, we classify it as positive, otherwise as negative. Incidentally, SVM-style class probabilities are obtained from the sample's distance to the decision boundary, and that conversion job is essentially done by a logistic regression.
So logistic regression is nothing more than linear regression normalized by a logistic (sigmoid) function.
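To make this concrete, here is a minimal NumPy sketch (my own toy example; the feature meanings and weight values are made up for illustration) that computes a linear score θᵀx, squashes it with the sigmoid, and thresholds it at 0.5:

import numpy as np

def sigmoid(z):
    # squash any real-valued score into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights: bias, personality, money
theta = np.array([-1.0, 0.6, 0.001])
# hypothetical sample: x0 = 1 (bias term), personality score, money score
x = np.array([1.0, 3.0, 500.0])

score = np.dot(theta, x)           # linear part, can be any real number
prob = sigmoid(score)              # "possibility" of the positive class
label = 1 if prob > 0.5 else 0     # simple 0.5 threshold for binary classification

print('score = %.3f, P(y=1|x) = %.3f, predicted class = %d' % (score, prob, label))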
Okay, enough gossip about LR. Back to the orthodox machine learning framework: the model has been chosen, but the model parameters θ are still unknown, and we need to solve for them by training on the data we have collected. The next step is to build the cost function.
The most basic way to learn a logistic regression model is maximum likelihood. What is maximum likelihood? See my other blog post, "From Maximum Likelihood to the EM Algorithm".
Suppose we have n independent training samples {(x1, y1), (x2, y2), ..., (xn, yn)}, with y ∈ {0, 1}. Then the probability of observing each sample (xi, yi) is:

P(yi, xi) = P(yi = 1 | xi)^yi · (1 − P(yi = 1 | xi))^(1 − yi)
Why does this hold? When y = 1, the second factor equals 1 and disappears, leaving only the probability that x belongs to class 1. When y = 0, the first factor disappears, leaving only the probability that x belongs to class 0 (which is 1 minus the probability that it belongs to class 1). So whether y is 0 or 1, the expression above equals the probability of observing (x, y). The likelihood of the whole sample set, i.e. of the n independent samples, is then (since the samples are independent, the probability of seeing all n of them is the product of their individual probabilities):

L(θ) = ∏_{i=1}^{n} P(yi = 1 | xi)^yi · (1 − P(yi = 1 | xi))^(1 − yi)
Maximum likelihood means finding the parameter value that maximizes this likelihood function under the model; call it θ*. This likelihood is our cost function (to be maximized).
OK, the cost function is in place. The next step is to optimize it. Let's first try differentiating the cost function above and see whether setting the derivative to 0 can be solved, i.e. whether an analytical solution exists. If it does, everyone is happy: one step and we are done. If not, we need iteration, which costs time and effort.
Let's first massage L(θ): take the natural logarithm and then simplify (don't be scared by a pile of formulas; it is very simple, just be patient, push through the derivation, and you will see. Note: when xi appears, it denotes the i-th sample). I trust your eyes are sharp:

log L(θ) = Σ_{i=1}^{n} [ yi · log σ(θᵀxi) + (1 − yi) · log(1 − σ(θᵀxi)) ]
Now differentiate log L(θ) with respect to θ, and we obtain:

∂ log L(θ) / ∂θ = Σ_{i=1}^{n} (yi − σ(θᵀxi)) · xi
Then we set the derivative to 0, and you will be disappointed to find that it cannot be solved analytically. Try it yourself if you don't believe me. So there is no shortcut; we can only turn to the loftier route of iteration. Here we use the classic gradient descent algorithm.
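As a sanity check on the two formulas above, here is a minimal NumPy sketch (my own, using a tiny made-up data matrix) that evaluates the log-likelihood and its gradient at a given θ:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # log L(theta) = sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ]
    h = sigmoid(X.dot(theta))
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # d log L / d theta = sum_i (y_i - h_i) * x_i  =  X^T (y - h)
    h = sigmoid(X.dot(theta))
    return X.T.dot(y - h)

# made-up data: 4 samples, 3 features (first column is the constant 1 for the bias)
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -1.3, 0.4],
              [1.0, 2.1, -0.7],
              [1.0, 0.2, 0.2]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.zeros(3)

print('log-likelihood at theta=0: %.4f' % log_likelihood(theta, X, y))
print('gradient at theta=0: %s' % str(gradient(theta, X, y)))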
II. Optimization
2.1 Gradient descent
Gradient descent, also known as steepest descent, uses first-order gradient information to find a local optimum of a function. It is also the simplest and most commonly used optimization method in machine learning. The idea is very simple: as I said in the opening, to find the minimum I only need every step to go downhill (i.e. every step decreases the cost function), and eventually I will walk down to the minimum, as shown in the figure below:
But I also want to reach the minimum faster. What to do? I need each step to go downhill in the fastest direction, i.e. each step should get closer to the minimum than any other choice would. And this fastest downhill direction is the negative direction of the gradient.
For logistic regression, the gradient-based update rule works out to the following (we move along the gradient of the log-likelihood, which is the same as gradient descent on the negative log-likelihood):

θ := θ + α · Σ_{i=1}^{n} (yi − σ(θᵀxi)) · xi
The parameter α is called the learning rate: how far each step moves. This parameter is quite critical. Set it too large and you easily overshoot the optimal value because your stride is too big. For example, you want to go from Guangzhou to Shanghai, but each stride covers the distance from Guangzhou to Beijing; without knowing how far to go, how could a stride that huge land you on target except by luck? Everything has two sides, though: the benefit is that you close in quickly when you start far from the optimum, but near the optimum such big steps are useless. Set α too small and convergence becomes painfully slow; like a snail, you will indeed crawl into the optimum, but only after who knows how many years, and we don't have that kind of patience. So there is a common improvement to the learning rate: at the beginning of the iterations the learning rate is large, and as we approach the optimum it gradually shrinks, getting the best of both worlds. This improvement is used in Section 2.3.
The pseudo-code of the gradient descent algorithm is as follows (a runnable NumPy sketch follows the pseudo-code):
################################################
Initialize all regression coefficients to 1
Repeat until convergence {
    Compute the gradient over the entire dataset
    Update the regression coefficients by alpha * gradient
}
Return the regression coefficients
################################################
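Here is a minimal NumPy sketch of that batch gradient loop (my own standalone toy version, separate from the full implementation in Section 3; the data and settings are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_train(X, y, alpha=0.01, max_iter=500):
    # X: (n_samples, n_features) with a leading column of 1s; y: (n_samples,) of 0/1 labels
    n_samples, n_features = X.shape
    theta = np.ones(n_features)              # initialize all coefficients to 1
    for _ in range(max_iter):
        error = y - sigmoid(X.dot(theta))    # uses the whole dataset each iteration
        theta = theta + alpha * X.T.dot(error)
    return theta

# tiny made-up dataset: the class depends mostly on the second feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, -1.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
theta = batch_gradient_train(X, y)
print('learned coefficients: %s' % str(theta))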
2.2 Stochastic gradient descent
Each update of the regression coefficients in the gradient descent algorithm above has to traverse the entire dataset (it computes the error over the whole dataset). That is acceptable for small datasets, but when there are billions of samples and thousands of features the computational cost is too high. The improved method updates the regression coefficients using only one sample (one sample's error) at a time; this is called stochastic gradient descent. Because it can update the classifier incrementally whenever a new sample arrives (suppose we have already trained a classifier h on dataset A, and a new sample x comes in; a non-incremental algorithm has to merge x with A into a new dataset B and retrain a classifier from scratch, whereas an incremental algorithm only needs the new sample x to update the parameters of the existing classifier h), it is an online learning algorithm. In contrast to online learning, processing the entire dataset at once is called "batch" learning.
The pseudo-code of the stochastic gradient descent algorithm is as follows (again, a runnable sketch follows the pseudo-code):
################################################
Initialize all regression coefficients to 1
Repeat until convergence {
    For each sample in the dataset
        Compute the gradient for that sample
        Update the regression coefficients by alpha * gradient
}
Return the regression coefficients
################################################
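A minimal NumPy sketch of that loop (same made-up toy data and sigmoid helper as before), where the coefficients are nudged once per sample instead of once per pass over the whole dataset:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_train(X, y, alpha=0.01, max_iter=200):
    n_samples, n_features = X.shape
    theta = np.ones(n_features)
    for _ in range(max_iter):
        for i in range(n_samples):
            # update with the error of a single sample, not the whole dataset
            error = y[i] - sigmoid(X[i].dot(theta))
            theta = theta + alpha * error * X[i]
    return theta

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, -1.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
print('learned coefficients: %s' % str(stochastic_gradient_train(X, y)))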
2.3 Improved stochastic gradient descent
The main criteria for judging an optimization algorithm are whether it converges, i.e. whether the parameters reach stable values or keep changing, and how fast it converges.
The figure shows how the three regression coefficients evolve when the stochastic gradient descent algorithm above runs for 200 iterations (you may want to read Sections 3 and 4 first and then come back here; our dataset has 100 two-dimensional samples, and the coefficients are adjusted once per sample, so there are 200 × 100 = 20000 adjustments in total). Coefficient X2 reaches a stable value after about 50 iterations, while X1 and X0 only stabilize after about 100 iterations. Annoyingly, X1 and X2 still show naughty periodic fluctuations even after many iterations, and the heart just cannot settle. The reason is that some sample points can never be classified correctly: our dataset is not linearly separable, while logistic regression is a linear classification model and cannot handle the non-separable part. Our optimization routine, however, has no idea those points are troublemakers and treats them like everyone else, adjusting the coefficients to reduce their classification error, which makes the coefficients swing heavily in every iteration. What we want is an algorithm that avoids this back-and-forth fluctuation and quickly converges to a stable value.
We make two improvements to the stochastic gradient descent algorithm to avoid the fluctuation described above:
1) Adjust the update step alpha at each iteration. As the iterations proceed, alpha becomes smaller and smaller, which damps the high-frequency fluctuation of the coefficients (i.e. the coefficients changing too much and jumping too far between iterations). Of course, to keep alpha from shrinking to nearly 0 as the iterations go on (at which point the coefficients would barely be adjusted and iterating would become pointless), we constrain alpha to stay above a small constant term. See the code for details.
2) Change the order in which samples are visited at each iteration, i.e. pick a random sample each time to update the regression coefficients. This reduces the periodic fluctuation, because the changing sample order keeps the iterations from being cyclical.
The pseudo-code of the improved stochastic gradient descent algorithm is as follows (a sketch of the two tweaks follows the pseudo-code):
################################################
Initialize all regression coefficients to 1
Repeat until convergence {
    For each sample in a random traversal of the dataset
        Decrease alpha as the iterations proceed
        Compute the gradient for that sample
        Update the regression coefficients by alpha * gradient
}
Return the regression coefficients
################################################
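A minimal NumPy sketch of the two tweaks (the decay schedule 4.0 / (1.0 + k + i) + 0.01 mirrors the one used in the full implementation in Section 3; the data is the same toy set as before):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smooth_stochastic_train(X, y, max_iter=20):
    n_samples, n_features = X.shape
    theta = np.ones(n_features)
    for k in range(max_iter):
        indices = list(range(n_samples))         # samples not yet used in this pass
        for i in range(n_samples):
            # 1) alpha decays with the iteration counters but never drops below 0.01
            alpha = 4.0 / (1.0 + k + i) + 0.01
            # 2) pick a random remaining sample instead of a fixed order
            rand_pos = int(np.random.uniform(0, len(indices)))
            idx = indices[rand_pos]
            error = y[idx] - sigmoid(X[idx].dot(theta))
            theta = theta + alpha * error * X[idx]
            del indices[rand_pos]                # don't reuse this sample in the current pass
    return theta

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, -1.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
print('learned coefficients: %s' % str(smooth_stochastic_train(X, y)))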
Comparing the original stochastic gradient descent with the improved version, we can see two differences:
1) The coefficients no longer fluctuate periodically. 2) The coefficients stabilize much faster, i.e. convergence is faster: they converge after only 20 iterations, whereas the plain stochastic gradient descent above needs 200 iterations to settle.
III. Python implementation
I use Python 2.7.5, with the NumPy and Matplotlib libraries; see my earlier blog post for installation and configuration details. The code contains detailed comments. I don't know whether there are any mistakes; if so, I hope you will point them out (note that the results may differ from run to run). I also wrote a function to visualize the result, but it only works on two-dimensional data. Here is the code:
logRegression.py
#################################################
# logRegression: Logistic Regression
# Author : zouxy
# Date : 2014-03-02
# HomePage : http://blog.csdn.net/zouxy09
# Email : zouxy09@qq.com
#################################################

from numpy import *
import matplotlib.pyplot as plt
import time


# calculate the sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))


# train a logistic regression model using some optional optimize algorithm
# input: train_x is a mat datatype, each row stands for one sample
#        train_y is mat datatype too, each row is the corresponding label
#        opts is optimize option include step and maximum number of iterations
def trainLogRegres(train_x, train_y, opts):
    # calculate training time
    startTime = time.time()

    numSamples, numFeatures = shape(train_x)
    alpha = opts['alpha']; maxIter = opts['maxIter']
    weights = ones((numFeatures, 1))

    # optimize through gradient descent algorithm
    for k in range(maxIter):
        if opts['optimizeType'] == 'gradDescent':  # gradient descent algorithm
            output = sigmoid(train_x * weights)
            error = train_y - output
            weights = weights + alpha * train_x.transpose() * error
        elif opts['optimizeType'] == 'stocGradDescent':  # stochastic gradient descent
            for i in range(numSamples):
                output = sigmoid(train_x[i, :] * weights)
                error = train_y[i, 0] - output
                weights = weights + alpha * train_x[i, :].transpose() * error
        elif opts['optimizeType'] == 'smoothStocGradDescent':  # smooth stochastic gradient descent
            # randomly select samples to optimize for reducing cycle fluctuations
            dataIndex = range(numSamples)
            for i in range(numSamples):
                alpha = 4.0 / (1.0 + k + i) + 0.01
                randIndex = int(random.uniform(0, len(dataIndex)))
                output = sigmoid(train_x[randIndex, :] * weights)
                error = train_y[randIndex, 0] - output
                weights = weights + alpha * train_x[randIndex, :].transpose() * error
                del(dataIndex[randIndex])  # during one iteration, delete the optimized sample
        else:
            raise NameError('Not support optimize method type!')

    print 'Congratulations, training complete! Took %fs!' % (time.time() - startTime)
    return weights


# test your trained Logistic Regression model given test set
def testLogRegres(weights, test_x, test_y):
    numSamples, numFeatures = shape(test_x)
    matchCount = 0
    for i in xrange(numSamples):
        predict = sigmoid(test_x[i, :] * weights)[0, 0] > 0.5
        if predict == bool(test_y[i, 0]):
            matchCount += 1
    accuracy = float(matchCount) / numSamples
    return accuracy


# show your trained logistic regression model only available with 2-D data
def showLogRegres(weights, train_x, train_y):
    # notice: train_x and train_y is mat datatype
    numSamples, numFeatures = shape(train_x)
    if numFeatures != 3:
        print "Sorry! I can not draw because the dimension of your data is not 2!"
        return 1

    # draw all samples
    for i in xrange(numSamples):
        if int(train_y[i, 0]) == 0:
            plt.plot(train_x[i, 1], train_x[i, 2], 'or')
        elif int(train_y[i, 0]) == 1:
            plt.plot(train_x[i, 1], train_x[i, 2], 'ob')

    # draw the classify line
    min_x = min(train_x[:, 1])[0, 0]
    max_x = max(train_x[:, 1])[0, 0]
    weights = weights.getA()  # convert mat to array
    y_min_x = float(-weights[0] - weights[1] * min_x) / weights[2]
    y_max_x = float(-weights[0] - weights[1] * max_x) / weights[2]
    plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
IV. Test results
Test code:
test_logRegression.py
#################################################
# logRegression: Logistic Regression
# Author : zouxy
# Date : 2014-03-02
# HomePage : http://blog.csdn.net/zouxy09
# Email : zouxy09@qq.com
#################################################

from numpy import *
import matplotlib.pyplot as plt
import time
from logRegression import *  # trainLogRegres, testLogRegres, showLogRegres from logRegression.py above


def loadData():
    train_x = []
    train_y = []
    fileIn = open('E:/Python/Machine Learning in Action/testSet.txt')
    for line in fileIn.readlines():
        lineArr = line.strip().split()
        train_x.append([1.0, float(lineArr[0]), float(lineArr[1])])
        train_y.append(float(lineArr[2]))
    return mat(train_x), mat(train_y).transpose()


## step 1: load data
print "step 1: load data..."
train_x, train_y = loadData()
test_x = train_x; test_y = train_y

## step 2: training...
print "step 2: training..."
opts = {'alpha': 0.01, 'maxIter': 20, 'optimizeType': 'smoothStocGradDescent'}
optimalWeights = trainLogRegres(train_x, train_y, opts)

## step 3: testing
print "step 3: testing..."
accuracy = testLogRegres(optimalWeights, test_x, test_y)

## step 4: show the result
print "step 4: show the result..."
print 'The classify accuracy is: %.3f%%' % (accuracy * 100)
showLogRegres(optimalWeights, train_x, train_y)
The test data is two-dimensional, with 100 samples in two classes, as follows:
testSet.txt
-0.017612	14.053064	0
-1.395634	4.662541	1
-0.752157	6.538620	0
-1.322371	7.152853	0
0.423363	11.054677	0
0.406704	7.067335	1
0.667394	12.741452	0
-2.460150	6.866805	1
0.569411	9.548755	0
-0.026632	10.427743	0
... (the remaining 90 samples follow in the same whitespace-separated "x1 x2 label" format)
Training results:
(A) Gradient descent, 500 iterations. (B) Stochastic gradient descent, 200 iterations. (C) Improved stochastic gradient descent, 20 iterations. (D) Improved stochastic gradient descent, 200 iterations.
V. References
[1] Logistic Regression Overview
[2] Logistic Regression Basics
[3] Logistic Regression