Start machine learning with Python (7): Logistic regression classification

Original: http://blog.csdn.net/lsldd/article/details/41551797

In an earlier installment of this series, Start machine learning with Python (3): Data fitting and generalized linear regression, we used regression algorithms for numerical prediction. The logistic regression algorithm is essentially still regression, but it introduces a logistic function that turns the regression output into a classification. In practice, logistic regression also performs very well on text categorization, so let's explore it here.

1. The logistic function

Suppose the dataset has n independent features, and x1 through xn are the n feature values of a sample. A conventional regression algorithm fits a linear function that minimizes the error between the predicted value and the true value:

    f(x) = c0 + c1*x1 + c2*x2 + ... + cn*xn

We would like f(x) to support a clean logical decision. Ideally, it should directly express the probability that a sample with features x belongs to a given class: for example, f(x) > 0.5 would assign x to the positive class and f(x) < 0.5 to the negative class. We also want f(x) to always lie between 0 and 1. Is there such a function?

Enter the sigmoid function, defined as follows:

    p(z) = 1 / (1 + e^(-z))

To get an intuitive feel, the graph of the sigmoid function is shown below (from http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html):

The sigmoid function has all the graceful properties we need: it is defined over the whole real line, its range lies in (0, 1), and its value at 0 is exactly 0.5.
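These properties are easy to verify numerically. A minimal sketch (the function name `sigmoid` is ours, not from the article's code):

```python
import numpy as np

def sigmoid(z):
    """The logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Defined over all reals, output strictly inside (0, 1), and sigmoid(0) = 0.5
z = np.linspace(-10, 10, 1001)
p = sigmoid(z)
print(p.min() > 0.0 and p.max() < 1.0)  # True
print(sigmoid(0.0))                      # 0.5
```

The function is also strictly increasing, which is what lets us later treat its output as a probability that grows with the linear score.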

So how do we turn f(x) into a sigmoid? Let p(x) denote the probability that a sample with features x belongs to class 1, and define p(x) / [1 - p(x)] as the odds ratio. Taking its logarithm and setting it equal to the linear model gives:

    ln( p(x) / (1 - p(x)) ) = c0 + c1*x1 + ... + cn*xn

Solving this equation for p(x) is straightforward:

    p(x) = 1 / (1 + e^-(c0 + c1*x1 + ... + cn*xn))

Now we have the sigmoid function we need. Next, just as in ordinary linear regression, the remaining task is to fit the parameters c0 through cn in the formula.
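The two forms above are exact inverses of each other, which a few lines of NumPy can confirm (this check is ours, not part of the article's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Take arbitrary linear scores z = c0 + c1*x1 + ... + cn*xn, map them through
# the sigmoid, then take the log-odds of the result: we should recover z.
z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])
p = sigmoid(z)
log_odds = np.log(p / (1.0 - p))
print(np.allclose(log_odds, z))  # True
```

In other words, modeling the log-odds as a linear function of the features is exactly the same as modeling the class probability with a sigmoid.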

2. Test data

For test data we again choose the 2M movie review dataset from the Cornell University website.

On this dataset we have already tested the kNN and naive Bayes classifiers. Now let's see how the logistic regression classifier handles this kind of sentiment classification.

As before, we read the saved movie_data.npy and movie_target.npy directly to save time.
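For readers who skipped the earlier installments, the .npy cache is just a NumPy save/load round trip. A minimal sketch (the filenames and toy data here are illustrative, not the real corpus):

```python
import numpy as np

# Cache arrays to disk once, then reload them on later runs to skip
# the expensive corpus-parsing step.
data = np.array(['a great movie', 'a terrible movie'], dtype=object)
target = np.array([1, 0])

np.save('demo_data.npy', data)
np.save('demo_target.npy', target)

# Object arrays (strings) need allow_pickle=True in recent NumPy versions.
loaded_data = np.load('demo_data.npy', allow_pickle=True)
loaded_target = np.load('demo_target.npy')
print(list(loaded_data) == list(data))  # True
print((loaded_target == target).all())  # True
```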

3. Code and Analysis

The code for logistic regression is as follows:

```python
# -*- coding: utf-8 -*-
import time
import scipy as sp
import numpy as np
from matplotlib import pylab
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

start_time = time.time()

# Plot the P/R curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()

# Load the cached dataset
movie_data = sp.load('movie_data.npy')
movie_target = sp.load('movie_target.npy')

# Vector space model with TF-IDF weighting; note that the test samples
# must go through the transform interface only (never fit_transform)
count_vec = TfidfVectorizer(binary=False, decode_error='ignore',
                            stop_words='english')
average = 0
test_num = 10
for i in range(0, test_num):
    # Split the dataset: 80% training, 20% testing
    x_train, x_test, y_train, y_test \
        = train_test_split(movie_data, movie_target, test_size=0.2)
    x_train = count_vec.fit_transform(x_train)
    x_test = count_vec.transform(x_test)

    # Train the LR classifier
    clf = LogisticRegression()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    p = np.mean(y_pred == y_test)
    print(p)
    average += p

# Precision and recall (computed on the last split)
answer = clf.predict_proba(x_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
print("average precision:", average / test_num)
print("time spent:", time.time() - start_time)
plot_pr(0.5, precision, recall, "pos")
```

Running the code produces output like the following:

0.8
0.817857142857
0.775
0.825
0.807142857143
0.789285714286
0.839285714286
0.846428571429
0.764285714286
0.771428571429
             precision  recall  f1-score  support
        neg       0.74    0.80      0.77      132
        pos       0.81    0.74      0.77      148
avg / total       0.77    0.77      0.77      280

Average precision: 0.803571428571
Time spent: 9.651551961898804

First, we ran 10 consecutive rounds on different test splits and averaged the accuracy. Another good evaluation method is K-fold cross-validation, which assesses a classifier's performance more reliably and lets us examine its sensitivity to noise.
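The K-fold alternative can be sketched in a few lines. In recent scikit-learn versions the helpers live in `sklearn.model_selection` rather than the older `sklearn.cross_validation` used above; the synthetic dataset here is only a stand-in for the movie-review features:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Synthetic stand-in for the vectorized movie reviews (illustrative only)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 10-fold cross-validation: every sample serves as test data exactly once,
# so one score per fold comes back and their spread reveals noise sensitivity.
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
print(scores.mean())
```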

Next, look at the final figure: the P/R curve (precision/recall) drawn with precision_recall_curve. Studying the P/R plot gives us a deeper understanding of logistic regression.

As we said earlier, we usually use 0.5 as the boundary between the two classes. With P/R analysis, however, the threshold can be chosen more flexibly and to better effect.

As you can see, choosing a threshold that is too low sends more test samples into class 1: recall improves, but obviously at the cost of precision.

In this example, I might choose 0.42 as the dividing value, because both precision and recall are reasonably high there.
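Picking such a threshold can be done directly from the arrays that precision_recall_curve returns. A sketch with synthetic labels and scores (the data and the maximize-F1 criterion are our choices, not the article's):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic true labels and predicted probabilities (illustrative only)
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.35, 0.4, 0.8, 0.3, 0.9, 0.45, 0.5, 0.7, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Choose the threshold maximizing F1, one reasonable compromise between
# precision and recall. The final P/R point has no associated threshold,
# so precision and recall are one element longer than thresholds.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = thresholds[np.argmax(f1)]
print(best)
```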

Finally, some further resources:

Notes on Andrew Ng's Stanford public lecture on LR, by a talented blogger from Zhejiang University: http://blog.csdn.net/abcjennifer/article/details/7716281

Another good blog summarizing LR: http://xiamaogeng.blog.163.com/blog/static/1670023742013231197530/

The sigmoid function: http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html

