Start Machine Learning with Python (7: Logistic Regression Classification)

Source: Internet
Author: User

An earlier article in this series, Start Machine Learning with Python (3: Data Fitting and Generalized Linear Regression), covered regression algorithms for numerical prediction. The logistic regression algorithm is essentially regression as well, but it introduces a logistic function to turn the regression output into a classification. In practice, logistic regression also performs very well in the field of text classification. Let's take some time to find out how.

1. The logistic function

Suppose the dataset has n independent features, and let x1 to xn be the n feature values of a sample. The goal of a conventional regression algorithm is to fit a polynomial function f(x) that minimizes the error between the predicted value and the true value:

$$f(x) = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$
We would like such an f(x) to have a good property for classification judgments: ideally, it should directly express the probability that a sample with features x belongs to a given class. For example, f(x) > 0.5 would indicate that x belongs to the positive class, and f(x) < 0.5 that it belongs to the negative class. We also want f(x) to always lie within [0, 1]. Is there such a function?

Enter the sigmoid function, defined as follows:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Intuitively, the sigmoid has the S-shaped curve shown below (image from http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html):

[Figure: plot of the sigmoid function]
The sigmoid function has all the graceful properties we need: its domain is all real numbers, its range lies within (0, 1), and its value at 0 is exactly 0.5.
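To see these properties concretely, here is a quick check (a minimal sketch added for this write-up, not part of the original code):

import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)), computed elementwise
    return 1.0 / (1.0 + np.exp(-x))

# Values stay strictly between 0 and 1, and sigmoid(0) is exactly 0.5
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [  4.53978687e-05   5.00000000e-01   9.99954602e-01]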

So how do we turn f(x) into a sigmoid? Let P(x) be the probability that a sample with features x belongs to class 1; then P(x) / [1 - P(x)] is the odds (odds ratio) in favor of class 1. Taking the logarithm of the odds and equating it with the linear model gives:

$$\ln\frac{P(x)}{1 - P(x)} = c_0 + c_1 x_1 + \cdots + c_n x_n$$
Exponentiating both sides and solving for P(x) is straightforward:

$$P(x) = \frac{1}{1 + e^{-(c_0 + c_1 x_1 + \cdots + c_n x_n)}}$$
And there it is: P(x) has exactly the sigmoid form we wanted. All that remains, just as in ordinary linear regression, is to fit the n + 1 parameters c of the formula above.
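Before turning to the library, it may help to see what "fitting the parameters" means. Below is a minimal gradient-descent sketch of that fitting step, written purely for illustration: the toy data and learning rate are invented, and scikit-learn's LogisticRegression actually uses optimized solvers with regularization rather than this loop.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    # X: (m, n) feature matrix; y: (m,) labels in {0, 1}
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])  # prepend a 1s column for c0
    c = np.zeros(n + 1)                   # the n + 1 parameters c0..cn
    for _ in range(epochs):
        p = sigmoid(Xb.dot(c))            # predicted P(x) for every sample
        c -= lr * Xb.T.dot(p - y) / m     # gradient step on the log-loss
    return c

# Toy example: one feature, class 1 when the feature is large
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(fit_logistic(X, y))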

2. Test data

For test data, we again choose the 2M movie review dataset from Cornell University's website.

On this dataset we have already tested the kNN and naive Bayes classification algorithms. Now let's see how the logistic regression classifier handles the same sentiment classification problem.

As before, we read in the previously saved movie_data.npy and movie_target.npy directly, to save time.
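If you do not have these .npy files at hand, they can be produced from the raw dataset along these lines (a sketch only; the directory name txt_sentoken is an assumption based on how the Cornell polarity dataset usually unpacks, with one subfolder per class):

import numpy as np
from sklearn.datasets import load_files

# Assumed path; load_files treats the 'neg' and 'pos' subfolders as labels
reviews = load_files('txt_sentoken')
np.save('movie_data.npy', reviews.data)
np.save('movie_target.npy', reviews.target)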

3. Code and analysis

The code for the logistic regression classifier is as follows:

# -*- coding: utf-8 -*-
import time

import numpy as np
import scipy as sp
from matplotlib import pylab
# In newer scikit-learn these live in sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, classification_report

start_time = time.time()

# Plot the precision/recall curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()

# Read in the saved dataset
movie_data = sp.load('movie_data.npy')
movie_target = sp.load('movie_target.npy')

# Vector space model with TF-IDF weights; note that the test
# samples must invoke the transform interface, not fit_transform
count_vec = TfidfVectorizer(binary=False, decode_error='ignore',
                            stop_words='english')

average = 0
testNum = 10
for i in range(0, testNum):
    # Split the dataset: 80% training, 20% test
    x_train, x_test, y_train, y_test = train_test_split(
        movie_data, movie_target, test_size=0.2)
    x_train = count_vec.fit_transform(x_train)
    x_test = count_vec.transform(x_test)

    # Train the LR classifier
    clf = LogisticRegression()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    p = np.mean(y_pred == y_test)
    print(p)
    average += p

# Precision and recall on the last split
answer = clf.predict_proba(x_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
print("Average precision:", average / testNum)
print("Time spent:", time.time() - start_time)
plot_pr(0.5, precision, recall, "pos")
The results of the code run are as follows:

0.8
0.817857142857
0.775
0.825
0.807142857143
0.789285714286
0.839285714286
0.846428571429
0.764285714286
0.771428571429
             precision    recall  f1-score   support

        neg       0.74      0.80      0.77       132
        pos       0.81      0.74      0.77       148

avg / total       0.77      0.77      0.77       280

Average precision: 0.803571428571
Time spent: 9.651551961898804

First, we drew 10 random test splits and averaged the resulting accuracies. Another good evaluation method is k-fold cross-validation, which gives a more reliable assessment of the classifier's performance and of its sensitivity to noise; see the sketch below.
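A k-fold run with the same pipeline could look like this (the Pipeline wrapper is my addition so that TF-IDF is refit on each fold's training part; sklearn.cross_validation matches the old API used in this post, while newer versions use sklearn.model_selection):

from sklearn.cross_validation import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(binary=False, decode_error='ignore',
                              stop_words='english')),
    ('lr', LogisticRegression()),
])
# 10-fold cross-validation on the raw texts and labels
scores = cross_val_score(pipe, movie_data, movie_target, cv=10)
print(scores.mean(), scores.std())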

Second, look at the final figure: the P/R curve (precision/recall) drawn with precision_recall_curve. Studying the P/R plot gives us a further understanding of logistic regression.

As we said before, 0.5 is usually used as the threshold for dividing the two classes. Combined with a P/R analysis, the threshold can be chosen more flexibly and to better effect.

As the figure shows, if the chosen threshold is too low, more test samples are classified as positive. Recall therefore rises, but clearly at the cost of precision.

For example, in this case I might choose 0.42 as the dividing value, because both the precision and the recall at that point are high.
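The 0.42 here was read off the plot by eye. It could also be picked programmatically from the precision_recall_curve output, for example by maximizing the F1 score (one reasonable criterion among several):

import numpy as np

# precision and recall have one more element than thresholds, so trim them;
# the small epsilon avoids division by zero
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print("threshold=%.2f precision=%.2f recall=%.2f"
      % (thresholds[best], precision[best], recall[best]))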

Finally, some good resources:

Notes on Andrew Ng's Stanford public course on LR, from the blog of a standout student at Zhejiang University: http://blog.csdn.net/abcjennifer/article/details/7716281

Another good blog post summarizing LR: http://xiamaogeng.blog.163.com/blog/static/1670023742013231197530/

A detailed explanation of the sigmoid function: http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html

