From: http://blog.csdn.net/lsldd/article/details/41551797
Earlier in this series, Start Machine Learning with Python (3: Data Fitting and Generalized Linear Regression) covered regression algorithms for numerical prediction. The logistic regression algorithm is essentially still regression, but it introduces a logistic function to help it classify. In practice, logistic regression also turns out to be excellent in the field of text categorization. Let's explore it now.
1. The logistic function
Suppose the dataset has n independent features, with x1 through xn being the n feature values of a sample. The goal of a conventional regression algorithm is to fit a linear function that minimizes the error between the predicted value and the true value:

f(x) = c0 + c1·x1 + c2·x2 + … + cn·xn
We would like f(x) to support a clean logical judgment; ideally, it would directly express the probability that a sample with features x belongs to a certain class. For example, f(x) > 0.5 would indicate that x belongs to the positive class, and f(x) < 0.5 to the negative class. We also want f(x) to always lie within [0, 1]. Is there such a function?
Enter the sigmoid function, which is defined as follows:

σ(t) = 1 / (1 + e^(−t))
For an intuitive look, the graph of the sigmoid function is shown below (from http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html):
The sigmoid function has all the graceful properties we need: it is defined on the whole real line, its range lies within (0, 1), and its value at 0 is exactly 0.5.
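To make these properties concrete, here is a minimal sketch (an illustration added here, not part of the original post's code) that evaluates the sigmoid at a few points:

import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

# defined on the whole real line, values stay strictly inside (0, 1),
# and the value at 0 is exactly 0.5
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# -> approximately [4.54e-05, 0.2689, 0.5, 0.7311, 0.99995]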
So how do we connect f(x) to the sigmoid function? Let p(x) be the probability that a sample with features x belongs to class 1; the quantity p(x) / [1 − p(x)] is then defined as the odds ratio. Taking its logarithm and modeling it with the linear function above gives:

ln( p(x) / (1 − p(x)) ) = c0 + c1·x1 + … + cn·xn
Solving this equation for p(x) is straightforward:

p(x) = 1 / (1 + e^−(c0 + c1·x1 + … + cn·xn))
We have now arrived at the sigmoid function we wanted. Next, just as in ordinary linear regression, we can fit the n + 1 parameters c0 through cn in the formula.
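As a quick sanity check on this derivation, the following sketch (using a tiny made-up one-feature dataset, purely for illustration) fits scikit-learn's LogisticRegression and reproduces its predict_proba output by pushing the fitted linear score through the sigmoid by hand:

import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny made-up dataset: one feature, two classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# linear score z = c0 + c1*x1 + ... + cn*xn, then the sigmoid
z = clf.intercept_ + X.dot(clf.coef_.T)
p_manual = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(p_manual.ravel(), clf.predict_proba(X)[:, 1]))  # True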
2. Test data
For test data, we again choose the 2M movie review dataset from the Cornell University website.
On this dataset we have already tested the kNN classification algorithm and the naive Bayes classification algorithm. Now let's see how this regression-family classification algorithm performs on this kind of sentiment classification problem.
As before, we read the previously saved movie_data.npy and movie_target.npy directly, to save time. A sketch of how such files can be generated follows below.
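If you do not have those files yet, here is a minimal sketch of how they could be produced, assuming the corpus has been unpacked into a hypothetical movie_reviews/ directory with one subfolder per class, which is the layout load_files expects:

import numpy as np
from sklearn.datasets import load_files

# hypothetical layout: movie_reviews/neg/*.txt and movie_reviews/pos/*.txt
reviews = load_files('movie_reviews')
np.save('movie_data.npy', reviews.data)
np.save('movie_target.npy', reviews.target)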
3. Code and Analysis
The code for logistic regression is as follows:
# -*- coding: utf-8 -*-
import time

import numpy as np
from matplotlib import pylab
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report

start_time = time.time()

# Plot the P/R curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()

# Read the saved dataset
movie_data = np.load('movie_data.npy')
movie_target = np.load('movie_target.npy')
x = movie_data
y = movie_target

# Vector space model over TF-IDF features; note that the test
# samples must call the transform interface (not fit_transform)
count_vec = TfidfVectorizer(binary=False, decode_error='ignore',
                            stop_words='english')
average = 0
testNum = 10
for i in range(0, testNum):
    # Load the dataset and split it: 80% training, 20% testing
    x_train, x_test, y_train, y_test \
        = train_test_split(movie_data, movie_target, test_size=0.2)
    x_train = count_vec.fit_transform(x_train)
    x_test = count_vec.transform(x_test)

    # Train the LR classifier
    clf = LogisticRegression()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    p = np.mean(y_pred == y_test)
    print(p)
    average += p

# Precision and recall (computed on the last fold)
answer = clf.predict_proba(x_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
print("average precision:", average / testNum)
print("time spent:", time.time() - start_time)

plot_pr(0.5, precision, recall, "pos")
Running this code produces the following output:
0.8
0.817857142857
0.775
0.825
0.807142857143
0.789285714286
0.839285714286
0.846428571429
0.764285714286
0.771428571429
             precision    recall  f1-score   support

        neg       0.74      0.80      0.77       132
        pos       0.81      0.74      0.77       148

avg / total       0.77      0.77      0.77       280

average precision: 0.803571428571
time spent: 9.651551961898804
First, we ran 10 consecutive test rounds and then averaged the accuracy. Another good evaluation method is K-fold cross-validation, which gives a more reliable estimate of classifier performance and lets you examine the classifier's sensitivity to noise; a sketch follows below.
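For reference, a minimal K-fold sketch under the same assumptions as the code above (a Pipeline is used so that the vectorizer's vocabulary is refit on every training split; in newer scikit-learn versions the import comes from sklearn.model_selection):

from sklearn.cross_validation import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# chain the vectorizer and the classifier so that each fold only
# sees a vocabulary built from its own training split
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('lr', LogisticRegression())])
scores = cross_val_score(model, movie_data, movie_target, cv=10)
print(scores.mean(), scores.std())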
Next, look at the last figure, the P/R curve (precision/recall) drawn with precision_recall_curve. Together with the P/R plot, it gives us a deeper understanding of logistic regression.
As mentioned earlier, we usually take 0.5 as the threshold for separating the two classes. With the P/R analysis in hand, the threshold can be chosen more flexibly, and to better effect.
As the curve shows, choosing too low a threshold pushes more test samples into class 1, which raises recall but obviously sacrifices the corresponding precision.
In this example, I might choose 0.42 as the dividing threshold, since both precision and recall are high around that point.
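One way to pick such a threshold programmatically is sketched below (reusing precision, recall, thresholds, and answer from the code above; here the threshold that maximizes the F1 score stands in for an eyeballed 0.42):

import numpy as np

# precision and recall have one more entry than thresholds,
# so drop the last point before aligning them
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]

# reclassify the probabilities with the tuned threshold instead of 0.5
y_pred_tuned = answer > best_threshold
print(best_threshold, np.mean(y_pred_tuned == y_test))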
Finally, some good further resources:
The blog of a gifted woman at Zhejiang University! Notes on the Stanford LR open course: http://blog.csdn.net/abcjennifer/article/details/7716281
Another good blog post summarizing LR: http://xiamaogeng.blog.163.com/blog/static/1670023742013231197530/
sigmoid function: http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html