From: http://blog.csdn.net/lsldd/article/details/41551797
Earlier in this series, Start Machine Learning with Python (3: Data Fitting and Generalized Linear Regression) covered regression algorithms for numerical prediction. The logistic regression algorithm is essentially still regression, but it introduces a logistic function to help it classify. In practice, logistic regression also turns out to be excellent in the field of text categorization. Let's explore it now.
1. The logistic function
Suppose the dataset has n independent features, with x1 through xn being the n feature values of a sample. The goal of a conventional regression algorithm is to fit a linear function that minimizes the error between the predicted value and the true value:

f(x) = c0 + c1·x1 + c2·x2 + … + cn·xn
We would like f(x) to support a clean logical judgment; ideally, it would directly express the probability that a sample with features x belongs to a certain class. For example, f(x) > 0.5 would indicate that x belongs to the positive class, and f(x) < 0.5 to the negative class. We also want f(x) to always lie within [0, 1]. Is there such a function?
Enter the sigmoid function, which is defined as follows:

σ(t) = 1 / (1 + e^(−t))
For an intuitive look, the graph of the sigmoid function is shown below (from http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html):
The sigmoid function has all the graceful properties we need: it is defined on the whole real line, its range lies within (0, 1), and its value at 0 is exactly 0.5.
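To make these properties concrete, here is a minimal sketch (an illustration added here, not part of the original post's code) that evaluates the sigmoid at a few points:

import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + e^(-t))
    return 1.0 / (1.0 + np.exp(-t))

# defined on the whole real line, values stay strictly inside (0, 1),
# and the value at 0 is exactly 0.5
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# -> approximately [4.54e-05, 0.2689, 0.5, 0.7311, 0.99995]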
So how do we connect f(x) to the sigmoid function? Let p(x) be the probability that a sample with features x belongs to class 1; the quantity p(x) / [1 − p(x)] is then defined as the odds ratio. Taking its logarithm and modeling it with the linear function above gives:

ln( p(x) / (1 − p(x)) ) = c0 + c1·x1 + … + cn·xn
Solving this equation for p(x) is straightforward:

p(x) = 1 / (1 + e^−(c0 + c1·x1 + … + cn·xn))
We have now arrived at the sigmoid function we wanted. Next, just as in ordinary linear regression, we can fit the n + 1 parameters c0 through cn in the formula.
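As a quick sanity check on this derivation, the following sketch (using a tiny made-up one-feature dataset, purely for illustration) fits scikit-learn's LogisticRegression and reproduces its predict_proba output by pushing the fitted linear score through the sigmoid by hand:

import numpy as np
from sklearn.linear_model import LogisticRegression

# tiny made-up dataset: one feature, two classes
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# linear score z = c0 + c1*x1 + ... + cn*xn, then the sigmoid
z = clf.intercept_ + X.dot(clf.coef_.T)
p_manual = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(p_manual.ravel(), clf.predict_proba(X)[:, 1]))  # True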
2. Test data
For test data, we again choose the 2M movie review dataset from the Cornell University website.
On this dataset we have already tested the kNN classification algorithm and the naive Bayes classification algorithm. Now let's see how this regression-family classification algorithm performs on this kind of sentiment classification problem.
As before, we read the previously saved movie_data.npy and movie_target.npy directly, to save time. A sketch of how such files can be generated follows below.
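If you do not have those files yet, here is a minimal sketch of how they could be produced, assuming the corpus has been unpacked into a hypothetical movie_reviews/ directory with one subfolder per class, which is the layout load_files expects:

import numpy as np
from sklearn.datasets import load_files

# hypothetical layout: movie_reviews/neg/*.txt and movie_reviews/pos/*.txt
reviews = load_files('movie_reviews')
np.save('movie_data.npy', reviews.data)
np.save('movie_target.npy', reviews.target)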
3. Code and Analysis
The code for logistic regression is as follows:
# -*- coding: utf-8 -*-
import time

import numpy as np
from matplotlib import pylab
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report

start_time = time.time()

# Plot the P/R curve
def plot_pr(auc_score, precision, recall, label=None):
    pylab.figure(num=None, figsize=(6, 5))
    pylab.xlim([0.0, 1.0])
    pylab.ylim([0.0, 1.0])
    pylab.xlabel('Recall')
    pylab.ylabel('Precision')
    pylab.title('P/R (AUC=%0.2f) / %s' % (auc_score, label))
    pylab.fill_between(recall, precision, alpha=0.5)
    pylab.grid(True, linestyle='-', color='0.75')
    pylab.plot(recall, precision, lw=1)
    pylab.show()

# Read the saved dataset
movie_data = np.load('movie_data.npy')
movie_target = np.load('movie_target.npy')
x = movie_data
y = movie_target

# Vector space model over TF-IDF features; note that the test
# samples must call the transform interface (not fit_transform)
count_vec = TfidfVectorizer(binary=False, decode_error='ignore',
                            stop_words='english')
average = 0
testNum = 10
for i in range(0, testNum):
    # Load the dataset and split it: 80% training, 20% testing
    x_train, x_test, y_train, y_test \
        = train_test_split(movie_data, movie_target, test_size=0.2)
    x_train = count_vec.fit_transform(x_train)
    x_test = count_vec.transform(x_test)

    # Train the LR classifier
    clf = LogisticRegression()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    p = np.mean(y_pred == y_test)
    print(p)
    average += p

# Precision and recall (computed on the last fold)
answer = clf.predict_proba(x_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, answer)
report = answer > 0.5
print(classification_report(y_test, report, target_names=['neg', 'pos']))
print("average precision:", average / testNum)
print("time spent:", time.time() - start_time)

plot_pr(0.5, precision, recall, "pos")
Running this code produces the following output:
0.8
0.817857142857
0.775
0.825
0.807142857143
0.789285714286
0.839285714286
0.846428571429
0.764285714286
0.771428571429
             precision    recall  f1-score   support

        neg       0.74      0.80      0.77       132
        pos       0.81      0.74      0.77       148

avg / total       0.77      0.77      0.77       280

average precision: 0.803571428571
time spent: 9.651551961898804
First, we ran 10 consecutive test rounds and then averaged the accuracy. Another good evaluation method is K-fold cross-validation, which gives a more reliable estimate of classifier performance and lets you examine the classifier's sensitivity to noise; a sketch follows below.
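For reference, a minimal K-fold sketch under the same assumptions as the code above (a Pipeline is used so that the vectorizer's vocabulary is refit on every training split; in newer scikit-learn versions the import comes from sklearn.model_selection):

from sklearn.cross_validation import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# chain the vectorizer and the classifier so that each fold only
# sees a vocabulary built from its own training split
model = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                  ('lr', LogisticRegression())])
scores = cross_val_score(model, movie_data, movie_target, cv=10)
print(scores.mean(), scores.std())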
Next, look at the last figure, the P/R curve (precision/recall) drawn with precision_recall_curve. Together with the P/R plot, it gives us a deeper understanding of logistic regression.
As mentioned earlier, we usually take 0.5 as the threshold for separating the two classes. With the P/R analysis in hand, the threshold can be chosen more flexibly, and to better effect.
As the curve shows, choosing too low a threshold pushes more test samples into class 1, which raises recall but obviously sacrifices the corresponding precision.
In this example, I might choose 0.42 as the dividing threshold, since both precision and recall are high around that point.
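One way to pick such a threshold programmatically is sketched below (reusing precision, recall, thresholds, and answer from the code above; here the threshold that maximizes the F1 score stands in for an eyeballed 0.42):

import numpy as np

# precision and recall have one more entry than thresholds,
# so drop the last point before aligning them
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]

# reclassify the probabilities with the tuned threshold instead of 0.5
y_pred_tuned = answer > best_threshold
print(best_threshold, np.mean(y_pred_tuned == y_test))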
Finally, some good further resources:
The blog of a gifted woman at Zhejiang University! Notes on the Stanford LR open course: http://blog.csdn.net/abcjennifer/article/details/7716281
Another good blog post summarizing LR: http://xiamaogeng.blog.163.com/blog/static/1670023742013231197530/
sigmoid function: http://computing.dcu.ie/~humphrys/Notes/Neural/sigmoid.html