First, understanding logistic regression and its application scenarios
Logistic regression is a probabilistic nonlinear regression model: a multivariable analysis method that studies the relationship between a binary outcome and a set of influencing factors. A typical problem is to study whether a certain outcome occurs under given conditions; for example, in medicine, judging from a patient's symptoms whether the patient suffers from a certain disease.
Second, LR classifier
LR classifier is short for Logistic Regression Classifier.
In a classification task, the learned LR classifier is a set of weights. When a test sample is fed in, those weights are linearly combined with the sample's feature values, one weight per feature.
The resulting sum is passed through the sigmoid function, whose domain is (-∞, +∞) and whose range is (0, 1); the most basic LR classifier is therefore suited to classifying two kinds of targets.
So the key problem of logistic regression is how to obtain this set of weights. This problem is solved with maximum likelihood estimation.
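As a minimal sketch of the prediction step just described (the weight values below are made-up illustrations, not weights learned from any data), the classifier linearly combines the weights with a sample's features and squashes the sum through the sigmoid:

```python
import math

def sigmoid(z):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    # Linear combination of weights and feature values, then sigmoid
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)  # probability that the sample belongs to class 1

# Hypothetical weights; in practice they come from maximum likelihood estimation
weights = [0.8, -0.5, 1.2]
bias = -0.3
p = predict(weights, bias, [1.0, 2.0, 0.5])
label = 1 if p >= 0.5 else 0
```

Thresholding the probability at 0.5 gives the final two-class decision.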
Third, logistic regression model
Consider a vector x = (x1, x2, ..., xm) of m independent variables, and let the conditional probability P(y = 1 | x) = p be the probability that the observed event occurs given x. Then the logistic regression model can be expressed as

p = P(y = 1 | x) = 1 / (1 + e^(-g(x)))

where g(x) = w0 + w1*x1 + ... + wm*xm, and 1 / (1 + e^(-g(x))) is called the logistic function.
The probability that the event does not occur under the same conditions is then

1 - p = 1 / (1 + e^(g(x)))

so the ratio of the probability that the event occurs to the probability that it does not is

p / (1 - p) = e^(g(x))

This ratio is called the occurrence ratio of the event (the odds of experiencing an event), abbreviated as odds.
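To make the odds concrete, here is a small sketch (the coefficient values are illustrative assumptions): with g(x) = w0 + w1*x1, the odds p / (1 - p) equal e^(g(x)), so each unit increase in x1 multiplies the odds by exactly e^(w1):

```python
import math

w0, w1 = -1.0, 0.7  # illustrative coefficients, not fitted values

def prob(x1):
    # Logistic function: p = 1 / (1 + e^(-g(x)))
    g = w0 + w1 * x1
    return 1.0 / (1.0 + math.exp(-g))

def odds(x1):
    # Ratio of the probability of occurring to not occurring
    p = prob(x1)
    return p / (1.0 - p)

# The odds ratio between x1 + 1 and x1 equals e^(w1)
ratio = odds(3.0) / odds(2.0)
```

This multiplicative interpretation of the coefficients is a standard way to read a fitted logistic model.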
Summary:
In general, regression is not used for classification problems, because regression is a continuous model and is relatively sensitive to noise.
If you must apply a regression approach to a classification problem, you can use logistic regression.
Logistic regression is essentially linear regression, except that a layer of function mapping is added between the features and the result.
That is, the features are summed linearly, and then the function g(z) is applied as the hypothesis function to make the prediction. g(z) maps the continuous value into the interval (0, 1).
The hypothesis function of logistic regression is h(x) = g(θᵀx) = 1 / (1 + e^(-θᵀx)), whereas the linear regression hypothesis function is just h(x) = θᵀx.
Logistic regression is used for the 0/1 classification problem, i.e., the binary classification problem in which the predicted result belongs to class 0 or class 1.
It assumes the binary outcome satisfies the Bernoulli distribution (the 0/1 distribution, or two-point distribution), i.e., P(y = 1 | x; θ) = h(x) and P(y = 0 | x; θ) = 1 - h(x).
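Under this Bernoulli assumption, both cases collapse into the single expression P(y | x; θ) = h(x)^y * (1 - h(x))^(1 - y), which is what maximum likelihood estimation works with. A short sketch (the probability values here are assumed, not fitted):

```python
import math

def bernoulli_pmf(y, p):
    # P(y=1) = p and P(y=0) = 1 - p, written as one expression
    return (p ** y) * ((1.0 - p) ** (1 - y))

h = 0.8  # assumed output of the hypothesis function for some sample
assert bernoulli_pmf(1, h) == h         # case y = 1
assert bernoulli_pmf(0, h) == 1.0 - h   # case y = 0

# Maximum likelihood estimation maximizes the sum of log-probabilities
# over all training samples; two assumed samples shown here:
log_lik = math.log(bernoulli_pmf(1, h)) + math.log(bernoulli_pmf(0, 0.3))
```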
Fourth, logistic regression application case
(1) Analysis of the LogisticRegressionCV function in sklearn
(2) The code is as follows:
The data file is available at the following link: https://pan.baidu.com/s/1dEWUEhb Password: bm1p
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: zhengzhengliu
# Breast cancer classification example

from sklearn.linear_model import LogisticRegressionCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import ConvergenceWarning
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import warnings

# Fix Chinese character display in matplotlib
mpl.rcParams["font.sans-serif"] = [u"SimHei"]
mpl.rcParams["axes.unicode_minus"] = False

# Suppress convergence warnings
warnings.filterwarnings(action="ignore", category=ConvergenceWarning)

# Load the data and clean out abnormal records
path = "datas/breast-cancer-wisconsin.data"
names = ["id", "Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape",
         "Marginal Adhesion", "Single Epithelial Cell Size", "Bare Nuclei",
         "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]
df = pd.read_csv(path, header=None, names=names)
datas = df.replace("?", np.nan).dropna(how="any")  # drop a row as soon as any column holds NaN
# print(datas.head())  # shows the first five rows by default

# Feature extraction and data splitting
X = datas[names[1:10]]
Y = datas[names[10]]

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1, random_state=0)

# Standardize the training set
ss = StandardScaler()
x_train = ss.fit_transform(x_train)  # fit to the data first, then standardize

# Build and train the model
## multi_class: category handling, "ovr" (default) or "multinomial"; no difference for binary logistic regression
## cv: number of cross-validation folds
## solver: optimization algorithm. When penalty is "l1", it can only be "liblinear" (coordinate descent);
##   "lbfgs" and "newton-cg" both rely on a Taylor expansion of the objective function.
##   When penalty is "l2", the solver can be "lbfgs" (quasi-Newton), "newton-cg" (a Newton-method variant),
##   or "sag" (mini-batch stochastic average gradient descent).
##   For dimension < 10000, "lbfgs" is a good choice; for dimension > 10000, "sag" is better;
##   with GPU computation, "lbfgs" and "newton-cg" are faster than "sag".
## penalty: regularization to combat overfitting, "l1" or "l2"
## tol: stop when the objective function's decrease falls below this value (the tolerance), preventing excess computation
lr = LogisticRegressionCV(multi_class="ovr", fit_intercept=True, Cs=np.logspace(-2, 2, 20),
                          cv=2, penalty="l2", solver="lbfgs", tol=0.01)
re = lr.fit(x_train, y_train)

# Evaluate the model
r = re.score(x_train, y_train)
print("R value (accuracy):", r)
print("Coefficients:", re.coef_)
print("Intercept:", re.intercept_)
print("Sparse feature ratio: %.2f%%" % (np.mean(lr.coef_.ravel() == 0) * 100))

# Predict on the test set
x_test = ss.transform(x_test)  # standardize the test data with the already fitted scaler
print("=========sigmoid-converted values, i.e. probabilities=========")
print(re.predict_proba(x_test))  # values mapped through the sigmoid function, i.e. the probability p

# Save and persist the models
from sklearn.externals import joblib  # in newer scikit-learn versions, use `import joblib` instead
joblib.dump(ss, "logistic_ss.model")  # save the standardization model
joblib.dump(lr, "logistic_lr.model")  # save the trained logistic model
joblib.load("logistic_ss.model")  # load a saved model file
joblib.load("logistic_lr.model")