Learning the linear classifier API with Python 3
This post predicts whether tumors are benign or malignant, using logistic regression and stochastic parameter estimation (scikit-learn's SGDClassifier) in turn.
I have downloaded the dataset locally. You can get the source code and the dataset from my GitHub: https://github.com/linyi0604/kaggle
import numpy as np
import pandas as pd
# train_test_split lives in sklearn.model_selection (formerly sklearn.cross_validation)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report

'''
Linear classifiers
    The most basic and commonly used machine learning models.
    They are limited by the assumption of a linear relationship between the data features and the classification target.
    Logistic regression: longer computation time, slightly higher model performance.
    Stochastic parameter estimation: shorter computation time, slightly lower model performance.
'''

'''
1 Data preprocessing
'''
# Create the list of feature names
column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
                'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
# Use pandas.read_csv to load the dataset
data = pd.read_csv('./data/breast/breast-cancer-wisconsin.data', names=column_names)
# Replace '?' with the standard missing-value representation
data = data.replace(to_replace='?', value=np.nan)
# Drop every row that has a missing value in any dimension
data = data.dropna(how='any')
# Print the number of rows and dimensions of the data
# print(data.shape)

'''
2 Prepare the training and test data for benign/malignant prediction
'''
# Randomly sample 25% of the data for testing and 75% for training
X_train, X_test, y_train, y_test = train_test_split(data[column_names[1:10]],
                                                    data[column_names[10]],
                                                    test_size=0.25,
                                                    random_state=33)
# Check the number and class distribution of the training and test samples
# print(y_train.value_counts())
# print(y_test.value_counts())
'''
Training samples: 512 in total, 344 benign and 168 malignant
2    344
4    168
Name: Class, dtype: int64
Test samples: 171 in total, 100 benign and 71 malignant
2    100
4     71
Name: Class, dtype: int64
'''

'''
3 Train the machine learning models and predict
'''
# Standardize the data so that every feature has mean 0 and variance 1;
# the predictions will then not be dominated by features with large values
ss = StandardScaler()
X_train = ss.fit_transform(X_train)  # fit the scaler on X_train and standardize it
X_test = ss.transform(X_test)        # standardize X_test with the same rules as X_train, without refitting

# Use logistic regression and stochastic parameter estimation to learn and predict
lr = LogisticRegression()   # initialize the logistic regression model
sgdc = SGDClassifier()      # initialize the stochastic parameter estimation model

# Train logistic regression on the training set
lr.fit(X_train, y_train)
# After training, save the predictions for the test set in lr_y_predict
lr_y_predict = lr.predict(X_test)

# Train the stochastic parameter estimation model on the training set
sgdc.fit(X_train, y_train)
# After training, save the predictions for the test set in sgdc_y_predict
sgdc_y_predict = sgdc.predict(X_test)

'''
4 Performance analysis
'''
# Use the model's own score function to get the accuracy of logistic regression on the test set
print("Logistic regression accuracy:", lr.score(X_test, y_test))
# Other metrics for logistic regression
print("Other metrics for logistic regression:\n", classification_report(y_test, lr_y_predict, target_names=["Benign", "Malignant"]))

# Performance analysis of stochastic parameter estimation
print("Stochastic parameter estimation accuracy:", sgdc.score(X_test, y_test))
# Other metrics for stochastic parameter estimation
print("Other metrics for stochastic parameter estimation:\n", classification_report(y_test, sgdc_y_predict, target_names=["Benign", "Malignant"]))

'''
recall      recall rate
precision   precision rate
f1-score
support

Logistic regression accuracy: 0.9707602339181286
Other metrics for logistic regression:
              precision    recall  f1-score   support

     Benign       0.96      0.99      0.98       100
  Malignant       0.99      0.94      0.96        71

avg / total       0.97      0.97      0.97       171

Stochastic parameter estimation accuracy: 0.9649122807017544
Other metrics for stochastic parameter estimation:
              precision    recall  f1-score   support

     Benign       0.97      0.97      0.97       100
  Malignant       0.96      0.96      0.96        71

avg / total       0.96      0.96      0.96       171
'''
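The precision, recall, f1-score and support columns in the report above can be made concrete with a small hand computation. The following is only a minimal sketch in plain Python, not scikit-learn's implementation: the helper name precision_recall_f1 and the toy label lists are hypothetical, invented for illustration, and reuse the dataset's label encoding (2 = benign, 4 = malignant).

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, f1 and support for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: the class was present and was predicted
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: the class was predicted but was not present
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: the class was present but was not predicted
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


# Toy labels, made up for illustration (2 = benign, 4 = malignant)
toy_true = [2, 2, 2, 2, 4, 4, 4, 4]
toy_pred = [2, 2, 2, 2, 4, 4, 2, 2]

for label, name in [(2, "Benign"), (4, "Malignant")]:
    p, r, f, s = precision_recall_f1(toy_true, toy_pred, positive=label)
    print(name, round(p, 2), round(r, 2), round(f, 2), s)
```

In this toy example two malignant samples are predicted as benign, so the malignant class keeps perfect precision but its recall drops to 0.5. In a tumor-screening setting that kind of miss is the costly one, which is why it helps to read recall for the malignant class alongside the overall accuracy.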
The path of machine learning: a Python linear regression classifier for predicting benign and malignant tumors