Python Machine Learning and Practice -- Introduction 3 (Logistic Regression)


The first two articles in this series were introductory texts that walked through the complete Python source code for the "benign/malignant breast cancer prediction" problem.

From the previous two articles we can see that the "benign/malignant breast cancer prediction" problem is a binary classification task. The two categories to be predicted are benign and malignant breast tumors. Categories are typically represented with discrete integers; as the following table shows, in the "Tumor Type" column 0 stands for benign and 1 for malignant.

Sample  Clump Thickness  Cell Size  Tumor Type
0       1                1          0
1       4                4          0
2       1                1          0
3       8                8          0
4       1                1          0
5       10               10         1
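To make the encoding concrete, here is a minimal sketch (not part of the original article) that rebuilds the small table above as a pandas DataFrame; the column names 'Clump Thickness', 'Cell Size' and 'Type' follow the code used later in this article:

import pandas as pd

# Toy reconstruction of the table above: 0 = benign, 1 = malignant
samples = pd.DataFrame({
    'Clump Thickness': [1, 4, 1, 8, 1, 10],
    'Cell Size': [1, 4, 1, 8, 1, 10],
    'Type': [0, 0, 0, 0, 0, 1],
})
print(samples)
# Count how many rows fall into each class
print(samples['Type'].value_counts())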

The complete dataset contains more than two tumor features, but in this example we use only these two, and the test set contains 175 samples. Let's look at how these 175 tumor samples are distributed in the two-dimensional feature space, shown in the figure further below, where x marks malignant tumors and o marks benign ones.
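Before plotting, it can help to confirm the size and class balance of the test set; a minimal sketch, assuming the same breast-cancer-test.csv file and column names used in the code below:

import pandas as pd

df_test = pd.read_csv('breast-cancer-test.csv')
# Number of test samples (175 in this example)
print(df_test.shape[0])
# How many benign (0) and malignant (1) samples the test set contains
print(df_test['Type'].value_counts())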


The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt

# Use pandas' read_csv function to read the training set into the variable df_train
df_train = pd.read_csv('breast-cancer-train.csv')
# Use pandas' read_csv function to read the test set into the variable df_test
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

# Plot the benign tumor samples, marked with red 'o'
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
# Plot the malignant tumor samples, marked with black 'x'
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')

# Label the x and y axes
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
# Show the figure
plt.show()
Next we randomly initialize a binary classifier that separates benign from malignant tumors with a straight line. Two things determine this line: its coefficients (slope) and its intercept. These are what we call the model parameters, and they are what the classifier needs to learn from the training data. Initially, with randomly initialized parameters, the classifier performs as shown in the figure further below.
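To make the role of the coefficients and intercept concrete, here is a minimal sketch of the decision rule a linear classifier applies to a single point; the helper classify and the assignment of the positive side to class 1 are only illustrative conventions, not part of the article's code:

import numpy as np

def classify(clump_thickness, cell_size, coef, intercept):
    # The separating line is coef[0] * x + coef[1] * y + intercept = 0;
    # points with a positive score fall on one side (predicted 1, malignant here),
    # points with a non-positive score on the other (predicted 0, benign).
    score = coef[0] * clump_thickness + coef[1] * cell_size + intercept
    return 1 if score > 0 else 0

# With randomly initialized parameters the prediction is essentially a guess
coef = np.random.random(2)
intercept = np.random.random(1)[0]
print(classify(5, 5, coef, intercept))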


The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import the numpy toolkit and alias it np
import numpy as np
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt

# Use pandas' read_csv function to read the training set and the test set
df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

# Use numpy's random module to initialize the intercept and coefficients randomly
intercept = np.random.random([1])
coef = np.random.random([2])
lx = np.arange(0, 12)
ly = (-intercept - lx * coef[0]) / coef[1]

# Draw the randomly initialized separating line
plt.plot(lx, ly, c='yellow')
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
plt.show()
 
Next we let the classifier learn from a small portion of the training samples (the first 10), and its performance improves considerably, as the figure further below shows.
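The limited training set here is simply the first 10 rows of breast-cancer-train.csv; a minimal sketch of how that slice is taken with pandas:

import pandas as pd

df_train = pd.read_csv('breast-cancer-train.csv')
# [:10] keeps only the first 10 rows, so the classifier below is fitted on 10 samples
X_small = df_train[['Clump Thickness', 'Cell Size']][:10]
y_small = df_train['Type'][:10]
print(X_small.shape, y_small.shape)   # expected: (10, 2) (10,)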


The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import the numpy toolkit and alias it np
import numpy as np
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt
# Import the logistic regression classifier from sklearn
from sklearn.linear_model import LogisticRegression

# Use pandas' read_csv function to read the training set and the test set
df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

lr = LogisticRegression()
# Use only the first 10 training samples to learn the line's coefficients and intercept
lr.fit(df_train[['Clump Thickness', 'Cell Size']][:10], df_train['Type'][:10])
print('Testing accuracy (10 training samples):',
      lr.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))

intercept = lr.intercept_
coef = lr.coef_[0, :]
lx = np.arange(0, 12)
# The decision boundary is lx * coef[0] + ly * coef[1] + intercept = 0;
# mapped onto the two-dimensional plane it becomes:
ly = (-intercept - lx * coef[0]) / coef[1]

# Draw the learned separating line
plt.plot(lx, ly, c='green')
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
plt.show()

The printed output is: Testing accuracy (10 training samples): 0.868571428571 (i.e., 152 of the 175 test samples are classified correctly).
As the figure above shows, after learning from only 10 training samples the classifier's performance already improves somewhat, reaching a classification accuracy of about 86.9% on the test set. If we go on to learn from all of the training samples, performance improves further and the accuracy on the test set reaches about 93.7%, as shown in the figure further below.
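To reproduce the 86.9% versus 93.7% comparison directly, a minimal sketch (same files and columns as above; the names lr_small and lr_full are only illustrative) trains both variants and scores them on the same test set:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

X_train = df_train[['Clump Thickness', 'Cell Size']]
y_train = df_train['Type']
X_test = df_test[['Clump Thickness', 'Cell Size']]
y_test = df_test['Type']

# Fit one model on only the first 10 training samples and one on all of them
lr_small = LogisticRegression().fit(X_train[:10], y_train[:10])
lr_full = LogisticRegression().fit(X_train, y_train)

print('10 training samples :', lr_small.score(X_test, y_test))
print('all training samples:', lr_full.score(X_test, y_test))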

The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import the numpy toolkit and alias it np
import numpy as np
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt
# Import the logistic regression classifier from sklearn
from sklearn.linear_model import LogisticRegression

# Use pandas' read_csv function to read the training set and the test set
df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

lr = LogisticRegression()
# Use all of the training samples to learn the line's coefficients and intercept
lr.fit(df_train[['Clump Thickness', 'Cell Size']], df_train['Type'])
print('Testing accuracy (all training samples):',
      lr.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))

intercept = lr.intercept_
coef = lr.coef_[0, :]
lx = np.arange(0, 12)
# The decision boundary is lx * coef[0] + ly * coef[1] + intercept = 0;
# mapped onto the two-dimensional plane it becomes:
ly = (-intercept - lx * coef[0]) / coef[1]

# Draw the learned separating line
plt.plot(lx, ly, c='green')
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
plt.show()
The printed output is: Testing accuracy (all training samples): 0.937142857143 (i.e., 164 of the 175 test samples are classified correctly).
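Accuracy is only a single number; as an optional extra (using sklearn.metrics, which is not part of the original article), you can also inspect per-class precision and recall on the same test set:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

lr = LogisticRegression()
lr.fit(df_train[['Clump Thickness', 'Cell Size']], df_train['Type'])

y_pred = lr.predict(df_test[['Clump Thickness', 'Cell Size']])
# Precision, recall and F1 score for each class, not just overall accuracy
print(classification_report(df_test['Type'], y_pred, target_names=['Benign', 'Malignant']))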

This code is only meant to help you review the most basic Python programming elements, so that the examples in later articles are easier to understand and practice.

The data can be downloaded from http://pan.baidu.com/s/1jI00k8Q.


