Python Machine Learning and Practice -- Introduction 3 (Logistic Regression)


The first two articles in this series were introductory texts that walked through the complete Python source code for the "benign/malignant breast cancer prediction" problem.

From the previous two articles we can see that the "benign/malignant breast cancer prediction" problem is a binary classification task. The two categories to be predicted are benign and malignant breast tumors. Categories are typically represented with discrete integers; as the following table shows, in the "Tumor Type" column 0 stands for benign and 1 for malignant.

Sample  Clump Thickness  Cell Size  Tumor Type
0       1                1          0
1       4                4          0
2       1                1          0
3       8                8          0
4       1                1          0
5       10               10         1
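To make the encoding concrete, here is a minimal sketch (not part of the original article) that rebuilds the small table above as a pandas DataFrame; the column names 'Clump Thickness', 'Cell Size' and 'Type' follow the code used later in this article:

import pandas as pd

# Toy reconstruction of the table above: 0 = benign, 1 = malignant
samples = pd.DataFrame({
    'Clump Thickness': [1, 4, 1, 8, 1, 10],
    'Cell Size': [1, 4, 1, 8, 1, 10],
    'Type': [0, 0, 0, 0, 0, 1],
})
print(samples)
# Count how many rows fall into each class
print(samples['Type'].value_counts())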

The complete dataset contains more than two tumor features, but in this example we use only these two, and the test set contains 175 samples. Let's look at how these 175 tumor samples are distributed in the two-dimensional feature space, shown in the figure further below, where x marks malignant tumors and o marks benign ones.
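Before plotting, it can help to confirm the size and class balance of the test set; a minimal sketch, assuming the same breast-cancer-test.csv file and column names used in the code below:

import pandas as pd

df_test = pd.read_csv('breast-cancer-test.csv')
# Number of test samples (175 in this example)
print(df_test.shape[0])
# How many benign (0) and malignant (1) samples the test set contains
print(df_test['Type'].value_counts())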


The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt

# Use pandas' read_csv function to read the training set into the variable df_train
df_train = pd.read_csv('breast-cancer-train.csv')
# Use pandas' read_csv function to read the test set into the variable df_test
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

# Plot the benign tumor samples, marked with red 'o'
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
# Plot the malignant tumor samples, marked with black 'x'
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')

# Label the x and y axes
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
# Show the figure
plt.show()
Next we randomly initialize a binary classifier that separates benign from malignant tumors with a straight line. Two things determine this line: its coefficients (slope) and its intercept. These are what we call the model parameters, and they are what the classifier needs to learn from the training data. Initially, with randomly initialized parameters, the classifier performs as shown in the figure further below.
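To make the role of the coefficients and intercept concrete, here is a minimal sketch of the decision rule a linear classifier applies to a single point; the helper classify and the assignment of the positive side to class 1 are only illustrative conventions, not part of the article's code:

import numpy as np

def classify(clump_thickness, cell_size, coef, intercept):
    # The separating line is coef[0] * x + coef[1] * y + intercept = 0;
    # points with a positive score fall on one side (predicted 1, malignant here),
    # points with a non-positive score on the other (predicted 0, benign).
    score = coef[0] * clump_thickness + coef[1] * cell_size + intercept
    return 1 if score > 0 else 0

# With randomly initialized parameters the prediction is essentially a guess
coef = np.random.random(2)
intercept = np.random.random(1)[0]
print(classify(5, 5, coef, intercept))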


The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import the numpy toolkit and alias it np
import numpy as np
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt

# Use pandas' read_csv function to read the training set and the test set
df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

# Use numpy's random module to initialize the intercept and coefficients randomly
intercept = np.random.random([1])
coef = np.random.random([2])
lx = np.arange(0, 12)
ly = (-intercept - lx * coef[0]) / coef[1]

# Draw the randomly initialized separating line
plt.plot(lx, ly, c='yellow')
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
plt.show()
 
Next we let the classifier learn from a small portion of the training samples (the first 10), and its performance improves considerably, as the figure further below shows.
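The limited training set here is simply the first 10 rows of breast-cancer-train.csv; a minimal sketch of how that slice is taken with pandas:

import pandas as pd

df_train = pd.read_csv('breast-cancer-train.csv')
# [:10] keeps only the first 10 rows, so the classifier below is fitted on 10 samples
X_small = df_train[['Clump Thickness', 'Cell Size']][:10]
y_small = df_train['Type'][:10]
print(X_small.shape, y_small.shape)   # expected: (10, 2) (10,)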


The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import the numpy toolkit and alias it np
import numpy as np
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt
# Import the logistic regression classifier from sklearn
from sklearn.linear_model import LogisticRegression

# Use pandas' read_csv function to read the training set and the test set
df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

lr = LogisticRegression()
# Use only the first 10 training samples to learn the line's coefficients and intercept
lr.fit(df_train[['Clump Thickness', 'Cell Size']][:10], df_train['Type'][:10])
print('Testing accuracy (10 training samples):',
      lr.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))

intercept = lr.intercept_
coef = lr.coef_[0, :]
lx = np.arange(0, 12)
# The decision boundary is lx * coef[0] + ly * coef[1] + intercept = 0;
# mapped onto the two-dimensional plane it becomes:
ly = (-intercept - lx * coef[0]) / coef[1]

# Draw the learned separating line
plt.plot(lx, ly, c='green')
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
plt.show()

The printed output is: Testing accuracy (10 training samples): 0.868571428571 (i.e., 152 of the 175 test samples are classified correctly).
As the figure above shows, after learning from only 10 training samples the classifier's performance already improves somewhat, reaching a classification accuracy of about 86.9% on the test set. If we go on to learn from all of the training samples, performance improves further and the accuracy on the test set reaches about 93.7%, as shown in the figure further below.
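To reproduce the 86.9% versus 93.7% comparison directly, a minimal sketch (same files and columns as above; the names lr_small and lr_full are only illustrative) trains both variants and scores them on the same test set:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

X_train = df_train[['Clump Thickness', 'Cell Size']]
y_train = df_train['Type']
X_test = df_test[['Clump Thickness', 'Cell Size']]
y_test = df_test['Type']

# Fit one model on only the first 10 training samples and one on all of them
lr_small = LogisticRegression().fit(X_train[:10], y_train[:10])
lr_full = LogisticRegression().fit(X_train, y_train)

print('10 training samples :', lr_small.score(X_test, y_test))
print('all training samples:', lr_full.score(X_test, y_test))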

The code to draw this picture is as follows:

# -*- coding: utf-8 -*-
# Import the pandas package and alias it pd
import pandas as pd
# Import the numpy toolkit and alias it np
import numpy as np
# Import pyplot from the matplotlib toolkit and name it plt
import matplotlib.pyplot as plt
# Import the logistic regression classifier from sklearn
from sklearn.linear_model import LogisticRegression

# Use pandas' read_csv function to read the training set and the test set
df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

# Select Clump Thickness and Cell Size as features and build the negative and positive samples of the test set
df_test_negative = df_test.loc[df_test['Type'] == 0][['Clump Thickness', 'Cell Size']]
df_test_positive = df_test.loc[df_test['Type'] == 1][['Clump Thickness', 'Cell Size']]

lr = LogisticRegression()
# Use all of the training samples to learn the line's coefficients and intercept
lr.fit(df_train[['Clump Thickness', 'Cell Size']], df_train['Type'])
print('Testing accuracy (all training samples):',
      lr.score(df_test[['Clump Thickness', 'Cell Size']], df_test['Type']))

intercept = lr.intercept_
coef = lr.coef_[0, :]
lx = np.arange(0, 12)
# The decision boundary is lx * coef[0] + ly * coef[1] + intercept = 0;
# mapped onto the two-dimensional plane it becomes:
ly = (-intercept - lx * coef[0]) / coef[1]

# Draw the learned separating line
plt.plot(lx, ly, c='green')
plt.scatter(df_test_negative['Clump Thickness'], df_test_negative['Cell Size'], marker='o', s=200, c='red')
plt.scatter(df_test_positive['Clump Thickness'], df_test_positive['Cell Size'], marker='x', s=150, c='black')
plt.xlabel('Clump Thickness')
plt.ylabel('Cell Size')
plt.show()
The printed output is: Testing accuracy (all training samples): 0.937142857143 (i.e., 164 of the 175 test samples are classified correctly).
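Accuracy is only a single number; as an optional extra (using sklearn.metrics, which is not part of the original article), you can also inspect per-class precision and recall on the same test set:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df_train = pd.read_csv('breast-cancer-train.csv')
df_test = pd.read_csv('breast-cancer-test.csv')

lr = LogisticRegression()
lr.fit(df_train[['Clump Thickness', 'Cell Size']], df_train['Type'])

y_pred = lr.predict(df_test[['Clump Thickness', 'Cell Size']])
# Precision, recall and F1 score for each class, not just overall accuracy
print(classification_report(df_test['Type'], y_pred, target_names=['Benign', 'Malignant']))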

This code is only meant to help you review the most basic Python programming elements, so that the examples in later articles are easier to understand and practice.

The data can be downloaded from http://pan.baidu.com/s/1jI00k8Q.


