This article mainly introduces four topics, which were also the content of my lecture:
1. The PCA dimensionality reduction operation;
2. The PCA package of Sklearn in Python;
3. Drawing subplots with Matplotlib's subplot function;
4. Clustering the diabetes dataset with KMeans and drawing the resulting subplots.
Previous recommendations:
"Python Data Mining Course" One. Installing Python and an introduction to crawlers
"Python Data Mining Course" Two. KMeans clustering data analysis and an introduction to Anaconda
"Python Data Mining Course" Three. KMeans clustering code implementation, operation and optimization
"Python Data Mining Course" Four. Decision tree DTC data analysis and the Iris dataset
"Python Data Mining Course" Five. Linear regression basics and a diabetes prediction case
"Python Data Mining Course" Six. Basics of the Numpy, Pandas and Matplotlib packages
I hope this article helps you, especially students who have just started with data mining and big data; these basics really are important. If there are omissions or errors in the article, please forgive me~
I. PCA dimensionality reduction
Reference article: http://blog.csdn.net/xl890727/article/details/16898315
Reference book: Introduction to Machine Learning
The complexity of any classification or regression method depends on the number of inputs, so to reduce storage and computation time we want to reduce the dimensionality of the problem and discard irrelevant features. At the same time, when the data can be expressed in fewer dimensions without losing information, we can plot it and visually analyze its structure and outliers.
Feature dimensionality reduction means using low-dimensional features to represent high-dimensional ones. There are generally two kinds of methods: feature selection (Feature Selection) and feature extraction (Feature Extraction).
1. Feature selection chooses a subset of the high-dimensional features as the new features. The optimal subset contributes the highest accuracy with the fewest dimensions, discarding the unimportant dimensions and using a suitable error function; common strategies include forward selection (Forward Selection) and backward selection (Backward Selection).
2. Feature extraction maps the high-dimensional features to a low-dimensional space through a function, producing new features. Common feature extraction methods are PCA (Principal Component Analysis) and LDA (Linear Discriminant Analysis).
The following focuses on PCA.
The essence of dimensionality reduction is to learn a mapping function f: x -> y, where x is an original data point represented as an n-dimensional vector and y is the r-dimensional vector after the mapping, with n > r. Through this mapping, data points in the high-dimensional space can be represented in a lower-dimensional space.
Principal Component Analysis (PCA) is a commonly used linear dimensionality reduction method. Its essence is to linearly transform the original features and map them into a low-dimensional space while representing the original data as faithfully as possible.
Through an orthogonal transformation, PCA converts a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. These can be used to extract the main feature components of the data, so PCA is often used for dimensionality reduction of high-dimensional data.
The key idea is: based on the relationships among the variables, replace the original variables with fewer new variables, such that these new variables retain as much of the original information as possible while remaining mutually independent (their information does not overlap).
Graphic explanation: the figure shows a two-dimensional sample set reduced to a one-dimensional representation. Ideally the one-dimensional projection retains as much of the original information as possible, so we choose the red line, which roughly follows the long axis of the data ellipse: in that direction the data are most spread out, the variance is largest, and the most information is kept. Along the short axis the data barely change, so that direction explains little of the data.
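To make this concrete, here is a small numerical sketch (my own illustration, not from the original article): it generates correlated 2-D points and compares the variance of their projections onto a direction along the "long axis" and onto the perpendicular "short axis".

import numpy as np

np.random.seed(0)
# Correlated 2-D data, stretched roughly along the diagonal (the "long axis")
t = np.random.randn(500)
data = np.column_stack((t, 0.9 * t + 0.2 * np.random.randn(500)))
data = data - data.mean(axis=0)

long_axis = np.array([1.0, 1.0]) / np.sqrt(2)    # roughly the ellipse's long axis
short_axis = np.array([1.0, -1.0]) / np.sqrt(2)  # the perpendicular short axis

print('variance along the long axis:  %.4f' % np.var(data.dot(long_axis)))
print('variance along the short axis: %.4f' % np.var(data.dot(short_axis)))
# The long-axis projection keeps far more variance, i.e. far more information.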
Principle explanation:
The following borrows a picture from xl890727 for a simple explanation; my own mathematics is weak, so I am cramming to catch up.
PCA is one of the oldest techniques in multivariate analysis; it originates from the K-L (Karhunen-Loève) transform in communication theory.
The goal is to find a single point whose total distance to the n samples is minimal, which means the n samples can be represented by this one point.
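As a quick sanity check (my own illustration; here "distance" is taken as squared Euclidean distance, as in the usual least-squares setting), the point that minimizes the total squared distance to the n samples is their mean:

import numpy as np

samples = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]])
mean = samples.mean(axis=0)   # [2.0, 2.0]

for point in (mean, np.array([0.0, 0.0]), np.array([2.0, 1.0])):
    cost = np.sum((samples - point) ** 2)  # total squared distance to all samples
    print('%s -> total squared distance %.2f' % (point, cost))
# The mean gives the smallest value, so it "represents" the n samples best.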
Detailed process:
The following is the principal component analysis algorithm process. As I said before, my poor math is a handicap, so this part refers to Baidu Wenku; please bear with me, I really do need to strengthen my mathematics.
The PCA steps are summarized in the following illustration:
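As a companion to the illustration, here is a minimal NumPy sketch of the standard procedure (my own reconstruction, not code from the article): center the data, compute the covariance matrix, eigendecompose it, and project onto the top components.

import numpy as np

def pca_sketch(X, r):
    # 1. Center each feature (subtract the column means)
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (n x n)
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvalues/eigenvectors of the symmetric covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 4. Sort directions by descending variance and keep the top r
    order = np.argsort(eig_vals)[::-1]
    components = eig_vecs[:, order[:r]]
    # 5. Project the centered data onto the r principal components
    return X_centered.dot(components)

X = np.array([[-1.0, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
print(pca_sketch(X, 1))   # one column: scores along the first principal component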
Recommended reference:
PCA tutorial: http://blog.codinglabs.org/articles/pca-tutorial.html - by Zhang Yang
Feature dimensionality reduction - PCA (Principal Component Analysis) - xl890727
Principle and detailed steps of PCA - Baidu Wenku
II. The PCA package of Sklearn in Python
The following introduces the PCA dimensionality reduction method in Sklearn. Reference URL: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
Import method:
from sklearn.decomposition import PCA
The calling function is as follows, where n_components=2 means reducing to 2 dimensions:
pca = PCA(n_components=2)
For example, the following code performs the PCA dimensionality reduction operation:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
print pca
pca.fit(X)
print pca.explained_variance_ratio_
The output results are as follows:
PCA(copy=True, n_components=2, whiten=False)
[ 0.99244291  0.00755711]
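Note that with n_components=2 on two-dimensional data nothing is actually discarded; the second output line is explained_variance_ratio_, showing that the first component alone carries about 99% of the variance. To really reduce the data you would keep one component and transform it, for example (a small illustrative snippet of mine, using the same X as above):

pca1 = PCA(n_components=1)
X_reduced = pca1.fit_transform(X)      # shape (6, 1): one value per sample
print(X_reduced.shape)
print(pca1.explained_variance_ratio_)  # roughly [0.9924]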
Similarly, load the Boston dataset, which has 13 features in total, and reduce it to two features:
# Load the dataset
from sklearn.datasets import load_boston
d = load_boston()
x = d.data
y = d.target
print x[:10]
print u'shape:', x.shape

# Dimensionality reduction
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
newData = pca.fit_transform(x)
print u'data after dimensionality reduction:'
print newData[:4]
print u'shape:', newData.shape
The output is the data reduced to 2 dimensions, as shown below (only the first rows of the raw data are reproduced here):
[[  6.32000000e-03   1.80000000e+01   2.31000000e+00   0.00000000e+00
    5.38000000e-01   6.57500000e+00   6.52000000e+01   4.09000000e+00
    1.00000000e+00   2.96000000e+02   1.53000000e+01   3.96900000e+02
    4.98000000e+00]
 [  2.73100000e-02   0.00000000e+00   7.07000000e+00   0.00000000e+00
    4.69000000e-01   6.42100000e+00   7.89000000e+01   4.96710000e+00
    2.00000000e+00   2.42000000e+02   1.78000000e+01   3.96900000e+02
    9.14000000e+00]
 [  2.72900000e-02   0.00000000e+00   7.07000000e+00   0.00000000e+00
    4.69000000e-01   7.18500000e+00   6.11000000e+01   4.96710000e+00
    2.00000000e+00   2.42000000e+02   1.78000000e+01   3.92830000e+02
    4.03000000e+00]
 [  3.23700000e-02   0.00000000e+00   2.18000000e+00   0.00000000e+00
    4.58000000e-01   6.99800000e+00   4.58000000e+01   6.06220000e+00
    3.00000000e+00   2.22000000e+02   1.87000000e+01   3.94630000e+02
    2.94000000e+00]
 ...]
shape: (506L, 13L)
data after dimensionality reduction:
[[-119.81821283    5.56072403]
 [-168.88993091  -10.11419701]
 [-169.31150637  -14.07855395]
 [-190.2305986   -18.29993274]]
shape: (506L, 2L)
I recommend reading the official documentation; there is a lot to learn from it, such as the PCA reduction example on the Iris dataset.
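For reference, a minimal sketch of an Iris reduction along those lines (my own code, not the documentation's exact example) might look like this:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

iris = load_iris()
pca = PCA(n_components=2)
X_2d = pca.fit_transform(iris.data)    # 4 features reduced to 2 components

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target, marker='o')
plt.title('PCA of the Iris dataset')
plt.show()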
III. KMeans clustering of the diabetes dataset, PCA dimensionality reduction, and subplot drawing
Drawing multiple subplots
The hierarchy of common classes in Matplotlib is Figure -> Axes -> (Line2D, Text, etc.). A Figure object can contain multiple subplots (Axes); an Axes object represents a plotting area in Matplotlib and can be understood as a subplot. You can use subplot() to quickly draw a chart containing multiple subplots, called as follows:
subplot(numRows, numCols, plotNum)
subplot divides the whole drawing area into numRows rows by numCols columns of sub-regions, then numbers the sub-regions from left to right and top to bottom, with the upper-left sub-region numbered 1. If numRows, numCols and plotNum are all less than 10, they can be abbreviated as a single integer; for example, subplot(323) is the same as subplot(3, 2, 3). subplot creates an Axes object in the region specified by plotNum; if the newly created Axes overlaps a previously created one, the previous Axes is deleted.
The current figure and axes can be obtained with gcf() and gca(), which stand for "Get Current Figure" and "Get Current Axes" respectively. gcf() returns the Figure object representing the current chart, and gca() returns the Axes object representing the current subplot. Let's run the program below in Python and then call gcf() and gca() to look at the current Figure and Axes objects.
import numpy as np
import matplotlib.pyplot as plt

plt.figure(1)           # create chart 1
plt.figure(2)           # create chart 2
ax1 = plt.subplot(211)  # create subplot 1 in chart 2
ax2 = plt.subplot(212)  # create subplot 2 in chart 2

x = np.linspace(0, 3, 100)
for i in xrange(5):
    plt.figure(1)             # select chart 1
    plt.plot(x, np.exp(i*x/3))
    plt.sca(ax1)              # select subplot 1 of chart 2
    plt.plot(x, np.sin(i*x))
    plt.sca(ax2)              # select subplot 2 of chart 2
    plt.plot(x, np.cos(i*x))
plt.show()
The output is shown in the following illustration:
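As mentioned above, after running the script (and before plt.show() is closed) you can inspect the current Figure and Axes objects; a small sketch of what that might look like (the exact repr text depends on your Matplotlib version):

print(plt.gcf())   # e.g. Figure(640x480) -- chart 2, the figure selected last
print(plt.gca())   # the Axes of subplot 212 (ax2), the axes selected last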
Detailed Code
The following example performs KMeans clustering on the diabetes dataset loaded with load_diabetes, then uses PCA to reduce the dataset to two dimensions, and finally clusters it into 2, 3, 4 and 5 classes, displaying the results as subplots via subplot.
# -*- coding: utf-8 -*-

# Diabetes dataset
from sklearn.datasets import load_diabetes
data = load_diabetes()
x = data.data
print x[:4]
y = data.target
print y[:4]

# KMeans clustering algorithm
from sklearn.cluster import KMeans

# Training
clf = KMeans(n_clusters=2)
print clf
clf.fit(x)

# Prediction
pre = clf.predict(x)
print pre[:10]

# PCA dimensionality reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
newData = pca.fit_transform(x)
print newData[:4]

L1 = [n[0] for n in newData]
L2 = [n[1] for n in newData]

# Plotting
import numpy as np
import matplotlib.pyplot as plt

# Display Chinese labels correctly
plt.rc('font', family='SimHei', size=8)
#plt.rcParams['font.sans-serif'] = ['SimHei']
# Display the minus sign correctly
plt.rcParams['axes.unicode_minus'] = False

p1 = plt.subplot(221)
plt.title(u"Kmeans cluster n=2")
plt.scatter(L1, L2, c=pre, marker="s")
plt.sca(p1)

###################################
# Clustering with 3 clusters
clf = KMeans(n_clusters=3)
clf.fit(x)
pre = clf.predict(x)
p2 = plt.subplot(222)
plt.title("Kmeans n=3")
plt.scatter(L1, L2, c=pre, marker="s")
plt.sca(p2)

###################################
# Clustering with 4 clusters
clf = KMeans(n_clusters=4)
clf.fit(x)
pre = clf.predict(x)
p3 = plt.subplot(223)
plt.title("Kmeans n=4")
plt.scatter(L1, L2, c=pre, marker="+")
plt.sca(p3)

###################################
# Clustering with 5 clusters
clf = KMeans(n_clusters=5)
clf.fit(x)
pre = clf.predict(x)
p4 = plt.subplot(224)
plt.title("Kmeans n=5")
plt.scatter(L1, L2, c=pre, marker="+")
plt.sca(p4)

# Save the figure locally
plt.savefig('power.png', dpi=300)
plt.show()
The output is shown in the figure below, which is useful for experimental comparison.
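As a possible companion to the visual comparison (my own addition, not from the original article), you can also compare the cluster counts numerically using KMeans' inertia_ attribute, the within-cluster sum of squared distances, which always decreases as k grows but tends to flatten out near a reasonable k:

from sklearn.cluster import KMeans

for k in (2, 3, 4, 5):
    clf = KMeans(n_clusters=k)
    clf.fit(x)   # x is the raw diabetes data loaded above
    print('k=%d  inertia=%.2f' % (k, clf.inertia_))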
Finally, I hope this article is helpful to you, especially my students and readers who are just getting into data mining and machine learning. It is the 24th, and I finished this in the middle of the night; I am really tired. I spent Saturday writing in the office, and the semester assessment has finally ended. Exhausted, but fortunately there are many lovely students, and they are growing; after all, the experience is worth it. "Her dimples hold no wine, yet I am drunk like a dog." Keep it up, Mr. Yang~
(By:eastmount 2016-11-26 4:30 P.M. http://blog.csdn.net/eastmount/)