Python scikit-learn Study Notes: The Iris Model


The iris data set is a simple and interesting one. It comes from measurements scientists made of iris flowers belonging to three different species: setosa, versicolor, and virginica. Because the three species are not easy to tell apart by eye, each flower was measured along four dimensions for quantitative analysis: sepal length, sepal width, petal length, and petal width. With these four features, the data set lends itself to multivariate analysis. Below, we use scikit-learn (sklearn) to analyze this data set from several different angles.


The first idea is this: each of the three species presumably shares some internal similarity, so we might start by dividing the unlabeled data into three groups and then plotting each of them. Framed this way, the task becomes a clustering problem.

As a clustering problem, this can be solved with the K-means model. For background, you can refer to this blog post:

http://blog.csdn.net/heavendai/article/details/7029465

First, a general view of K-means: it is an unsupervised model, meaning we do not tell it the categories up front; it groups the points by itself. How does it do that? Suppose we first map the samples into Euclidean space.

Intuitively, the points in such a plot fall into three groups. We then make an assumption: each group has a central point, and most points are closer to their own group's center than to the centers of the other groups. "Most" is used because of special cases: we should not let a few outlying points override the majority. Based on this idea, we can write down the objective function to be optimized. Assuming we classify n data points into K categories:

J = Σ_n Σ_k r_nk ‖x_n − μ_k‖²

Here r_nk is 1 if point n is assigned to cluster k and 0 otherwise, and μ_k is the center of cluster k. The details of how to optimize this (alternating assignment and update steps) are not covered here.
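To make the objective concrete, here is a minimal sketch (my own addition, not from the original post; the helper names are invented for illustration) of one assignment-and-update round of K-means, together with the distortion J it minimizes:

```python
import numpy as np

def kmeans_objective(X, centers, assignments):
    """Distortion J = sum_n sum_k r_nk * ||x_n - mu_k||^2."""
    return sum(np.sum((X[assignments == k] - mu) ** 2)
               for k, mu in enumerate(centers))

def one_kmeans_step(X, centers):
    """One assignment + update round of Lloyd's algorithm."""
    # Assignment step: r_nk = 1 for the nearest center, 0 otherwise.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points.
    new_centers = np.array([X[assignments == k].mean(axis=0)
                            for k in range(len(centers))])
    return new_centers, assignments

# Tiny toy data: two obvious clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
centers, assignments = one_kmeans_step(X, centers)
print(kmeans_objective(X, centers, assignments))  # small J: clusters are tight
```

Repeating the step until the assignments stop changing is exactly the loop that sklearn's KMeans runs for us.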


Let's look at the implementation code:

```python
from sklearn.cluster import KMeans
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = KMeans(n_clusters=3)
model = clf.fit(X)
predicted = model.predict(X)
```

This calls KMeans from sklearn.cluster; since we know there are three species, we set the number of cluster centers, n_clusters, to 3.
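As a quick sanity check (my own addition, not in the original post): the cluster ids K-means assigns are arbitrary, so a contingency table against the true species is more informative than raw accuracy:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = clf.fit_predict(X)

# Cluster ids are arbitrary (cluster 0 need not mean setosa), so
# compare them to the true species with a contingency table.
table = np.zeros((3, 3), dtype=int)
for true, pred in zip(y, labels):
    table[true, pred] += 1
print(table)  # rows: true species, columns: assigned cluster
```

One cluster should capture all 50 setosa samples (that species is well separated), while versicolor and virginica overlap somewhat.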

Besides n_clusters, KMeans takes parameters such as max_iter, n_init, init, precompute_distances, and so on. Their specific meanings are explained at the following URL:

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

For a more intuitive look, the scikit-learn website has a K-means demo on the iris data. Here is its code:

```python
print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
```
This code uses quite a few matplotlib functions to cluster the different species of flowers and label them. The figure is 3D and can be rotated with the mouse to view from different angles, which is a neat effect.

Besides the clustering idea, we can also treat this as a supervised learning task, since both the data and the categories are known. In that case, the problem becomes one of classification, which can be solved by many models: logistic regression (LR), SVM, decision trees (DT), and so on. Here is a simple example using a decision tree.
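As a sketch of that idea (my own addition; it assumes the modern sklearn.model_selection import path rather than the older sklearn.cross_validation), the three model families can be compared on the iris data with cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'SVM': SVC(),
    'decision tree': DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    # 10-fold cross-validated accuracy for each classifier
    scores = cross_val_score(model, iris.data, iris.target, cv=10)
    print('%s: mean accuracy %.3f' % (name, scores.mean()))
```

All three do well on iris; the point is that the same fit/predict interface makes swapping models nearly free.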

It helps to get a quick look at this model before implementing it. You can refer to this earlier blog post:

http://blog.csdn.net/leo_is_ant/article/details/43565505

First, a decision tree is a tree structure in which each internal node represents a test on a feature, each branch represents a test outcome, and each leaf node represents a category. The tree is built from the class frequencies observed in the data: at each step we select the feature that best separates the samples (i.e., the one with the highest information gain), split on it, and then decide whether to keep building subtrees from the remaining data and feature sets.
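To illustrate the information-gain criterion, here is a toy sketch of my own (note that sklearn's DecisionTreeClassifier actually defaults to the Gini criterion; entropy is available via criterion='entropy'):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H = -sum p * log2(p) over class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the weighted entropy after the split."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

parent = ['a', 'a', 'b', 'b']
# A perfect split removes all uncertainty: gain = 1 bit.
print(information_gain(parent, [['a', 'a'], ['b', 'b']]))  # 1.0
# A useless split leaves the classes as mixed as before: gain = 0.
print(information_gain(parent, [['a', 'b'], ['a', 'b']]))  # 0.0
```

The tree builder simply evaluates this quantity for every candidate split and takes the best one.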

Let's take a look at the implementation code:

```python
from sklearn.datasets import load_iris
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()
model = clf.fit(iris.data, iris.target)
predicted = model.predict(iris.data)
score = cross_val_score(clf, iris.data, iris.target, cv=10)
```
Here the decision tree classifier does the classification. Note that because this is a supervised problem, the fit method needs the iris.target labels as well.

Finally, note that cross_val_score performs cross-validation: the data is split into cv folds, and each fold in turn is held out for testing while the model trains on the remaining cv − 1 folds. The resulting scores help guard against overfitting.
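Roughly, this is what cross_val_score does under the hood. The sketch below (my own, using KFold and the modern sklearn.model_selection path; the real cross_val_score uses stratified folds for classifiers) makes the train/test rotation explicit:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)

# Split into 10 folds; each fold is the test set once while the
# model trains on the other 9, then the scores are averaged.
scores = []
kf = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(iris.data):
    clf.fit(iris.data[train_idx], iris.target[train_idx])
    scores.append(clf.score(iris.data[test_idx], iris.target[test_idx]))
print(np.mean(scores))
```

Because every sample is tested exactly once on a model that never saw it during training, the mean score is a fairer estimate than the training accuracy.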

Finally, here is the URL for the decision tree classifier's parameters:

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree
