"Scikit-learn" learning python to classify real-world data


Introduction

Can a machine tell the variety of a flower from a photograph? From a machine learning point of view this is a classification problem: the machine learns from data on different varieties of flowers so that it can then classify unlabeled test data.
In this section we again start from Scikit-learn to understand the basic principles of classification and get some hands-on practice.

Iris Data Set

The Iris flower dataset is a classic dataset introduced by Sir Ronald Fisher in 1936 and can be used as an example for discriminant analysis. The dataset contains 50 samples from each of three iris species (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total, with 4 features for each sample (the length and width of the sepals and the length and width of the petals, in centimeters). Fisher used the dataset to develop a linear discriminant model to identify the species of a flower.
Since Fisher's linear discriminant model, this dataset has become a typical test case for many classification techniques in machine learning.


The classification problem we want to solve is this: when we see a new iris flower, can we correctly predict its species from the measurements above?
We use labeled data to design a rule and then apply it to other samples to make predictions. This is a basic supervised learning problem (a classification problem).
Because the iris dataset has few samples and few dimensions, it is easy to visualize and manipulate.

Data visualization

Scikit-learn comes with some classic datasets, such as the iris and digits datasets for classification and the Boston house prices dataset for regression analysis (the Boston dataset was removed in scikit-learn 1.2).
You can load data as follows:

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

The dataset is a dictionary-like structure: the data is stored in the .data member and the output labels are stored in the .target member.
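As a quick check of the structure just described, the following minimal sketch loads the iris data and inspects its members. The printed items are our own choice for illustration; both attribute access and dictionary-style access work on the object returned by load_iris().

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()

# One row per flower, one column per measurement
print(iris.data.shape)            # (150, 4)
# Labels are the integers 0, 1 and 2; count the flowers in each class
print(np.bincount(iris.target))   # [50 50 50]
# Dictionary-style access returns the same members
print(iris["feature_names"])
print(iris["target_names"])
```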

Drawing a scatter plot of any two dimensions

A scatter plot of any two dimensions can be drawn as follows; the first subplot, for example, uses sepal length as the first dimension and sepal width as the second:

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()
irisFeatures = iris["data"]
irisFeaturesName = iris["feature_names"]
irisLabels = iris["target"]

def scatter_plot(dim1, dim2):
    for t, marker, color in zip(range(3), ">ox", "rgb"):
        # zip() accepts any number of sequences and yields one tuple per position
        # Plot the two chosen dimensions of each iris species separately,
        # so that every class gets its own marker and color
        plt.scatter(irisFeatures[irisLabels == t, dim1],
                    irisFeatures[irisLabels == t, dim2],
                    marker=marker, c=color)
    dim_meaning = {0: 'sepal length', 1: 'sepal width', 2: 'petal length', 3: 'petal width'}
    plt.xlabel(dim_meaning.get(dim1))
    plt.ylabel(dim_meaning.get(dim2))

# One subplot for each of the six pairs of features
plt.subplot(231)
scatter_plot(0, 1)
plt.subplot(232)
scatter_plot(0, 2)
plt.subplot(233)
scatter_plot(0, 3)
plt.subplot(234)
scatter_plot(1, 2)
plt.subplot(235)
scatter_plot(1, 3)
plt.subplot(236)
scatter_plot(2, 3)
plt.show()

Result: a 2 x 3 grid of scatter plots, one for each pair of features.


Building a classification model: classifying by a threshold on a single dimension

If our goal is to distinguish between these three kinds of flowers, we can make some assumptions. For example, petal length seems to separate iris setosa from the other two species. We can write a little code to see where the boundary for this feature lies:

petalLength = irisFeatures[:, 2]   # select the third column, since the features array is 150 x 4
isSetosa = (irisLabels == 0)       # label 0 means iris setosa
maxSetosaPlength = petalLength[isSetosa].max()
minNonSetosaPlength = petalLength[~isSetosa].min()
print('Maximum of setosa: {0}'.format(maxSetosaPlength))
print('Minimum of others: {0}'.format(minNonSetosaPlength))

'''
The result is:
Maximum of setosa: 1.9
Minimum of others: 3.0
'''

Based on this result we can build a simple classification model: if the petal length is less than 2, the flower is an iris setosa; otherwise it is one of the other two species.
The structure of this model is very simple: it is determined by a threshold on one dimension of the data, and we find the best threshold for that dimension by experiment.
The example above easily separates iris setosa from the other two species, but we cannot immediately determine the best threshold for separating iris virginica from iris versicolor. In fact, we find that no threshold on a single dimension separates these two classes perfectly.
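The petal-length rule for iris setosa can be written as a one-line classifier. The sketch below is only an illustration: the function name predict_setosa and the default threshold of 2.0 are our own choices, taken from the rule stated above.

```python
from sklearn import datasets

iris = datasets.load_iris()
petal_length = iris.data[:, 2]
is_setosa = (iris.target == 0)

def predict_setosa(petal_length_cm, threshold=2.0):
    """Return True where the petal is shorter than the threshold (hypothetical helper)."""
    return petal_length_cm < threshold

predicted = predict_setosa(petal_length)
accuracy = (predicted == is_setosa).mean()
print('Accuracy of the setosa rule: {0}'.format(accuracy))
```

Because the longest setosa petal is 1.9 cm and the shortest petal of the other species is 3.0 cm, any threshold between those two values separates setosa from the rest on this dataset.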

Comparing accuracy to find the threshold

Let's first select the flowers that are not setosa.

irisFeatures = irisFeatures[~isSetosa]
labels = irisLabels[~isSetosa]
isVirginica = (labels == 2)    # label 2 means iris virginica

Here we rely heavily on NumPy array operations. isSetosa is a Boolean array that we can use to select the non-setosa flowers. Finally, we also construct a new Boolean array, isVirginica.
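For readers new to Boolean arrays, here is a small standalone sketch of the NumPy operations used above. The toy values and labels are made up purely for illustration.

```python
import numpy as np

values = np.array([1.4, 4.7, 1.3, 6.0, 5.1])   # made-up measurements
labels = np.array([0, 1, 0, 2, 2])             # made-up class labels

is_zero = (labels == 0)          # element-wise comparison gives a Boolean array
print(values[is_zero])           # keeps only the entries where the mask is True
print(values[~is_zero])          # ~ inverts the mask
print((labels == 2).mean())      # mean of a Boolean array is the fraction of True
```

The last line is the same trick the accuracy computation relies on: comparing predictions with labels gives a Boolean array, and its mean is the share of correct predictions.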
Next, we write a small loop over the features of each dimension to see which threshold gives the best accuracy.

# Search for the threshold between virginica and versicolor
# Filter from the original arrays so that this block can be rerun safely
irisFeatures = iris["data"][~isSetosa]
labels = irisLabels[~isSetosa]
isVirginica = (labels == 2)    # label 2 means iris virginica

bestAccuracy = -1.0
for fi in range(irisFeatures.shape[1]):
    # Every observed value of this feature is a candidate threshold
    thresh = irisFeatures[:, fi].copy()
    thresh.sort()
    for t in thresh:
        pred = (irisFeatures[:, fi] > t)
        acc = (pred == isVirginica).mean()
        if acc > bestAccuracy:
            bestAccuracy = acc
            bestFeatureIndex = fi
            bestThreshold = t

print('Best accuracy:\t\t', bestAccuracy)
print('Best feature index:\t', bestFeatureIndex)
print('Best threshold:\t\t', bestThreshold)

'''
Final result:
Best accuracy:       0.94
Best feature index:  3
Best threshold:      1.6
'''

Here we first sort the values of each dimension and then take each value in turn as a candidate threshold. For each candidate we compare the Boolean sequence it predicts with the Boolean sequence of the actual labels and take the mean of the matches, which is the accuracy. After all the loops have finished, we have the best threshold and the dimension it belongs to.
In the end, the best model uses the fourth dimension, petal width, and this gives us the decision boundary.
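To make the decision boundary concrete, the sketch below applies the reported rule (petal width greater than 1.6 means iris virginica) to the non-setosa flowers and counts the mistakes. The constant names are our own; the feature index and threshold are the values reported by the search above.

```python
from sklearn import datasets

iris = datasets.load_iris()
keep = (iris.target != 0)                  # drop iris setosa
features = iris.data[keep]
is_virginica = (iris.target[keep] == 2)

PETAL_WIDTH = 3     # index of the fourth feature
THRESHOLD = 1.6     # threshold reported by the search above

predicted = features[:, PETAL_WIDTH] > THRESHOLD
accuracy = (predicted == is_virginica).mean()
errors = int((predicted != is_virginica).sum())
print('Accuracy: {0}, misclassified flowers: {1}'.format(accuracy, errors))
```

The misclassified flowers are the ones on the wrong side of the boundary: versicolor flowers with unusually wide petals and virginica flowers with unusually narrow ones.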

Evaluating the model: cross-validation

Above we obtained a simple model that achieves 94% accuracy on the training data, but this estimate may be overly optimistic, because the model parameters were tuned on that same data.
What we really need is to evaluate how well the model generalizes to new data, so we should hold out some of the data for a more rigorous evaluation instead of testing on the training data. To do this, we keep a subset of the data aside for cross-validation.
This gives us a training error and a test error. With a complex model, the training accuracy may well be 100% while the test result is only slightly better than random guessing.
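A minimal sketch of such a hold-out evaluation follows. It is an illustration under our own assumptions, not part of the original experiment: the helper names fit_threshold and predict, the 70/30 split and the random seed are all hypothetical choices, and the split uses scikit-learn's train_test_split.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def fit_threshold(features, labels):
    """Try every feature and every observed value; keep the most accurate threshold."""
    best = {'accuracy': -1.0, 'dim': None, 'thresh': None}
    for dim in range(features.shape[1]):
        for t in np.unique(features[:, dim]):
            acc = ((features[:, dim] > t) == labels).mean()
            if acc > best['accuracy']:
                best = {'accuracy': acc, 'dim': dim, 'thresh': t}
    return best

def predict(model, features):
    """Apply a learned single-feature threshold."""
    return features[:, model['dim']] > model['thresh']

iris = datasets.load_iris()
keep = (iris.target != 0)
X = iris.data[keep]
y = (iris.target[keep] == 2)

# Hold out 30% of the flowers; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = fit_threshold(X_train, y_train)
train_accuracy = model['accuracy']
test_accuracy = (predict(model, X_test) == y_test).mean()
print('Training accuracy: {0:.2f}'.format(train_accuracy))
print('Test accuracy:     {0:.2f}'.format(test_accuracy))
```

The gap between the two printed numbers is what the training accuracy alone hides. It depends on the particular split, which is one reason to average over several splits.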

Cross-validation

In many practical applications there is not enough data. To choose a better model, cross-validation can be used. The basic idea of cross-validation is to reuse the data: split the given data, combine the pieces into training sets and test sets, and repeatedly train, test and select models on that basis.

S-fold cross-validation

The most commonly used method is S-fold cross-validation. First, the data is randomly divided into S disjoint subsets of equal size. Then a model is trained on the data of S-1 subsets and tested on the remaining subset. This process is repeated for each of the S possible choices of test subset, and finally the model with the smallest average test error over the S evaluations is selected.
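The partitioning step can be sketched with plain NumPy. The generator below is a hypothetical helper of our own (s_fold_indices is not a library function): it shuffles the sample indices, cuts them into S disjoint pieces, and uses each piece once as the test set.

```python
import numpy as np

def s_fold_indices(n_samples, s, seed=0):
    """Yield (train_idx, test_idx) pairs for S-fold cross-validation."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, s)    # s disjoint subsets of near-equal size
    for i in range(s):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(s) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in s_fold_indices(n_samples=10, s=5):
    print('train:', np.sort(train_idx), 'test:', np.sort(test_idx))
```

Every sample appears in exactly one test set and in S-1 training sets, which is what lets all the data be used for both training and testing.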


For example, if we divide the dataset into 5 parts, we have 5-fold cross-validation. We then train one model per fold, each time holding out 20% of the data for testing.
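As a rough sketch of 5-fold cross-validation, we can use scikit-learn's own helpers. Note the substitution: instead of the hand-written threshold search, this example uses a depth-one decision tree (a decision stump), which is also a single-feature threshold model but picks its threshold by an impurity criterion rather than by accuracy. The random seeds are arbitrary.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
keep = (iris.target != 0)                # virginica versus versicolor only
X = iris.data[keep]
y = (iris.target[keep] == 2)

# A decision stump: one feature, one threshold (stand-in for the model above)
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
# Shuffle before splitting, because the labels in the dataset are sorted
folds = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(stump, X, y, cv=folds)
print('Accuracy per fold:', scores)
print('Mean accuracy: {0:.2f}'.format(scores.mean()))
```

Each of the five scores comes from a model trained on 80 flowers and tested on the 20 it did not see.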

Leave-one-out cross-validation

Leave-one-out cross-validation is a special case of S-fold cross-validation in which S equals the number of samples in the dataset.
We take one sample out of the training data, learn a model from the remaining data, and then check whether the model classifies the held-out sample correctly.

def learn_model(features, labels):
    bestAccuracy = -1.0
    for fi in range(features.shape[1]):
        thresh = features[:, fi].copy()
        thresh.sort()
        for t in thresh:
            pred = (features[:, fi] > t)
            acc = (pred == labels).mean()
            if acc > bestAccuracy:
                bestAccuracy = acc
                bestFeatureIndex = fi
                bestThreshold = t
    # Uncomment to see the model learned on each call:
    # print('Best accuracy:\t\t', bestAccuracy)
    # print('Best feature index:\t', bestFeatureIndex)
    # print('Best threshold:\t\t', bestThreshold)
    return {'dim': bestFeatureIndex, 'thresh': bestThreshold, 'accuracy': bestAccuracy}

def apply_model(features, labels, model):
    # labels is unused; it is kept so that the call signature matches learn_model
    prediction = (features[:, model['dim']] > model['thresh'])
    return prediction

# ----------- cross validation -------------
error = 0.0
for ei in range(len(irisFeatures)):
    # select all but the one at position 'ei':
    training = np.ones(len(irisFeatures), bool)
    training[ei] = False
    testing = ~training
    model = learn_model(irisFeatures[training], isVirginica[training])
    predictions = apply_model(irisFeatures[testing], isVirginica[testing], model)
    error += np.sum(predictions != isVirginica[testing])

In the procedure above, every sample is tested by a model that was trained without it, so the final error estimate reflects the generalization ability of the model.
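To turn the accumulated error count into an accuracy figure, one option is the standalone sketch below. It repeats the same idea with scikit-learn's LeaveOneOut splitter in place of the hand-built Boolean masks; fit_threshold is a hypothetical helper of our own that mirrors the threshold search, and the variable names are assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut

def fit_threshold(features, labels):
    """Return (dim, thresh) of the single-feature threshold with the best training accuracy."""
    best_acc, best_dim, best_thresh = -1.0, None, None
    for dim in range(features.shape[1]):
        for t in np.unique(features[:, dim]):
            acc = ((features[:, dim] > t) == labels).mean()
            if acc > best_acc:
                best_acc, best_dim, best_thresh = acc, dim, t
    return best_dim, best_thresh

iris = datasets.load_iris()
keep = (iris.target != 0)
X = iris.data[keep]
y = (iris.target[keep] == 2)

errors = 0
n_tested = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    # Learn on all flowers except one, then test on the one left out
    dim, thresh = fit_threshold(X[train_idx], y[train_idx])
    prediction = X[test_idx, dim] > thresh
    errors += int(np.sum(prediction != y[test_idx]))
    n_tested += len(test_idx)

loo_accuracy = 1.0 - errors / n_tested
print('Leave-one-out errors: {0} of {1}'.format(errors, n_tested))
print('Leave-one-out accuracy: {0:.2f}'.format(loo_accuracy))
```

This accuracy can be compared with the training accuracy found earlier to see how optimistic that figure was.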

Summary

When partitioning the dataset as above, we need to pay attention to keeping the data balanced. If all the data in a subset comes from one class, the result is not representative.
Building on the discussion above, we trained a simple model and used cross-validation to estimate its generalization ability.
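The balance problem is easy to reproduce. In the sketch below the labels are sorted by class, as they are in the iris data, and we compare an unshuffled KFold split with StratifiedKFold, which keeps the class proportions in every fold. The toy labels and the helper fold_class_counts are our own, for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 50 samples of class 0 followed by 50 of class 1, like sorted iris labels
y = np.array([0] * 50 + [1] * 50)
X = np.zeros((len(y), 1))    # placeholder features; only the labels matter here

def fold_class_counts(splitter):
    """Count how many samples of each class land in every test fold."""
    return [np.bincount(y[test], minlength=2).tolist()
            for _, test in splitter.split(X, y)]

print(fold_class_counts(KFold(n_splits=2)))             # [[50, 0], [0, 50]]
print(fold_class_counts(StratifiedKFold(n_splits=2)))   # [[25, 25], [25, 25]]
```

With the unshuffled split, each model is trained on one class only and tested on the other, so its score says nothing useful. Shuffling or stratifying avoids this.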

References

Wikipedia: Iris flower data set
Building Machine Learning Systems with Python

When reprinting, please credit the author, Jason Ding, and the source
GitHub home page (http://jasonding1354.github.io/)
CSDN blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)