Original link: http://scikit-learn.github.io/dev/tutorial/basic/tutorial.html
Chapter Content
In this chapter, we mainly introduce the Scikit-learn machine learning library and walk through a short learning example.
Machine Learning: Problem setting
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry (also known as multivariate data), it is said to have several features.
We can divide learning problems into a few large categories:
- Supervised learning: In supervised learning, the data comes with additional attributes that we want to predict (Scikit-learn supervised learning links). This problem includes:
- Classification: Samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, where the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning, where one has a limited number of categories and, for each of the n samples provided, one tries to label it with the correct category or class.
- Regression: If the desired output consists of one or more continuous variables, the task is called regression. An example of a regression problem would be predicting the length of a salmon as a function of its age and weight (a minimal sketch appears after this list).
- Unsupervised learning: In unsupervised learning, the training data consists of a set of input vectors without any corresponding target values. The goal of such problems may be to discover groups of similar examples within the data, which is called clustering; to determine the distribution of data within the input space, known as density estimation; or to project the data from a high-dimensional space down to two or three dimensions, called data visualization. (Unsupervised learning links)
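To make the regression setting concrete, here is a minimal sketch. The salmon numbers below are fabricated toy values, purely for illustration:

# A minimal regression sketch; the data points are made-up toy numbers.
from sklearn.linear_model import LinearRegression

X = [[1, 2.0], [2, 3.5], [3, 5.0]]   # features: [age (years), weight (kg)]
y = [30.0, 45.0, 60.0]               # target: length (cm)

model = LinearRegression().fit(X, y)
print(model.predict([[2.5, 4.0]]))   # predicted length for an unseen salmon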
Training sets and test sets
Machine learning is about learning some properties of a dataset and then applying them to new data. This is why a common practice for evaluating an algorithm in machine learning is to split the data at hand into two sets: one called the training set, on which we learn the properties of the data, and one called the test set, on which we test those learned properties.
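A quick sketch of this practice (assuming scikit-learn >= 0.18, where train_test_split lives in sklearn.model_selection):

# Split the iris dataset into a training set and a test set.
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)   # roughly 75% / 25% of the 150 samples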
Loading a sample dataset
Scikit-learn comes with a few standard datasets, such as the iris and digits datasets for classification and the Boston house prices dataset for regression.
Below, we start a Python interpreter and then load the iris and digits datasets. In our notation, '$' denotes the shell prompt and '>>>' denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the .data member, which is an (n_samples, n_features) array. In the case of supervised problems, one or more target variables are stored in the .target member. More details about the different datasets can be found in the dedicated section.
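For instance, a quick way to inspect these members (a small sketch added here, not part of the original tutorial):

# Inspect the shape of the data and target arrays of the iris dataset.
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features each
print(iris.target.shape)    # (150,): one class label per sample
print(iris.target_names)    # the names behind the integer labels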
For example, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digit samples:
>>> print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ...,
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
And digits.target gives the ground truth for the digits dataset: that is, the number corresponding to each digit image that we are trying to learn.
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
Shape of the data arrays
The data is always a 2D array of shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using:
>>> digits.images[0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
The simple example on this dataset illustrates how, starting from the original problem, one can shape the data for consumption in Scikit-learn.
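As a small sketch of that reshaping (an illustration added here, not part of the original tutorial), digits.data is simply the (8, 8) images flattened into rows of 64 features:

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)
flattened = digits.images.reshape((n_samples, -1))   # shape (n_samples, 64)
print(np.array_equal(flattened, digits.data))        # True: same values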
Learning and Predicting
In the case of the digits dataset, the task is to predict, given an image of a handwritten digit, which digit it represents. We are given samples of each of the 10 possible classes (the digits 0 through 9), on which we fit an estimator so that it can predict the classes to which unseen samples belong.
In Scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
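To illustrate this convention only (a hypothetical toy class, not part of Scikit-learn), any object providing these two methods can act as a classifier:

import numpy as np

class MajorityClassifier:
    # Toy estimator: always predicts the most frequent training label.
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, T):
        return np.full(len(T), self.majority_)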
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator's constructor takes the model's parameters as arguments, but for the time being we will treat the estimator as a black box:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
Choosing model parameters
In this example, we set the value of gamma manually. It is possible to automatically find good values for the parameters by using tools such as grid search and cross-validation, as sketched below.
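A minimal sketch of such a search (assuming scikit-learn >= 0.18 for sklearn.model_selection; the grid values below are illustrative, not recommendations):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1, 10, 100]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(digits.data, digits.target)
print(search.best_params_)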
We call our estimator instance clf, as it is a classifier. It must now be fitted to the model; that is, it must learn from the data. This is done by passing our training set to the fit method. As a training set, we select all the samples of our dataset except the last one. We select this training set with the Python slicing syntax [:-1], which produces a new array containing all but the last sample of digits.data:
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
    random_state=None, shrinking=True, tol=0.001, verbose=False)
Now we can predict new values. In particular, we can ask the classifier what the digit of the last image in the digits dataset is, which we did not use to train the classifier:
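The prediction call itself appears to have been lost in extraction; reconstructed from the upstream tutorial, it looks like this:

>>> clf.predict(digits.data[-1:])
array([8])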
The corresponding image is as follows:
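The image itself did not survive extraction; here is a small sketch to reproduce it (assuming matplotlib is installed):

import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
# Show the last digit image in grayscale, as in the original figure.
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()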
As you can see, this is a challenging task: the resolution of the images is very low. Do you agree with the classifier?
A complete example of this classification problem is available for you to download, run, and learn from: Recognizing hand-written digits.
Model Persistence
You can save a model in Scikit-learn by using Python's built-in persistence module, namely pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
In the specific case of Scikit-learn, it may be more interesting to use joblib's replacement for pickle (joblib.dump and joblib.load), which is more efficient on big data, but which can only pickle to disk and not to a string:
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')
Later, you can load the pickled model back (possibly in another Python program):
>>> clf = joblib.load('filename.pkl')
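Note that sklearn.externals.joblib reflects the scikit-learn version this tutorial was written against; in recent releases (0.23 and later) the bundled copy was removed, and assuming a modern environment you would install joblib separately and import it directly:

import joblib
joblib.dump(clf, 'filename.pkl')
clf = joblib.load('filename.pkl')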
Conventions
Scikit-learn estimators follow certain rules that make their behavior more predictable.
Type casting
Unless otherwise specified, input will be cast to float64:
>>> import numpy as np
>>> from sklearn import random_projection
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
In this example, X is float32, and it is cast to float64 by fit_transform(X).
Regression targets are cast to float64, while classification targets are maintained:
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fitting. The second predict() returns a string array, since iris.target_names was used in fitting.
Supplementary
A handy Python IDE worth mentioning: winpython (winpython_2.7)