Original link: http://scikit-learn.github.io/dev/tutorial/basic/tutorial.html
Chapter Content
In this chapter, we mainly introduce the Scikit-learn machine learning library and walk through a short learning example.
Machine Learning: Problem setting
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number, for instance a multi-dimensional entry (also known as multivariate data), it is said to have several features.
We can divide learning problems into a few large categories:
- Supervised learning: In supervised learning, the data comes with additional attributes that we want to predict (Scikit-learn supervised learning links). This problem includes:
- Classification: Samples belong to two or more classes, and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem is handwritten digit recognition, where the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning, where one has a limited number of categories and, for each of the n samples provided, one tries to label it with the correct category or class.
- Regression: If the desired output consists of one or more continuous variables, the task is called regression. An example of a regression problem would be predicting the length of a salmon as a function of its age and weight (a minimal sketch appears after this list).
- Unsupervised learning: In unsupervised learning, the training data consists of a set of input vectors without any corresponding target values. The goal of such problems may be to discover groups of similar examples within the data, which is called clustering; to determine the distribution of data within the input space, known as density estimation; or to project the data from a high-dimensional space down to two or three dimensions, called data visualization. (Unsupervised learning links)
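To make the regression setting concrete, here is a minimal sketch. The salmon numbers below are fabricated toy values, purely for illustration:

# A minimal regression sketch; the data points are made-up toy numbers.
from sklearn.linear_model import LinearRegression

X = [[1, 2.0], [2, 3.5], [3, 5.0]]   # features: [age (years), weight (kg)]
y = [30.0, 45.0, 60.0]               # target: length (cm)

model = LinearRegression().fit(X, y)
print(model.predict([[2.5, 4.0]]))   # predicted length for an unseen salmon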
Training sets and test sets
Machine learning is about learning some properties of a dataset and then applying them to new data. This is why a common practice for evaluating an algorithm in machine learning is to split the data at hand into two sets: one called the training set, on which we learn the properties of the data, and one called the test set, on which we test those learned properties.
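A quick sketch of this practice (assuming scikit-learn >= 0.18, where train_test_split lives in sklearn.model_selection):

# Split the iris dataset into a training set and a test set.
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)   # roughly 75% / 25% of the 150 samples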
Loading a sample dataset
Scikit-learn comes with a few standard datasets, such as the iris and digits datasets for classification and the Boston house prices dataset for regression.
Below, we start a Python interpreter and then load the iris and digits datasets. In our notation, '$' denotes the shell prompt and '>>>' denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the .data member, which is an (n_samples, n_features) array. In the case of supervised problems, one or more target variables are stored in the .target member. More details about the different datasets can be found in the dedicated section.
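For instance, a quick way to inspect these members (a small sketch added here, not part of the original tutorial):

# Inspect the shape of the data and target arrays of the iris dataset.
from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features each
print(iris.target.shape)    # (150,): one class label per sample
print(iris.target_names)    # the names behind the integer labels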
For example, in the case of the digits dataset, digits.data gives access to the features that can be used to classify the digit samples:
>>> print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ...,
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
And digits.target gives the ground truth for the digits dataset: that is, the number corresponding to each digit image that we are trying to learn.
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
Shape of the data arrays
The data is always a 2D array of shape (n_samples, n_features), although the original data may have had a different shape. In the case of the digits, each original sample is an image of shape (8, 8) and can be accessed using:
>>> digits.images[0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
The simple example on this dataset illustrates how, starting from the original problem, one can shape the data for consumption in Scikit-learn.
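As a small sketch of that reshaping (an illustration added here, not part of the original tutorial), digits.data is simply the (8, 8) images flattened into rows of 64 features:

import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
n_samples = len(digits.images)
flattened = digits.images.reshape((n_samples, -1))   # shape (n_samples, 64)
print(np.array_equal(flattened, digits.data))        # True: same values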
Learning and Predicting
In the case of the digits dataset, the task is to predict, given an image of a handwritten digit, which digit it represents. We are given samples of each of the 10 possible classes (the digits 0 through 9), on which we fit an estimator so that it can predict the classes to which unseen samples belong.
In Scikit-learn, an estimator for classification is a Python object that implements the methods fit(X, y) and predict(T).
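To illustrate this convention only (a hypothetical toy class, not part of Scikit-learn), any object providing these two methods can act as a classifier:

import numpy as np

class MajorityClassifier:
    # Toy estimator: always predicts the most frequent training label.
    def fit(self, X, y):
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, T):
        return np.full(len(T), self.majority_)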
An example of an estimator is the class sklearn.svm.SVC, which implements support vector classification. The estimator's constructor takes the model's parameters as arguments, but for the time being we will treat the estimator as a black box:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
Choosing model parameters
In this example, we set the value of gamma manually. It is possible to automatically find good values for the parameters by using tools such as grid search and cross-validation, as sketched below.
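A minimal sketch of such a search (assuming scikit-learn >= 0.18 for sklearn.model_selection; the grid values below are illustrative, not recommendations):

from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1, 10, 100]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)   # 5-fold cross-validation
search.fit(digits.data, digits.target)
print(search.best_params_)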
We call our estimator instance clf, as it is a classifier. It must now be fitted to the model; that is, it must learn from the data. This is done by passing our training set to the fit method. As a training set, we select all the samples of our dataset except the last one. We select this training set with the Python slicing syntax [:-1], which produces a new array containing all but the last sample of digits.data:
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
    random_state=None, shrinking=True, tol=0.001, verbose=False)
Now we can predict new values. In particular, we can ask the classifier what the digit of the last image in the digits dataset is, which we did not use to train the classifier:
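The prediction call itself appears to have been lost in extraction; reconstructed from the upstream tutorial, it looks like this:

>>> clf.predict(digits.data[-1:])
array([8])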
The corresponding image is as follows:
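The image itself did not survive extraction; here is a small sketch to reproduce it (assuming matplotlib is installed):

import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()
# Show the last digit image in grayscale, as in the original figure.
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()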
As you can see, this is a challenging task: the resolution of the images is very low. Do you agree with the classifier?
A complete example of this classification problem is available for you to download, run, and learn from: Recognizing hand-written digits.
Model Persistence
You can save a model in Scikit-learn by using Python's built-in persistence module, namely pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
In the specific case of Scikit-learn, it may be more interesting to use joblib's replacement for pickle (joblib.dump and joblib.load), which is more efficient on big data, but which can only pickle to disk and not to a string:
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')
Later, you can load the pickled model back (possibly in another Python program):
>>> clf = joblib.load('filename.pkl')
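Note that sklearn.externals.joblib reflects the scikit-learn version this tutorial was written against; in recent releases (0.23 and later) the bundled copy was removed, and assuming a modern environment you would install joblib separately and import it directly:

import joblib
joblib.dump(clf, 'filename.pkl')
clf = joblib.load('filename.pkl')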
Conventions
Scikit-learn estimators follow certain rules that make their behavior more predictable.
Type casting
Unless otherwise specified, input will be cast to float64:
>>> import numpy as np
>>> from sklearn import random_projection
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(10, 2000)
>>> X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
In this example, X is float32, and it is cast to float64 by fit_transform(X).
Regression targets are cast to float64, while classification targets are maintained:
>>> from sklearn import datasets
>>> from sklearn.svm import SVC
>>> iris = datasets.load_iris()
>>> clf = SVC()
>>> clf.fit(iris.data, iris.target)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
>>> clf.fit(iris.data, iris.target_names[iris.target])
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
Here, the first predict() returns an integer array, since iris.target (an integer array) was used in fitting. The second predict() returns a string array, since iris.target_names was used in fitting.
Supplementary
A handy Python IDE worth mentioning: winpython (winpython_2.7)