1. Scikit-learn Introduction
Scikit-learn is an open-source machine learning library for Python, built on top of the NumPy, SciPy, and matplotlib modules. It is worth mentioning that Scikit-learn was started by David Cournapeau in 2007 as a Google Summer of Code project; since then the project has attracted many contributors and has been maintained by a team of volunteers.
Scikit-learn's biggest strength is that it exposes a wide range of machine learning algorithms through a consistent interface, allowing users to perform data mining and data analysis easily and efficiently.
Scikit-learn Home: Scikit-learn homepage
2. Scikit-learn installation
There are many ways to install Scikit-learn, and it runs on all mainstream operating systems. The Scikit-learn homepage describes in detail the three installation methods for the different operating systems; for the specific installation steps, please see Installing Scikit-learn.
Here we first recommend a powerful development environment for learning Python: Python(x,y). Python(x,y) is a Python-based scientific computing distribution that bundles the Eclipse integrated development environment with the PyDev Python plug-in, the interactive data editing and visualization tool Spyder, the fundamental numerical library NumPy and the advanced math library SciPy, the 3D visualization toolkit Mayavi, the GUI library PyQt, and SWIG for mixing Python with C/C++ code. In addition, Python(x,y) ships with a complete set of help documents, which is very convenient for researchers.
For students who, like the author, are used to doing research with MATLAB, Python(x,y) is a great choice for learning Python: the bundled Spyder provides a MATLAB-like interactive interface that is easy to pick up. Python(x,y) can be downloaded here: Python(x,y) download.
Since Scikit-learn is built on the NumPy, SciPy, and matplotlib modules, installing those three modules first can be cumbersome. However, if you have installed Python(x,y) in advance as described above, it already contains these modules, and you only need to download the Scikit-learn release that matches your setup and install it directly.
Scikit-learn various versions download: Scikit-learn download.
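If you are not using Python(x,y), the dependencies and Scikit-learn itself can also be installed with pip. A minimal sketch, assuming Python and pip are already set up on your system:

pip install -U numpy scipy matplotlib
pip install -U scikit-learn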
3. Scikit-learn Data Sets
Scikit-learn ships with several commonly used machine learning datasets, such as the iris and digits datasets for classification and the classic Boston house prices dataset for regression.
An example of loading a dataset with Scikit-learn:
from sklearn import datasets
iris = datasets.load_iris()
A dataset loaded by Scikit-learn is stored in a dictionary-like object that holds all the data (and some metadata) about it. The feature values are uniformly stored in the .data member; for example, to display the iris data we only need to print the data member of iris:
print(iris.data)
The data are stored and displayed as an (n_samples, n_features) matrix; each instance of the iris data has 4 features: sepal length, sepal width, petal length, and petal width. Displaying the iris data gives:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
... ...
[ 5.9 3. 5.1 1.8]]
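Besides .data, the dataset object exposes a few other useful attributes. A short sketch, continuing from the iris object loaded above, that inspects the shape of the data matrix and the feature names:

print(iris.data.shape)     # (150, 4): 150 samples with 4 features each
print(iris.feature_names)  # names of the 4 features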
For supervised learning, such as classification problems, the dataset also contains the corresponding class labels, which are stored in the .target member:
print(iris.target)
For the iris data, this is the class label of each instance:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
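The integer labels 0, 1, and 2 stand for the three iris species. A short sketch, continuing from the iris object above, that maps them back to readable names via the standard target_names attribute:

print(iris.target_names)                   # ['setosa' 'versicolor' 'virginica']
print(iris.target_names[iris.target[:5]])  # species names of the first 5 samples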
4. Scikit-learn Learning and Forecasting
Scikit-learn provides a uniform interface to its machine learning algorithms, which makes them easy to use. Calling an algorithm is like using a black box: as users, we only need to set the appropriate parameters according to our own needs.
For example, call the most commonly used support vector classifier (SVC):
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)  # do not use the default parameters; use values supplied by the user
print(clf)
Specific information and parameters for the classifier:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
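The same settings can also be read back programmatically with the estimator's get_params method, which every Scikit-learn estimator provides. A short sketch, continuing from the clf object above:

params = clf.get_params()
print(params['gamma'], params['C'])  # 0.001 100.0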
Classifier learning and prediction are then performed with the fit(X, y) and predict(T) methods, respectively.
For example, we can split the digits data into a training set and a test set, taking the first n-1 samples as the training set and the last sample as the test set (this is only meant to illustrate the use of the fit and predict functions). We then use fit and predict to perform learning and prediction, respectively; the code is as follows:
from sklearn import datasets
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
digits = datasets.load_digits()
clf.fit(digits.data[:-1], digits.target[:-1])
result = clf.predict(digits.data[-1:])  # the slice keeps the 2-D (1, n_features) shape expected by predict
print(result)
The predicted result is: [8]
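As a quick sanity check, we can also print the true label of the held-out sample and compare it with the prediction. A one-line sketch, continuing from the digits object above:

print(digits.target[-1])  # true label of the last sample, which is 8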
To get a rough idea of the classification quality, we can use the program below to display what the handwritten instance in the test set looks like; the code and result are as follows:
import matplotlib.pyplot as plot
plot.figure(1, figsize=(3, 3))
plot.imshow(digits.images[-1], cmap=plot.cm.gray_r, interpolation='nearest')
plot.show()
The last handwriting instance is:
We can see that this is a handwritten digit "8", and the correct label is indeed "8". In this simple example we have only learned how to use Scikit-learn for a classification problem; real problems are of course much more complex. (PS: Learning is gradual: understand one example, then a second, ..., then the n-th, and eventually you will form your own knowledge and be able to handle the various complex problems you encounter.)
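In practice, instead of holding out a single sample, the data are usually split into a larger training set and test set, and accuracy is measured on the whole test set. A minimal sketch of this more typical workflow, assuming a recent Scikit-learn where train_test_split lives in sklearn.model_selection:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out test set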
Finally, here is a complete Scikit-learn program for digit classification (handwriting recognition) by Gael Varoquaux. After working through this program, you should have a good feel for the Scikit-learn machine learning library.
import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# The digits dataset
digits = datasets.load_digits()

# The data that we are interested in is made of 8x8 images of digits. Let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset. If we were working from image files, we could load them using
# pylab.imread. Note that each image must have the same size. For these
# images, we know which digit they represent: it is given in the 'target' of
# the dataset.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)

# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

plt.show()
Output Result:
Classification report for classifier SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False):
             precision    recall  f1-score   support

          0       1.00      0.99      0.99        88
          1       0.99      0.97      0.98        91
          2       0.99      0.99      0.99        86
          3       0.98      0.87      0.92        91
          4       0.99      0.96      0.97        92
          5       0.95      0.97      0.96        91
          6       0.99      0.99      0.99        91
          7       0.96      0.99      0.97        89
          8       0.94      1.00      0.97        88
          9       0.93      0.98      0.95        92

avg / total       0.97      0.97      0.97       899

Confusion matrix:
[[87 0 0 0 1 0 0 0 0 0]
[ 0 88 1 0 0 0 0 0 1 1]
[ 0 0 85 1 0 0 0 0 0 0]
[ 0 0 0 79 0 3 0 4 5 0]
[ 0 0 0 0 88 0 0 0 0 4]
[ 0 0 0 0 0 88 1 0 0 2]
[ 0 1 0 0 0 0 90 0 0 0]
[ 0 0 0 0 0 1 0 88 0 0]
[ 0 0 0 0 0 0 0 0 88 0]
[ 0 0 0 1 0 1 0 0 0 90]]
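To condense the confusion matrix into a single number, its diagonal (the correctly classified samples) can be divided by the total number of samples. A short sketch, continuing from the expected and predicted variables of the script above and using plain NumPy:

import numpy as np

cm = metrics.confusion_matrix(expected, predicted)
print(np.trace(cm) / np.sum(cm))  # overall accuracy, about 0.97 for the report above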
5. Summary
1) Introduced Scikit-learn and how to install it;
2) Gained a general understanding of Scikit-learn, and can now try using it for data mining and data analysis.
6. References
[1] An introduction to machine learning with Scikit-learn
[2] Machine Learning in Action