KNN (K Nearest Neighbor) for Machine Learning Based on the scikit-learn Package: A Complete Example
Scikit-learn (sklearn) is currently the most popular and powerful Python library for machine learning. It supports a wide range of classification, clustering, and regression methods, such as support vector machines, random forests, and DBSCAN.
It is popular with many data science practitioners and is one of the industry's best-known open-source projects.
Building on the k-nearest-neighbor principle described in the previous article, this article uses the corresponding toolkit to implement the algorithm with machine learning in mind. To master such a successful toolkit step by step, starting with simple KNN is a good idea, and scikit-learn is a good choice for it.
Install sklearn
First, we need to install the sklearn library (you can run all the experiments on this page first and try again on your own machine later, after installing sklearn). Sklearn is a Python extension library, so a Python runtime environment must be set up first. Sklearn is also built on top of NumPy and
SciPy, so those libraries must be installed beforehand. After that, we can use pip
or conda
to install sklearn automatically, as follows:
# A recent version of NumPy and SciPy must be installed before sklearn.
# Install sklearn with pip:
pip install -U scikit-learn
# Or install sklearn with conda:
conda install scikit-learn
After sklearn is installed, its data and functions can be used in any Python script.
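A minimal check that the installation worked is to import the libraries and print their versions (the exact version numbers will of course vary by environment):
# a quick sanity check after installation; the printed versions depend on your setup
import numpy
import scipy
import sklearn
print(numpy.__version__, scipy.__version__, sklearn.__version__)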
Sklearn built-in datasets
Data is the key to machine learning. In machine learning we spend a lot of time collecting and organizing data, and reasonable, scientific data is the key to good machine-learning performance. A typical machine-learning workflow for a classification problem needs four pieces of data (a sketch producing all four follows this list):
- Training data, usually denoted train
- The class labels of the training data, usually denoted target
- Test data, usually denoted test
- The true class labels of the test data, used to evaluate the classifier's performance, usually denoted expected
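As a minimal sketch of how these four pieces are usually produced, sklearn's train_test_split helper from sklearn.model_selection can split a dataset and its labels in one call (the variable names simply mirror the list above):
# split the iris data into the four pieces named above; test_size=10 holds out 10 samples
from sklearn import datasets
from sklearn.model_selection import train_test_split
iris = datasets.load_iris()
train, test, target, expected = train_test_split(iris.data, iris.target, test_size=10, random_state=0)
print(train.shape, test.shape)   # (140, 4) (10, 4)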
To make it easy to learn and experiment with the various parts of machine learning, sklearn ships with a variety of useful built-in datasets, covering problems such as text processing and image recognition, which is very friendly for beginners.
The Iris dataset used for KNN in this article is also available from the sklearn.datasets module.
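For example, a short sketch of loading it from that module and inspecting its contents:
from sklearn import datasets
iris = datasets.load_iris()   # a struct-like Bunch object holding the data and the labels
print(iris.data.shape)        # (150, 4): 150 samples with 4 attributes each
print(iris.target.shape)      # (150,): one class label per sample
print(iris.target_names)      # the three iris species names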
KNN algorithm implementation
Without further ado, here is the code; the explanation comes afterwards.
# -*- coding: utf-8 -*-
from sklearn import datasets                        # import the built-in dataset module
from sklearn.neighbors import KNeighborsClassifier  # import the KNN class from sklearn.neighbors
import numpy as np

np.random.seed(0)  # set the random seed; if it is not set, the system time is used by default,
                   # so each call to the random module gives different numbers. Fixing the seed
                   # makes every run produce the same shuffle.
iris = datasets.load_iris()  # load the iris dataset; iris is a struct-like object holding the
                             # sample data and, since this is supervised learning, the label data
iris_x = iris.data    # sample data: a 150x4 two-dimensional array, i.e. 150 samples with four
                      # attributes each (the lengths and widths of the sepals and petals)
iris_y = iris.target  # labels of the sample data, an array of length 150
indices = np.random.permutation(len(iris_x))  # permutation takes a number (here 150) and returns
                                              # a randomly shuffled one-dimensional array of
                                              # indices; it can also take a one-dimensional array
                                              # and shuffle it directly
iris_x_train = iris_x[indices[:-10]]  # 140 randomly chosen samples as the training set
iris_y_train = iris_y[indices[:-10]]  # the labels of those 140 samples as the training labels
iris_x_test = iris_x[indices[-10:]]   # the remaining 10 samples as the test set
iris_y_test = iris_y[indices[-10:]]   # the labels of the remaining 10 samples as the test labels
knn = KNeighborsClassifier()          # define a knn classifier object
knn.fit(iris_x_train, iris_y_train)   # call its training method with two arguments:
                                      # the training set and its labels
iris_y_predict = knn.predict(iris_x_test)   # call its prediction method; the main argument
                                            # is the test set
probility = knn.predict_proba(iris_x_test)  # probability-based prediction for each test sample
neighborpoint = knn.kneighbors(iris_x_test[-1:], 5, False)  # find the five training samples
                                                            # nearest to the last test sample;
                                                            # the result is an array of indices
score = knn.score(iris_x_test, iris_y_test, sample_weight=None)  # call its scoring method
                                                                 # to compute the accuracy
print('iris_y_predict = ')
print(iris_y_predict)      # output the predicted labels
print('iris_y_test = ')
print(iris_y_test)         # output the true labels of the test set for comparison
print('Accuracy:', score)  # output the accuracy
print('neighborpoint of last test sample:', neighborpoint)
print('probility:', probility)
***********************************
Result output:
iris_y_predict =
[1 2 1 0 0 0 2 1 2 0]
iris_y_test =
[1 1 1 0 0 0 2 1 2 0]
Accuracy: 0.9
neighborpoint of last test sample: [[ 75  41  96  78 123]]
probility: [[0.  1.  0. ]
 [0.  0.4 0.6]
 [0.  1.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [1.  0.  0. ]
 [0.  0.  1. ]
 [0.  1.  0. ]
 [0.  0.  1. ]
 [1.  0.  0. ]]
With this, the implementation of the KNN algorithm is complete. Still, the two key methods of the classifier object deserve a brief explanation: fit() and predict().
Each mainly takes one or two parameters; that does not mean they accept only those, but that all other parameters have default values and are internal. Let me explain a little.
KNeighborsClassifier is a class that inherits from NeighborsBase, KNeighborsMixin, SupervisedIntegerMixin, and ClassifierMixin. For now
that does not matter much; what matters are its main methods, some of which are inherited from the parent classes.
__init__(): the initialization function (constructor). It has the following parameters (a sketch that sets several of them explicitly follows the list):
- n_neighbors=5: int parameter. The number of nearest-neighbor samples that get to vote in the KNN algorithm. The default is 5.
- weights='uniform': str or callable parameter. How each voting sample is weighted: 'uniform' means every vote counts equally, 'distance' weights votes by the inverse of the distance, and a callable is a user-defined function that receives an array of distances and returns an array of weights. The default is 'uniform'.
- algorithm='auto': str parameter. The algorithm used internally, chosen from 'ball_tree' (ball tree), 'kd_tree' (KD tree), 'brute' (brute-force search), and 'auto' (automatically pick an appropriate algorithm based on the type and structure of the data). The default is 'auto'. Brute-force search needs no explanation. The two tree structures suit different situations: a KD tree cuts along the K coordinate axes in turn, splitting at the median, so each node is a hyper-rectangle; it is most efficient when the dimension is below about 20 (see chapter 2 of "Statistical Learning Methods" for details). The ball tree was invented to overcome the failure of KD trees in high dimensions; it is built by splitting the sample space with a center C and a radius r, so each node is a hyper-sphere. In general, kd_tree is fast for low-dimensional data while ball_tree is slower; for high-dimensional data above roughly 20 dimensions kd_tree performs poorly and ball_tree does better. Readers interested in the construction process and the theory of the trade-offs can study it further.
- leaf_size=30: int parameter. Given the tree algorithms above, this is the leaf size of the kd_tree or ball_tree. Different leaf sizes affect the speed of building and querying the tree, as well as the memory needed to store it. The optimal size depends on the problem.
- metric='minkowski': str or distance-metric object. How distances are measured. The default is the Minkowski distance, which is not a single distance measure but a generalization of several; the specific metric is determined by the value of the parameter p below.
- p=2: int parameter. The power parameter of the Minkowski distance above. The default of 2 gives the Euclidean distance; p=1 gives the Manhattan distance, and so on.
- metric_params=None: additional keyword arguments for the distance-metric function. Usually left empty; the default is None.
- n_jobs=1: int parameter. The number of parallel jobs used for the computation. The default of 1 means a single thread; -1 means one job per CPU core; other values can also be specified. It affects only speed, not results, so you can usually ignore it; if you do need it, look into parallel computation.
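As a sketch, the constructor call below sets several of these parameters explicitly; the values are chosen only to illustrate the options, not as recommendations:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(
    n_neighbors=5,        # five nearest neighbors vote
    weights='distance',   # weight each vote by the inverse of its distance
    algorithm='kd_tree',  # force a KD tree (fine for the 4-dimensional iris data)
    leaf_size=30,         # leaf size of the tree
    metric='minkowski',
    p=2,                  # Minkowski distance with p=2, i.e. the Euclidean distance
    n_jobs=1)             # single-threaded neighbor search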
fit(): the training function, and the main one. It takes the training data (each row a sample, each column an attribute) and the corresponding label array, and returns the object itself;
that is, it only modifies the object's internal state, so the return value can be ignored and the same object then used for prediction with the training result. In fact, this function
is not defined on the KNeighborsClassifier class itself but is inherited from its parent class SupervisedIntegerMixin.
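Because fit() returns the object itself, the call can be chained directly into prediction; a minimal sketch reusing the iris arrays defined earlier:
# fit() returns the classifier, so training and prediction can be written in one chain
knn = KNeighborsClassifier().fit(iris_x_train, iris_y_train)
iris_y_predict = knn.predict(iris_x_test)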
predict(): the prediction function. It takes an array of test samples, generally a two-dimensional array with one row per sample and one column per attribute.
It returns the prediction result as an array. If each sample has a single output, the result is a one-dimensional array; if each sample's output is multidimensional, the result is a two-dimensional
array, one row per sample and one column per output dimension.
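Note that the input must be two-dimensional even for a single sample; a small sketch using the classifier trained above:
# predict() expects one row per sample; iris_x_test[-1:] keeps the 2D shape (1, 4),
# whereas iris_x_test[-1] would be a 1D array
print(knn.predict(iris_x_test[-1:]))   # a length-1 array with the predicted label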
predict_proba(): a probability-based soft decision. It is also a prediction function, but instead of a single output value per sample it gives the probability of each possible output value.
The input is the same as for predict().
The return value is also similar, except that the values are replaced with probabilities. For example, if the possible outputs are 0 and 1, predict() returns a one-dimensional array of length n,
giving each sample's output in turn, while predict_proba() returns an n x 2 two-dimensional array: each row is a sample,
and the two numbers in the row are the probability of that sample's output being 0 and the probability of it being 1. The possible values are arranged in sorted (lexicographic) order, here 0 then 1,
and likewise in other cases.
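The column order follows the sorted class labels, which are stored in the fitted classifier's classes_ attribute; a short sketch:
# the probability columns are ordered like knn.classes_ (the sorted class labels)
proba = knn.predict_proba(iris_x_test[:2])
print(knn.classes_)   # [0 1 2] for the iris labels
print(proba)          # one row per sample, one column per class; each row sums to 1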
score(): a function for computing the accuracy. It accepts three parameters. X: the input test samples, generally a two-dimensional array with one row per sample and one column per attribute.
y: the true labels of those samples, a one- or two-dimensional array. sample_weight=None: per-sample weights influencing the accuracy, with the same length as X;
the default is None.
The output is a float giving the accuracy.
Internally the computation is based on the result of predict().
In fact, this function is not defined on the KNeighborsClassifier class itself but is inherited from the ClassifierMixin parent class.
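In other words, score() is simply the fraction of correct predictions; a sketch of that equivalence:
import numpy as np
# score() equals the mean of (predicted label == true label) over the test samples
score = knn.score(iris_x_test, iris_y_test)
manual = np.mean(knn.predict(iris_x_test) == iris_y_test)
print(score, manual)   # both give the same accuracy, 0.9 in the run above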
kneighbors(): finds the nearest training samples of given query samples. It accepts three parameters. X=None: the samples whose nearest neighbors are to be found. n_neighbors=None:
the number of nearest neighbors to return; the default None means the value set on the classifier. return_distance=True: whether to also return
the distance values.
It returns the indices of the nearest neighbors within the training set (and the distances, if requested).
In fact, this function is not defined on the KNeighborsClassifier class itself but is inherited from the KNeighborsMixin parent class.
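A short sketch of both return modes, again reshaping the single test sample to two dimensions:
# return_distance=False gives only the indices of the nearest training samples;
# return_distance=True (the default) gives the distances as well
idx = knn.kneighbors(iris_x_test[-1:], 5, return_distance=False)
dist, idx = knn.kneighbors(iris_x_test[-1:], 5, return_distance=True)
print(idx)    # indices into the training set, one row per query sample
print(dist)   # the corresponding distances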
Finally, let's look at the following simple example (the example above is in fact already quite detailed):
Examples
--------
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=3)
>>> neigh.fit(X, y)
>>> print(neigh.predict([[1.1]]))
[0]
>>> print(neigh.predict_proba([[0.9]]))
[[ 0.66666667 0.33333333]]
That wraps up the explanation of the KNN algorithm in sklearn; I hope you have picked it up. Apart from the sklearn background material, this article is original work.
I hope everyone can learn and improve together, and other sklearn methods will be covered in future updates.