Python Implementation of K-Nearest Neighbor Algorithm: Source Code Analysis

Source: Internet
Author: User

Many examples of the K-Nearest Neighbor (k-NN) algorithm can be found online. The Python versions mostly come from the introductory machine learning book "Machine Learning in Action". Although the k-NN algorithm is simple, many beginners do not understand the source code of its Python version, so this article analyzes that code.


What is a K-Nearest Neighbor Algorithm?

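In short, the k-NN algorithm classifies a sample by measuring the distance between its feature values and those of samples whose labels are already known. It is therefore a classification algorithm.

Advantages: makes no assumptions about the input data; not sensitive to outliers

Disadvantages: high computational complexity, because every query is compared against the whole sample set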


Let's go straight to the code and then analyze it (this code comes from "Machine Learning in Action"):

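def classify0(inx, dataset, labels, k):
    # Number of samples (rows) in the data set
    dataSetSize = dataset.shape[0]
    # Difference between inx and every sample
    diffMat = tile(inx, (dataSetSize, 1)) - dataset
    # Euclidean distance: square, sum per row, square root
    sqDiffMat = diffMat**2
    sqDistance = sqDiffMat.sum(axis=1)
    distances = sqDistance**0.5
    # Indices of the samples, nearest first
    sortedDistances = distances.argsort()
    # Count the labels of the k nearest samples
    classCount = {}
    for i in range(k):
        label = labels[sortedDistances[i]]
        classCount[label] = classCount.get(label, 0) + 1
    # items() replaces the Python 2-only iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]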

The function works as follows:

There is a sample data set, also known as the training set, in which every sample has a label. When we input new data without a label, each feature of the new data is compared with the corresponding feature of every sample in the set, and the labels of the most similar (nearest) samples are extracted. Generally, only the k most similar samples are considered. Finally, the label that occurs most often among those k samples becomes the category of the new data.
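The steps above can be sketched in plain Python, without NumPy. This is only an illustrative sketch, not the book's code: the function name knn_classify and the use of math.dist and collections.Counter are choices made here for the example (math.dist requires Python 3.8 or later).

```python
import math
from collections import Counter


def knn_classify(query, samples, labels, k):
    """Classify `query` by majority vote among its k nearest samples."""
    # Step 1: distance from the query to every labeled sample.
    distances = [math.dist(query, sample) for sample in samples]
    # Step 2: sample indices ordered from nearest to farthest.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])
    # Step 3: count the labels of the k nearest samples.
    votes = Counter(labels[i] for i in nearest[:k])
    # Step 4: the most frequent label wins.
    return votes.most_common(1)[0][0]


samples = [[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]]
labels = ['A', 'A', 'B', 'B']
print(knn_classify([0, 0], samples, labels, 3))  # B
```

The three samples nearest to [0, 0] carry the labels B, B and A, so B wins the vote.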


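The parameters of the classify0 function have the following meanings:

inx: the input vector to classify.

dataset: the sample set, a two-dimensional array with one sample per row.

labels: the labels of the sample set, one per sample.

k: the number of nearest neighbors that take part in the vote.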


A simple function that generates sample data:

def create_dataset():
    group = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


Note that array comes from numpy, so we need the following imports:

from numpy import *
import operator


Then we call:

group, labels = create_dataset()
result = classify0([0, 0], group, labels, 3)
print(result)

Obviously, the feature vector [0, 0] belongs to class B, so the code above prints B.


Even with this background, beginners may still find the actual code unfamiliar, so let's get to the main part.


Source code analysis


dataSetSize = dataset.shape[0]

shape is an attribute of array. It describes the "shape" of the array, that is, the size of each of its dimensions. For example,

In [2]: dataset = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])

In [3]: print(dataset.shape)
(4, 2)

Therefore, dataset.shape[0] is the number of samples (rows) in the sample set.
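As a small illustration of shape, the sketch below unpacks it into rows and columns. It uses the import numpy as np style rather than the article's from numpy import *, and the variable names n_samples, n_features and point are chosen here only for the example.

```python
import numpy as np

dataset = np.array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])

# shape is a tuple: (number of rows, number of columns)
n_samples, n_features = dataset.shape
print(n_samples)    # 4
print(n_features)   # 2

# A single feature vector is one-dimensional, so its shape has one entry.
point = np.array([0, 0])
print(point.shape)  # (2,)
```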


diffMat = tile(inx, (dataSetSize, 1)) - dataset

The tile(A, reps) function constructs an array by repeating array A. The second parameter, reps, gives the number of repetitions along each axis. Its API description is a bit hard to follow, but a few examples make the simple usage clear.

Let's look at the result of tile(x, (4, 1)), where x is [0, 0]:

In [5]: tile(x, (4, 1))
Out[5]:
array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

As you can see, the 4 repeats the array along the rows (originally one row, now four), and the 1 repeats the elements within each row (originally two elements, still two).

To confirm the above conclusion,

In [6]: tile(x, (4, 2))
Out[6]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

And,

In [7]: tile(x, (2, 2))
Out[7]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

For more information about how to use tile, see the NumPy API documentation.
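Because the examples above use an all-zero x, the repetition is hard to see in the output. The sketch below repeats the same calls with a hypothetical x of [1, 2], chosen here only to make the pattern visible.

```python
import numpy as np

x = np.array([1, 2])  # hypothetical values, not the article's x

# reps = (rows, columns): how often x is repeated in each direction
print(np.tile(x, (4, 1)).shape)     # (4, 2)
print(np.tile(x, (4, 2)).shape)     # (4, 4)
print(np.tile(x, (2, 2)).tolist())  # [[1, 2, 1, 2], [1, 2, 1, 2]]

# A single integer repeats x along its only axis.
print(np.tile(x, 3).tolist())       # [1, 2, 1, 2, 1, 2]
```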


After tiling, we subtract dataset. The subtraction is element-wise, as in matrix subtraction, and the result is still a 4*2 array.

In [8]: tile(x, (4, 1)) - dataset
Out[8]:
array([[-1. , -1.1],
       [-1. , -1.1],
       [ 0. ,  0. ],
       [ 0. , -0.1]])
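As a side note, NumPy's broadcasting gives the same difference without calling tile explicitly. The sketch below compares the two forms; it is an alternative shown for illustration, not the code the book uses.

```python
import numpy as np

dataset = np.array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
x = np.array([0, 0])

# Explicit repetition, as in classify0
tiled = np.tile(x, (dataset.shape[0], 1)) - dataset
# Broadcasting: x is stretched across all four rows automatically
broadcast = x - dataset

print(np.array_equal(tiled, broadcast))  # True
```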

With the Euclidean distance formula in mind, the subsequent code becomes clearer: the differences above are squared, summed, and then the square root is taken.
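For two features, the Euclidean distance between points (x1, x2) and (y1, y2) is sqrt((x1 - y1)^2 + (x2 - y2)^2). The sketch below computes it step by step, as classify0 does, and compares the result with np.linalg.norm. The comparison with norm is an addition made here as a cross-check and is not part of the original code.

```python
import numpy as np

dataset = np.array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
x = np.array([0, 0])

diff = x - dataset
# Square each difference, sum per row, then take the square root.
step_by_step = ((diff ** 2).sum(axis=1)) ** 0.5
# The same distances computed in one call.
with_norm = np.linalg.norm(diff, axis=1)

print(np.allclose(step_by_step, with_norm))  # True
print(step_by_step.round(3).tolist())        # [1.487, 1.487, 0.0, 0.1]
```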

Let's look at the summation method,

sqDiffMat.sum(axis=1)

where sqDiffMat is:

In [14]: sqDiffMat
Out[14]:
array([[ 1.  ,  1.21],
       [ 1.  ,  1.21],
       [ 0.  ,  0.  ],
       [ 0.  ,  0.01]])


With axis=1 the sum runs along each row, so the result is a one-dimensional array of length N (here 4), one value per sample.

To sum each column instead,

sqDiffMat.sum(axis=0)
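The difference between the two axes can be checked directly. The sketch below uses the squared differences shown above; the variable names sq_diff, row_sums and col_sums are chosen here only for the example.

```python
import numpy as np

sq_diff = np.array([[1.0, 1.21], [1.0, 1.21], [0.0, 0.0], [0.0, 0.01]])

row_sums = sq_diff.sum(axis=1)  # one value per sample (row)
col_sums = sq_diff.sum(axis=0)  # one value per feature (column)

print(row_sums.shape)  # (4,)
print(col_sums.shape)  # (2,)
```

classify0 needs one squared distance per sample, which is why it sums with axis=1.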

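argsort() does not return the sorted values. It returns the indices that would sort the array in ascending order, so sortedDistances[0] is the index of the nearest sample.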
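A short example makes the behavior of argsort concrete. The distances below are hypothetical values chosen here for illustration.

```python
import numpy as np

distances = np.array([0.9, 0.1, 0.5])  # hypothetical distances
order = distances.argsort()

print(order.tolist())             # [1, 2, 0]
print(distances[order].tolist())  # [0.1, 0.5, 0.9]
```

Index 1 comes first because distances[1] is the smallest value. Indexing the label list with these indices gives the labels of the nearest samples.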


classCount is a dictionary: each key is a label, and each value is the number of times that label appears among the k nearest samples.
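The voting step can be tried in isolation. The sketch below starts from the labels of the three samples nearest to [0, 0] in the example data (B, B, A) and uses items(), which works in both Python 2 and Python 3. The names k_nearest_labels, class_count and ranked are chosen here only for the example.

```python
import operator

# Labels of the k = 3 samples nearest to [0, 0] in the example data
k_nearest_labels = ['B', 'B', 'A']

class_count = {}
for label in k_nearest_labels:
    # get() returns 0 the first time a label is seen
    class_count[label] = class_count.get(label, 0) + 1

# Sort (label, count) pairs by count, largest first.
ranked = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True)

print(class_count)   # {'B': 2, 'A': 1}
print(ranked[0][0])  # B
```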


With that, every line of the algorithm's code should be clear.



