Python implementation of K-nearest neighbor algorithm: source code Analysis

Source: Internet
Author: User

There are many introductions to the K-nearest neighbor algorithm, and its Python implementation is usually taken from the opening chapter of the machine learning book "Machine Learning in Action". Although the K-nearest neighbor algorithm itself is very simple, many beginners do not fully understand its Python source code, so this article analyzes that source.


What is the K-nearest neighbor algorithm?

Simply put, the K-nearest neighbor algorithm classifies a sample by measuring the distances between the feature values of different samples. So it is a classification algorithm.

Pros: no assumptions about the input data, insensitive to outliers

Cons: high computational complexity


All right, here is the code first, then the analysis (this code comes from "Machine Learning in Action"):

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistances = distances.argsort()
    classCount = {}
    for i in range(k):
        label = labels[sortedDistances[i]]
        classCount[label] = classCount.get(label, 0) + 1
    sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

The principle of this function is:

There is a collection of sample data, also called the training set, in which every sample carries a label. When we feed in new, unlabeled data, each feature of the new data is compared with the corresponding feature of every sample in the set, and the category labels of the most similar (nearest) samples are extracted. In general we only look at the first k most similar samples in the data set. Finally, the label that occurs most frequently among those k samples is taken as the classification of the new data.


The parameters of the classify0 function mean the following:

inX: the new, unlabeled input data, expressed as a vector.

dataSet: the sample (training) set, represented as an array of vectors.

labels: the labels corresponding to the samples in the sample set.

k: the number of nearest neighbors to select, i.e. the "top k".
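As a side note, the book's code is written for Python 2 (dict.iteritems no longer exists in Python 3). A minimal Python 3 sketch of the same logic might look like the following; the name classify0_py3 and the use of NumPy broadcasting and collections.Counter are my own choices, not the book's:

import numpy as np
from collections import Counter

def classify0_py3(in_x, data_set, labels, k):
    # Euclidean distance from in_x to every row of data_set;
    # broadcasting replaces the tile() call used in the book's version
    distances = np.sqrt(((data_set - np.asarray(in_x)) ** 2).sum(axis=1))
    # Indices of the k closest samples, nearest first
    nearest = distances.argsort()[:k]
    # Majority vote among the labels of those k samples
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

Calling classify0_py3([0, 0], group, labels, 3) on the sample data defined below returns 'B', the same result as the original.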


A simple function to produce some sample data:


def create_dataset():
    group = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels


Note that array lives in NumPy, so we need to import it (operator is needed by classify0 as well):

from numpy import *
import operator


When we call it:

group, labels = create_dataset()
result = classify0([0, 0], group, labels, 3)
print result

Obviously, the feature vector [0, 0] belongs to class B, and the code above will print B.


Even knowing all this, beginners may still find the actual code quite unfamiliar. No hurry, the analysis starts now!


Source Code Analysis


dataSetSize = dataSet.shape[0]

shape is an attribute of an array that describes the array's "shape", that is, its dimensions. For example:

In [2]: dataSet = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])

In [3]: print dataSet.shape
(4, 2)

So dataSet.shape[0] is the number of samples in the sample set.
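Indexing into that tuple therefore picks out a single dimension. A tiny check, using the same dataSet as above:

from numpy import array

dataSet = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
print(dataSet.shape[0])   # 4 -- the number of samples (rows)
print(dataSet.shape[1])   # 2 -- the number of features per sample (columns)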


diffMat = tile(inX, (dataSetSize, 1)) - dataSet

The tile(A, reps) function constructs a new array by repeating array A according to the second parameter. Its API description is a bit convoluted, but simple usage is easy to understand from a few examples.

Let's look at the result of tile(x, (4, 1)), where x stands for the query vector inX = [0, 0]:

In [5]: tile(x, (4, 1))
Out[5]:
array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

As you can see, the 4 repeats the array along the rows (originally 1 row, now 4), while the 1 repeats each row only once along the columns (originally 2 elements per row, still 2).

To confirm the above conclusion,

In [6]: tile(x, (4, 2))
Out[6]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

And

In [7]: tile(x, (2, 2))
Out[7]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0]])

For the full details of tile, please consult the API documentation yourself.
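Two more forms worth knowing (a small sketch; note that a plain integer works as well as a tuple):

from numpy import array, tile

x = array([0, 0])
print(tile(x, 3))        # [0 0 0 0 0 0] -- an integer repeats along the last axis
print(tile(x, (2, 3)))   # 2 rows, each containing x repeated 3 times (shape (2, 6))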


After tiling, we subtract dataSet. This works like matrix subtraction, and the result is still a 4 * 2 array.

In [8]: tile(x, (4, 1)) - dataSet
Out[8]:
array([[-1. , -1.1],
       [-1. , -1.1],
       [ 0. ,  0. ],
       [ 0. , -0.1]])

Combined with the formula for Euclidean distance, the rest of the code is clear: square the result, sum along each row, and take the square root.
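To make those three steps concrete, here is the whole pipeline run end-to-end on the running example (a small self-contained sketch; the query vector [0, 0] and the dataSet are the ones from create_dataset above):

from numpy import array, tile

dataSet = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
x = [0, 0]

diffMat = tile(x, (4, 1)) - dataSet    # per-feature differences, one sample per row
sqDiffMat = diffMat ** 2               # squared differences
sqDistances = sqDiffMat.sum(axis=1)    # row sums: (x1 - y1)^2 + (x2 - y2)^2
distances = sqDistances ** 0.5         # Euclidean distance to each sample
print(distances)                       # approx. [1.4866, 1.4866, 0.0, 0.1]

The two B samples are by far the closest, which is why classify0([0, 0], group, labels, 3) printed B earlier.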

Let's look at the summation method,

sqDiffMat.sum(axis=1)

where sqDiffMat is:

In [14]: sqDiffMat
Out[14]:
array([[ 1.  ,  1.21],
       [ 1.  ,  1.21],
       [ 0.  ,  0.  ],
       [ 0.  ,  0.01]])


Summing with axis=1 adds up each row, producing an n * 1 array (one value per sample).

If you want to sum along the columns instead:

sqDiffMat.sum(axis=0)
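A quick, self-contained comparison of the two axes, using the sqDiffMat values shown above:

from numpy import array

sqDiffMat = array([[1.0, 1.21], [1.0, 1.21], [0.0, 0.0], [0.0, 0.01]])
print(sqDiffMat.sum(axis=1))   # row sums:    [2.21, 2.21, 0.0, 0.01]
print(sqDiffMat.sum(axis=0))   # column sums: [2.0, 2.43]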

argsort() returns the indices that would sort the array in ascending order.
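For example (a small illustration; the distances are the ones from the running example, and since the first two values are tied, their relative order in the output may vary):

from numpy import array

distances = array([1.4866, 1.4866, 0.0, 0.1])
print(distances.argsort())   # e.g. [2 3 0 1]: sample 2 is nearest, then sample 3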


classCount is a dictionary whose keys are labels and whose values are the number of times each label appears among the k nearest neighbors.
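Here is a minimal, self-contained illustration of that counting-and-sorting step. The three labels below are just the k = 3 nearest labels from the running example, and items() is used instead of the book's Python 2 iteritems() so the snippet also runs on Python 3:

import operator

nearestLabels = ['B', 'B', 'A']    # labels of the k = 3 nearest samples
classCount = {}
for label in nearestLabels:
    classCount[label] = classCount.get(label, 0) + 1
# classCount is now {'B': 2, 'A': 1}

# Sort the (label, count) pairs by count, largest first, and take the top label
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount[0][0])      # B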


With that, the specific code details of the algorithm should now be clear.



