Python Implementation of the K-Nearest Neighbor Algorithm: Source Code Analysis

Last Update:2014-10-10 Source: Internet

Author: User

Introductions to the K-nearest neighbor algorithm abound, and most Python implementations trace back to the introductory machine learning book "Machine Learning in Action". Although the algorithm itself is very simple, many beginners do not fully understand the Python source code, so this article analyzes that source line by line.

What is the K-nearest neighbor algorithm?

Simply put, the K-nearest neighbor algorithm classifies data by measuring the distance between feature values. So it's a classification algorithm.

Pros: No assumptions about the input data, insensitive to outliers

Cons: High computational complexity

Let's go straight to the code and then analyze it. (This code comes from "Machine Learning in Action".)

    def classify0(inX, dataSet, labels, k):
        dataSetSize = dataSet.shape[0]
        # Repeat inX once per sample row, then subtract the samples
        diffMat = tile(inX, (dataSetSize, 1)) - dataSet
        sqDiffMat = diffMat ** 2
        sqDistance = sqDiffMat.sum(axis=1)
        distances = sqDistance ** 0.5
        # Indices that would sort the distances in ascending order
        sortedDistances = distances.argsort()
        classCount = {}
        for i in range(k):
            label = labels[sortedDistances[i]]
            classCount[label] = classCount.get(label, 0) + 1
        # items() works in both Python 2 and Python 3
        sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
        return sortedClassCount[0][0]

The function works as follows:

We have a collection of sample data, also known as a training set, in which every item carries a label. When new, unlabeled data comes in, each of its features is compared with the corresponding feature of the items in the sample set, and the labels of the most similar items (the nearest neighbors) are extracted. In general, only the k most similar items in the sample set are considered. Finally, the label that occurs most often among those k items becomes the classification of the new data.
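The procedure described above can be sketched without NumPy, using only the standard library. This is a minimal illustration rather than the book's code: the function name knn_vote and the sample data are assumptions made for this example.

```python
import math
from collections import Counter


def knn_vote(query, samples, labels, k):
    """Return the most common label among the k samples nearest to query.

    Hypothetical helper for illustration; not part of the original code.
    """
    # Euclidean distance from the query to every sample
    distances = [math.sqrt(sum((q - s) ** 2 for q, s in zip(query, row)))
                 for row in samples]
    # Indices of the samples, ordered from nearest to farthest
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])
    # Count the labels of the k nearest samples and take the majority
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]


samples = [[1.0, 1.1], [1.0, 1.1], [0.0, 0.0], [0.0, 0.1]]
labels = ['A', 'A', 'B', 'B']
print(knn_vote([0, 0], samples, labels, 3))
```

With k = 3, the three nearest samples to [0, 0] carry the labels B, B and A, so the majority vote is B.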

The parameters of the classify0 function have the following meanings:

inX: the new, unlabeled input data, expressed as a vector.

dataSet: the sample set, represented as a two-dimensional array with one sample per row.

labels: the labels corresponding to the sample set.

k: the number of nearest neighbors to select.

A simple function that produces some sample data:

    def create_dataset():
        group = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
        labels = ['A', 'A', 'B', 'B']
        return group, labels

Note that array comes from NumPy, so the imports have to be in place first.

    from numpy import *
    import operator

When we call the function:

    group, labels = create_dataset()
    result = classify0([0, 0], group, labels, 3)
    print(result)

Clearly, the feature vector [0, 0] belongs to class B, and the code above prints B.

Even knowing this, beginners may still find the actual code unfamiliar. Not to worry: the analysis starts now.

Source Analysis

    dataSetSize = dataSet.shape[0]

shape is an attribute of an array that describes its "shape", that is, its size along each dimension. For example:

    In [2]: dataSet = array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])
    In [3]: print(dataSet.shape)
    (4, 2)

So dataSet.shape[0] is the number of samples in the sample set.

    diffMat = tile(inX, (dataSetSize, 1)) - dataSet

The tile(A, reps) function builds a new array by repeating array A the number of times given by the second parameter, reps. Its API description is a bit convoluted, but the basic usage becomes clear from a few examples.

Let's look at the result of tile(x, (4, 1)), where x is the input vector [0, 0]:

    In [5]: tile(x, (4, 1))
    Out[5]:
    array([[0, 0],
           [0, 0],
           [0, 0],
           [0, 0]])

As you can see, the 4 repeats the array four times along the first axis (one row becomes four), while the 1 leaves the number of elements per row unchanged (two before, two after).

To confirm the above conclusion:

    In [6]: tile(x, (4, 2))
    Out[6]:
    array([[0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0],
           [0, 0, 0, 0]])

And:

    In [7]: tile(x, (2, 2))
    Out[7]:
    array([[0, 0, 0, 0],
           [0, 0, 0, 0]])

For the full details of tile, please consult the NumPy API documentation.

Once the tiled array is built, the data set is subtracted from it. The subtraction is element by element, like matrix subtraction, and the result is still a 4 x 2 array.

    In [8]: tile(x, (4, 1)) - dataSet
    Out[8]:
    array([[-1. , -1.1],
           [-1. , -1.1],
           [ 0. ,  0. ],
           [ 0. , -0.1]])
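As a side note, NumPy broadcasting produces the same difference matrix without calling tile. The sketch below checks this on the sample data; it uses the np alias instead of the star import, and the variable names tiled and broadcast are chosen for this example only.

```python
import numpy as np

x = np.array([0, 0])
dataSet = np.array([[1.0, 1.1], [1.0, 1.1], [0, 0], [0, 0.1]])

# Explicit version: repeat x once per sample row, then subtract
tiled = np.tile(x, (dataSet.shape[0], 1)) - dataSet

# Broadcasting version: NumPy stretches x across the rows automatically
broadcast = x - dataSet

print(tiled.shape)
print(np.array_equal(tiled, broadcast))
```

Both expressions yield a 4 x 2 array with identical values, so tile here mainly serves to make the repetition explicit.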

With the Euclidean distance formula in mind, the rest of the code is straightforward: square the differences, sum them, and take the square root.

Let's look at the summation:

    sqDiffMat.sum(axis=1)

where

    In [14]: sqDiffMat
    Out[14]:
    array([[ 1.  ,  1.21],
           [ 1.  ,  1.21],
           [ 0.  ,  0.  ],
           [ 0.  ,  0.01]])

Summing with axis=1 adds up each row, which gives a one-dimensional array with n elements, one per sample.

If you want to sum the columns instead:

    sqDiffMat.sum(axis=0)
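To make the axis argument concrete, the following sketch sums the squared differences both ways and then compares the row-wise result with np.linalg.norm, which computes the same Euclidean distances in a single call. Using np.linalg.norm is a substitute shown for comparison, not what the book's code does.

```python
import numpy as np

diffMat = np.array([[-1.0, -1.1], [-1.0, -1.1], [0.0, 0.0], [0.0, -0.1]])
sqDiffMat = diffMat ** 2

rowSums = sqDiffMat.sum(axis=1)   # one value per sample (row)
colSums = sqDiffMat.sum(axis=0)   # one value per feature (column)
print(rowSums.shape, colSums.shape)

# Square root of the row sums is the Euclidean distance to each sample
distances = rowSums ** 0.5
print(np.allclose(distances, np.linalg.norm(diffMat, axis=1)))
```

The row sums have one entry per sample and the column sums one entry per feature; only the row sums are useful for the distance calculation.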

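argsort() does not sort the array itself; it returns the indices that would sort the array in ascending order. Those indices are then used to look up the labels of the nearest samples.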
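A small sketch of how argsort() behaves, using made-up distance values chosen for this example:

```python
import numpy as np

distances = np.array([1.5, 0.1, 0.0, 0.7])

order = distances.argsort()
print(order)             # positions of the elements, nearest first
print(distances[order])  # the distances themselves, in ascending order
```

Here order is [2, 1, 3, 0]: the smallest distance sits at index 2, the next at index 1, and so on. Indexing a label list with these positions gives the labels from nearest to farthest.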

classCount is a dictionary: each key is a label, and each value is the number of times that label appears among the k nearest neighbors.
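The counting and voting step can be tried in isolation. In this sketch the list neighborLabels is made-up data standing in for the labels of the k nearest samples.

```python
import operator

neighborLabels = ['B', 'A', 'B']  # labels of the k nearest samples (made-up data)

classCount = {}
for label in neighborLabels:
    # get() returns 0 the first time a label is seen
    classCount[label] = classCount.get(label, 0) + 1

# Sort (label, count) pairs by count, largest first
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
print(sortedClassCount[0][0])
```

The first pair after sorting holds the label with the most votes, which is why the function returns sortedClassCount[0][0].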

With that, the specific code details of the algorithm should be clear.
