What is the KNN algorithm?

Last Update:2015-05-17 Source: Internet

Author: User

What are the minimum requirements for learning machine learning? I have found that the bar can be very low: even a junior high school level is enough. First, learn a little Python programming; children's books such as "1" and "2" will do. Mathematically, you only need to know the formula for the distance between two points (taught in middle school coordinate geometry).

The second chapter of the book describes the KNN algorithm, including a Python program.

Other chapters may have different mathematical requirements, but I want to show that many practical AI principles are actually very simple.

What is KNN? Here is an example:

At the beginning, the labels (colors) of all the data points are already known.

The problem KNN solves is: what color should the label of the point marked "?" in the picture be?

Judging by eye, the "?" sits in an area dense with blue dots, so the most "appropriate" label should be blue.

The KNN algorithm is:

Go through the known data points one by one (call each of them P):

First calculate the distance between "?" and P.

After all the distances are calculated, sort them from smallest to largest.
From the sorted sequence, take the first k (that is, the k points closest to "?").
For these k points, read off their labels (colors); these are already known.
Among all these labels (colors), which appears most often? (That is, what color is most common among the k points closest to "?"?)
The color that appears most often is the answer.
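
The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `knn_vote` and the sample points are made up for this example and do not come from the book.

```python
from collections import Counter
from math import sqrt

def knn_vote(query, points, labels, k):
    """Return the most common label among the k points nearest to query."""
    # Distance from the query point "?" to every known point P
    distances = [sqrt((query[0] - p[0]) ** 2 + (query[1] - p[1]) ** 2)
                 for p in points]
    # Indices of the points, ordered from nearest to farthest
    order = sorted(range(len(points)), key=lambda i: distances[i])
    # Labels of the k nearest points
    nearest_labels = [labels[i] for i in order[:k]]
    # The label that occurs most often is the answer
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up data: three blue points near the origin, two red points far away
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
labels = ["blue", "blue", "blue", "red", "red"]
print(knn_vote((0.5, 0.5), points, labels, k=3))   # blue
```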

For example, suppose the user chooses k = 5:

The 5 points closest to "?" are the 5 points inside the dashed circle centered on "?". Their colors, in order, are: [ ?,?,?,?,? ]. The number of occurrences is 4? 1 , so the most frequent color is ?

How do I write KNN in Python?

1. Calculate the distance between each point and "?"

First, recall the formula for the distance between two points from middle school coordinate geometry. Suppose the two points are A and B, with coordinates $(x_a, y_a)$ and $(x_b, y_b)$; then:

$$ \mbox{distance } D = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2} $$

Note that in this formula it makes no difference whether you use $x_a - x_b$ or $x_b - x_a$: the two differ only in sign, and the sign disappears after squaring. (In other words, the distance from A to B is the same as the distance from B to A.)

Each point has 3 coordinates, namely: x = "distance traveled by airplane", y = "amount of ice cream eaten", z = "time spent playing", and the labels are: "likes a lot", "likes somewhat", "does not like".

Because there are 3 coordinates, the data lives in 3-dimensional space. For the convenience of beginners, we consider only two of the coordinates, so we are confined to 2-dimensional space (the plane). Generalizing to n-dimensional space, using the formula for the distance between two points in n dimensions, is left to the reader as an exercise.
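
One possible answer to that exercise, as a minimal sketch: in n dimensions the distance is the square root of the sum of the squared differences over all coordinates. The helper name `distance` is made up for this illustration.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance between two points with the same number of coordinates."""
    # Square the difference in each coordinate, add them up, take the square root
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(distance((0, 0), (3, 4)))        # 2 coordinates: 5.0
print(distance((1, 2, 2), (0, 0, 0)))  # 3 coordinates: 3.0
```

With two coordinates this reduces to the formula for $D$ given earlier, and swapping `a` and `b` does not change the result.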

When writing a program, you first need to know how the points are stored in variables. In the "dating" example in the book, the coordinates of the points are stored in datingDataMat, and the labels are stored in datingLabels.

(You can call the file2matrix function on the Python command line to load the points, and then try printing the contents of the two variables datingDataMat and datingLabels. For example, datingDataMat[:, 1] prints the second coordinate of all the points (that is, the amount of ice cream). The ":" means that you do not specify the beginning or end of that index, so "everything" is taken along that index.)
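
To see what the ":" does without loading the book's data, here is a small sketch with a made-up array. The variable `data` and its numbers are hypothetical stand-ins for datingDataMat.

```python
from numpy import array

# Hypothetical stand-in for datingDataMat: one row per point, one column per coordinate
data = array([[100.0, 1.5, 0.2],
              [200.0, 2.5, 0.4],
              [300.0, 3.5, 0.6]])

second_coordinate = data[:, 1]   # every row, column 1
first_point = data[0, :]         # row 0, every column
print(second_coordinate)
print(first_point)
```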

We rewrite the Python function from the book in a more elementary way (the book's program uses vector and matrix notation, which is more concise but harder to understand):

    def classify1(inP, dataSet, labels, k):
        N = len(dataSet)
        Ds = array([0.0] * N)    # 0.0 rather than 0, so the array can hold fractional distances
        for i in range(N):
            x2 = (inP[0] - dataSet[i][0]) ** 2
            y2 = (inP[1] - dataSet[i][1]) ** 2
            d = sqrt(x2 + y2)
            Ds[i] = d

Line 1: defines our function. inP means "input point".

Line 2: N is the size of our dataSet, that is, the total number of points.

Line 3: we want to calculate the distance d, and there are N such distances, so we store the results in an array. Before using an array, we must define it and fill it with zeros (this is called initialization). The name Ds means "many d's" (like "dogs" in English, the plural of "dog").

Line 4 is a loop: for each point, we use the index i to "point to" it. Using an index is the usual way to work with an array, because an array lets you read an element at any position.

Lines 5 and 6: calculate the values of $(\Delta x)^2$ and $(\Delta y)^2$. Because each point is stored as a list [x, y], x is read with index [0] and y with index [1].

Line 7: calculate $D = \sqrt{(\Delta x)^2 + (\Delta y)^2}$.

Line 8: put the calculated distance d into the array Ds.
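
For comparison, the vector style mentioned earlier might look roughly like the sketch below. This is not the book's code: the function name `distances_to` is made up, and like the loop version it uses only the first two coordinates.

```python
import numpy as np

def distances_to(inP, dataSet):
    """Distances from inP to every row of dataSet, using the first two coordinates."""
    data = np.asarray(dataSet, dtype=float)[:, :2]
    # Broadcasting subtracts inP from every row at once
    diff = data - np.asarray(inP, dtype=float)[:2]
    # Square, add up along each row, take the square root: one distance per row
    return np.sqrt((diff ** 2).sum(axis=1))

print(distances_to([0, 0], [[3, 4], [6, 8], [0, 1]]))
```

The loop and the vector version compute the same numbers; the loop only makes each step visible.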

2. Sort the distances

Very simple, just one line:

        dsorted = Ds.sort()    # caution: sort() reorders Ds in place and returns None

Note that these program fragments must be indented correctly, otherwise Python will report an error. This line still belongs to the classify1 function.

3. Read out the colors of the points

I made a mistake just now: after sorting Ds, we can no longer tell which point corresponds to which label. So we use "argsort" instead (argument sort, which sorts the indices rather than the values).

For example, let a = array([17, 38, 10, 49]).

a.sort() rearranges a into [10, 17, 38, 49],

but a.argsort() gives [2, 0, 1, 3]; these are indices.

In other words: the result of argsort is the old indices in their new order.
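
The same example can be run directly. A short sketch using NumPy's `argsort`, which leaves the original array untouched:

```python
from numpy import array

a = array([17, 38, 10, 49])
order = a.argsort()   # indices that would sort a
print(order)          # [2 0 1 3]
print(a[order])       # [10 17 38 49]
print(a)              # a itself is unchanged: [17 38 10 49]
```

Reading `order` from left to right: the smallest value sits at old index 2, the next at old index 0, and so on.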

Python program:

        d_sorted = Ds.argsort()
        first_k = d_sorted[0:k]        # extract the first k elements (not actually required)

Now, to find the labels of these k points, we can create a new array to store them:

        first_k_labels = array([0] * k)    # prepare an empty array
        for i in range(0, k):
            first_k_labels[i] = labels[d_sorted[i]]

To explain the last line: labels[i] would be the label of the i-th point in the original order. But what we want is the label of the i-th point after sorting, so we first "look up" the d_sorted array to find the index of the sorted element, and then use that index to "look up" the labels array. (It is like using dictionaries: we want the Japanese translation of some Chinese word, but we only have a Chinese-English dictionary and an English-Japanese dictionary, so we have to look things up twice. That is how syntax such as array1[array2[i]] arises. It is very common.)
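
The double lookup can be tried on a tiny made-up example. The distances and labels below are invented for illustration. The second form, indexing `labels` with a whole array of indices, is a NumPy shortcut that gives the same result as the loop.

```python
from numpy import array

labels = array([3, 1, 2, 1])            # made-up labels, one per point
Ds = array([17.0, 38.0, 10.0, 49.0])    # made-up distances, one per point
d_sorted = Ds.argsort()                 # old indices, nearest first
k = 2

# Loop version: look up the index first, then the label
first_k_labels = [labels[d_sorted[i]] for i in range(k)]
# NumPy shortcut: index labels with an array of indices
same_labels = labels[d_sorted[:k]]
```

Here the two nearest points are old index 2 and old index 0, so both versions pick out labels 2 and 3.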

4. Which color appears most often?

Use this loop to count the number of occurrences of each label:

        like1 = 0              # number of points labeled "does not like this boy"
        like2 = 0              # number of points labeled "likes this boy somewhat"
        like3 = 0              # number of points labeled "likes this boy a lot"
        for i in range(0, k):
            label = first_k_labels[i]
            if label == 1:
                like1 += 1
            elif label == 2:
                like2 += 1
            elif label == 3:
                like3 += 1

Then, find the label with the most occurrences:

        if like1 > like2 and like1 > like3:      # if like1 occurs most often
            best_label = 1                        # the answer is "does not like this boy"
        elif like2 > like1 and like2 > like3:    # if like2 occurs most often
            best_label = 2                        # the answer is "likes this boy somewhat"
        else:                                    # if like3 occurs most often
            best_label = 3                        # the answer is "likes this boy a lot"
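
The counting and comparing can also be done with `collections.Counter` from the standard library. This is an alternative sketch, not the approach used in this article, and the helper name `most_common_label` is made up.

```python
from collections import Counter

def most_common_label(first_k_labels):
    """Return the label that occurs most often among the k nearest points."""
    counts = Counter(first_k_labels)
    # most_common(1) returns a list holding one (label, count) pair
    return counts.most_common(1)[0][0]

print(most_common_label([1, 3, 3, 2, 3]))   # 3
```

The two approaches treat ties differently. The if/elif/else version answers 3 whenever neither like1 nor like2 is strictly the largest count, while `Counter` returns whichever tied label it encountered first.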

5. Finished!

        return best_label

Testing

The actual Python program also needs these few "header" lines:

    # -*- coding: utf-8 -*-           # this line allows Chinese comments in the source
    from numpy import array
    from math import sqrt
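
Putting the header and the five steps together gives something like the following. This is an assembled sketch: it assumes the fragments combine in the order shown in this article, and it counts labels directly without the intermediate first_k_labels array. The small data set at the bottom is made up for testing and stands in for the book's dating data.

```python
# -*- coding: utf-8 -*-
from numpy import array
from math import sqrt

def classify1(inP, dataSet, labels, k):
    N = len(dataSet)
    Ds = array([0.0] * N)                 # float array, so distances are not truncated
    for i in range(N):                    # step 1: distance to every point
        x2 = (inP[0] - dataSet[i][0]) ** 2
        y2 = (inP[1] - dataSet[i][1]) ** 2
        Ds[i] = sqrt(x2 + y2)
    d_sorted = Ds.argsort()               # step 2: indices from nearest to farthest
    like1 = like2 = like3 = 0
    for i in range(k):                    # steps 3 and 4: count labels of the k nearest
        label = labels[d_sorted[i]]
        if label == 1:
            like1 += 1
        elif label == 2:
            like2 += 1
        elif label == 3:
            like3 += 1
    if like1 > like2 and like1 > like3:   # step 5: pick the most frequent label
        return 1
    elif like2 > like1 and like2 > like3:
        return 2
    else:
        return 3

# Made-up test data: label 1 clusters near (1, 1), label 3 near (9, 9)
dataSet = [[1.0, 1.0], [1.5, 0.5], [0.5, 1.5], [9.0, 9.0], [9.5, 8.5]]
labels = [1, 1, 1, 3, 3]
print(classify1([1.0, 1.2], dataSet, labels, 3))   # 1
print(classify1([9.0, 8.8], dataSet, labels, 2))   # 3
```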

Result of running the program:

In this example, the coordinates of "?" are [4.54e+04, 4.98e+00], which I made up after looking at the data. 4.54e+04 is scientific notation, that is, $4.54 \times 10^4$.

I had to increase k to 600 before like2 became nonzero.
