K Nearest Neighbor algorithm

The K Nearest Neighbor algorithm, usually called the KNN algorithm, is a fairly classical machine learning algorithm, and on the whole it is an easy algorithm to understand. The K refers to the K data samples closest to the point in question. The KNN algorithm is different from the K-means algorithm: K-means is used for clustering, that is, for working out which samples are similar to one another, while KNN is used for classification. In other words, the sample space is already divided into several classes; given a data point to classify, we find its K nearest samples and decide which class the point belongs to. Put simply, the K nearest points vote on the class of the data point being classified.

There is a classic figure in the Wikipedia entry on KNN:

From the figure we can see that there are two classes of sample data: one is the blue squares, the other the red triangles. The green circle is the data point we want to classify.

    • If k=3, the 3 points nearest the green circle are 2 red triangles and 1 blue square. These 3 points vote, so the green point is classified as a red triangle.
    • If k=5, the 5 points nearest the green circle are 2 red triangles and 3 blue squares. These 5 points vote, so the green point is classified as a blue square. (A minimal sketch of this vote follows the list.)
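
To make the voting concrete, here is a minimal sketch of the majority vote once the K nearest neighbors are known. The neighbor labels are taken from the figure described above; Python is assumed, since the article itself gives no code:

    from collections import Counter

    # Hypothetical labels of the k nearest neighbors of the green circle.
    neighbors_k3 = ["red triangle", "red triangle", "blue square"]   # k = 3
    neighbors_k5 = ["red triangle", "red triangle",
                    "blue square", "blue square", "blue square"]     # k = 5

    for neighbors in (neighbors_k3, neighbors_k5):
        label, votes = Counter(neighbors).most_common(1)[0]
        print(f"k={len(neighbors)}: classified as {label} with {votes} votes")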

We can see that machine learning, at its core, is a method based on data statistics. So what is this algorithm good for? Let's look at a few examples.

Product quality judgment

Suppose we need to judge the quality of paper towels. The quality of a paper towel can be described by two features, treated as a vector: one is "acid corrosion time", and the other is "the pressure it can withstand". Suppose our sample space is as follows (the so-called sample space, also called training data, is the data the machine learns from):

X1: Acid corrosion time (seconds) | X2: Pressure tolerance (kg/m²) | Quality y
--------------------------------- | ------------------------------ | ---------
7                                 | 7                              | Bad
7                                 | 4                              | Bad
3                                 | 4                              | Good
1                                 | 4                              | Good

So, if X1 = 3 and X2 = 7, what is the quality of this paper towel? Here we can use the KNN algorithm to judge.

Suppose k=3 (k should be an odd number, so that the vote cannot end in a tie). Here we calculate the distance from (3, 7) to all points. (For the distance formulas, refer to the distance formula in the K-means algorithm.)

X1: Acid corrosion time (seconds) | X2: Pressure tolerance (kg/m²) | Distance to (3, 7) | Vote y
--------------------------------- | ------------------------------ | ------------------ | ------
7                                 | 7                              | 4.00               | Bad
7                                 | 4                              | 5.00               | N/A (not among the 3 nearest)
3                                 | 4                              | 3.00               | Good
1                                 | 4                              | 3.61               | Good

So in the final vote, Good gets 2 votes and Bad gets 1 vote, and the point (3, 7) that we need to test is judged to be a good (qualified) product. (Of course, you can also use weights: take the distance as a weight, so that closer neighbors count for more; this tends to be more accurate.)
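
As a sketch only (this is not the original article's code, and Python is assumed), the whole procedure above can be reproduced with plain Euclidean distance:

    import math
    from collections import Counter

    # Training data: (acid corrosion time X1, pressure tolerance X2) -> quality y
    samples = [((7, 7), "Bad"),
               ((7, 4), "Bad"),
               ((3, 4), "Good"),
               ((1, 4), "Good")]

    def knn_classify(query, samples, k=3):
        # Euclidean distance from the query point to every training sample.
        distances = [(math.dist(query, x), y) for x, y in samples]
        distances.sort(key=lambda pair: pair[0])
        # The k nearest samples vote on the label.
        votes = Counter(y for _, y in distances[:k])
        return votes.most_common(1)[0][0]

    print(knn_classify((3, 7), samples, k=3))   # -> Good

A distance-weighted variant would replace the equal votes with a weight such as 1/distance per neighbor, as mentioned in the note above.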

Note: Example from here, k-nearestneighbors Excel table download

Forecast

Suppose we have the following set of data, where X is the number of elapsed seconds and Y is a value that changes over time (you can imagine a stock price):

So, when the time is 6.5 seconds, what is the Y value? We can use the KNN algorithm to predict it.

Here, let's assume k=2. We calculate the distance from every X to 6.5; for example, for x=5.1 the distance is |6.5 - 5.1| = 1.4, and for x=1.2 it is |6.5 - 1.2| = 5.3. This gives the following table:

Note that because k=2, the two nearest points are x=4 and x=5.1, whose Y values are 27 and 8 respectively. In this case we can simply take the average:

Therefore, the final predicted value is (27 + 8) / 2 = 17.5.
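
A minimal sketch of this prediction (KNN regression) in Python; only the x=4 and x=5.1 points carry the Y values quoted in the text (27 and 8), while the other sample points are invented placeholders, since the full table is not reproduced here:

    # Hypothetical time series: (elapsed seconds x, value Y).
    data = [(1.2, 11.0), (2.3, 19.0), (4.0, 27.0), (5.1, 8.0)]

    def knn_predict(x_query, data, k=2):
        # Sort the samples by their absolute distance to the query time
        # and average the Y values of the k nearest ones.
        nearest = sorted(data, key=lambda p: abs(p[0] - x_query))[:k]
        return sum(y for _, y in nearest) / k

    print(knn_predict(6.5, data, k=2))   # -> (27 + 8) / 2 = 17.5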

Note: Example from here, knn_timeseries Excel table download

Interpolation and curve smoothing

The KNN algorithm can also be used to smooth a curve, which is a somewhat unconventional use of it. Suppose our sample data is as follows (the same as above):

To smooth these points, we need to insert some interpolated values between them. For example, we generate interpolation points from 0 to 6 with a step of 0.1, and for each one we calculate its distance (absolute value) to every x in the sample. The data for the points from 0 to 0.5 is given below:

Take the 11 interpolation points from 2.5 to 3.5 and calculate their distances to each x. Suppose k=4: we take the Y values of the 4 nearest x points and average them, which gives the following table:

Doing this for every interpolation point 0.0, 0.1, 0.2, 0.3, ..., 1.1, 1.2, 1.3, ..., 3.1, 3.2, ..., 5.8, 5.9, 6.0 gives a large table, and depending on the value of K we get the following figure:
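
A rough Python sketch of the smoothing procedure under the same assumptions (interpolation points from 0 to 6 with step 0.1, k=4); the sample points are again hypothetical stand-ins for the table above:

    def knn_smooth(xs_query, data, k=4):
        """For each query x, average the Y values of the k nearest sample points."""
        smoothed = []
        for xq in xs_query:
            nearest = sorted(data, key=lambda p: abs(p[0] - xq))[:k]
            smoothed.append((xq, sum(y for _, y in nearest) / k))
        return smoothed

    # Hypothetical sample points; the real ones would come from the table above.
    data = [(0.5, 3.0), (1.2, 11.0), (2.3, 19.0), (4.0, 27.0), (5.1, 8.0), (6.0, 5.0)]

    # Interpolation points 0.0, 0.1, ..., 6.0, as described in the text.
    grid = [round(i * 0.1, 1) for i in range(61)]
    for x, y in knn_smooth(grid, data, k=4)[:6]:   # print the first few smoothed points
        print(x, y)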

Note: Example from here, knn_smoothing Excel table download

Postscript

Finally, I want to say two more things:

1) In machine learning, the algorithms themselves are basically quite simple. The hard parts are the mathematical modeling, that is, abstracting the characteristics of your business into feature vectors, and selecting appropriate sample data for the model. Neither of these is easy; compared with them, the algorithm is rather simple.

2) Finding the K points nearest to a given point, as KNN requires, is a classic algorithm interview question. The data structure to use is a max heap, which is a kind of binary tree; you can look up the relevant algorithms (a sketch follows).
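
As an illustration (Python assumed, and not the article's own code), a common way to find the K nearest points is to keep a bounded max heap of size K, so the farthest retained candidate can be evicted in O(log K). Since the standard heapq module only provides a min heap, the usual trick is to store negated distances:

    import heapq
    import math

    def k_nearest(query, points, k):
        """Return the k points nearest to `query`, nearest first, using a max heap of size k."""
        heap = []  # holds (-distance, point); the root is the farthest point kept so far
        for p in points:
            d = math.dist(query, p)
            if len(heap) < k:
                heapq.heappush(heap, (-d, p))
            elif d < -heap[0][0]:
                # The new point is closer than the current farthest: replace it.
                heapq.heapreplace(heap, (-d, p))
        return [p for _, p in sorted(heap, reverse=True)]

    print(k_nearest((3, 7), [(7, 7), (7, 4), (3, 4), (1, 4)], k=3))   # -> [(3, 4), (1, 4), (7, 7)]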

Http://coolshell.cn/articles/8052.html
