KNN in Data Mining

The K nearest neighbor algorithm is a non-parametric method used frequently in classification problems. The algorithm is clear and concise: for the sample to be classified, find its K nearest samples in the training set. These K samples then vote, and the sample to be classified is assigned the category held by the majority of them.
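
As a quick illustration, here is a minimal Python sketch of that voting procedure, assuming numeric feature vectors and plain Euclidean distance (the function name and the toy data are made up for the example):

import math
from collections import Counter

def knn_classify(query, training_samples, k):
    """Return the majority label among the k training samples nearest to query.

    training_samples is a list of (feature_vector, label) pairs.
    """
    # Distance from the query to every training sample (lazy learning:
    # nothing is precomputed, everything happens at classification time).
    distances = [
        (math.dist(query, features), label)
        for features, label in training_samples
    ]
    # Keep the k nearest samples and let them vote.
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Example: classify the point (1, 1) with k = 3.
train = [((0, 0), "A"), ((1, 0), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify((1, 1), train, k=3))  # -> "A"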

There are two main questions in the algorithm: 1. How do we measure "nearest"? 2. What should K be?

For the first question, there are three cases to discuss (a code sketch follows the list):

A. Nominal attributes: if two samples have the same value for the attribute, their distance is 0; otherwise it is 1. Example: for a gender attribute, if both samples are male the distance is 0; if one is male and the other female, the distance is 1.

B. Ordinal attributes: consider a student's grade rating such as {poor, fair, ok, good, perfect}. We can map each level to a consecutive integer starting at 0: {poor=0, fair=1, ok=2, good=3, perfect=4}. If the ratings of two students are good and fair, we can define the distance as 3 − 1 = 2.

C. Continuous attributes: these can be measured with the Euclidean distance √(∑(xᵢ − yᵢ)²). For example, the distance between the points (1, 2) and (3, 4) is √((1 − 3)² + (2 − 4)²) = √8 = 2√2.
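
Putting the three cases into code, a minimal sketch might look like this (the function names and the level mapping are illustrative, not from the original article):

def nominal_distance(a, b):
    """Case A: 0 if the values match, 1 otherwise."""
    return 0 if a == b else 1

# Case B: map each ordinal level to a consecutive integer starting at 0.
LEVELS = {"poor": 0, "fair": 1, "ok": 2, "good": 3, "perfect": 4}

def ordinal_distance(a, b):
    return abs(LEVELS[a] - LEVELS[b])

def euclidean_distance(x, y):
    """Case C: Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

print(nominal_distance("male", "female"))   # 1
print(ordinal_distance("good", "fair"))     # 2
print(euclidean_distance((1, 2), (3, 4)))   # 2.828... = 2 * sqrt(2)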

If a sample contains all three kinds of attributes, we need to normalize each attribute before computing the distance, or choose another algorithm such as decision trees or naive Bayes.
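
One common way to do that normalization, sketched below, is to divide each per-attribute distance by its maximum possible value, so every attribute contributes a value in [0, 1] before the distances are summed. This particular scheme is an assumption for illustration, not something the article specifies:

def normalized_ordinal_distance(a, b, levels):
    # Divide by the widest possible gap between levels.
    return abs(levels[a] - levels[b]) / (len(levels) - 1)

def normalized_continuous_distance(a, b, lo, hi):
    """Min-max normalize one continuous attribute, given its
    observed range [lo, hi] in the training data."""
    return abs(a - b) / (hi - lo)

# Mixed-attribute distance: gender (nominal) + grade (ordinal) + height (continuous).
levels = {"poor": 0, "fair": 1, "ok": 2, "good": 3, "perfect": 4}
d = (
    (0 if "male" == "male" else 1)                          # nominal: 0.0
    + normalized_ordinal_distance("good", "fair", levels)   # ordinal: 0.5
    + normalized_continuous_distance(170, 180, 150, 200)    # continuous: 0.2
)
print(d)  # 0.7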

For the second question, I think the best approach is to test: set aside a validation sample set and try different K values to find which works best. Of course, for large-scale data this method may not be practical, and then the experience and judgment of the engineer become particularly important. Much of the literature suggests a K value between 3 and 10; experience shows this range controls noise interference well.
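
Here is a sketch of that "just test it" approach using scikit-learn (my choice of library, the article names none), scanning the suggested range of 3-10 with cross-validation on synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

best_k, best_score = None, -1.0
for k in range(3, 11):  # the 3-10 range mentioned above
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))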

Features of the k nearest neighbor algorithm: A. There is no need to build a model (it is a lazy learning method), but the computational overhead is very high: each time a sample is classified, its distance to every training sample must be calculated.

B. It can create decision boundaries of arbitrary shape, whereas a decision tree produces only axis-parallel (rectilinear) boundaries.

C. Choosing an appropriate distance metric is important.

This article is from the "Lu Yao" blog; please be sure to keep the source: http://cwxfly.blog.51cto.com/6113982/1690589
