Data Analysis and Mining with R: The KNN Algorithm

Source: Internet
Author: User

A simple example!
Environment: CentOS 6.5 with a Hadoop cluster, Hive, R, and RHive; detailed installation and debugging steps can be found in the blog's documentation.

KNN algorithm steps:
All sample points (known classification + unknown classification) must first be normalized. Then, for each sample point in the unknown-classification dataset:
1. Calculate the distance from the current point (unknown category) to every point in the known-category dataset.
2. Sort the distances in increasing order.
3. Select the K points closest to the current point.
4. Determine the frequency of each category among those K points.
5. Take the category with the highest frequency among the K points as the predicted category of the current point.
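The steps above can be sketched as follows. This is an illustration in Python/NumPy rather than the article's R (the function name `knn_predict` and the toy data are made up for this example); it assumes the inputs are already normalized:

```python
import numpy as np

def knn_predict(known_x, known_y, unknown_x, k=5):
    """Classify each unknown point by majority vote among its k nearest known points.
    Assumes known_x and unknown_x are already normalized (e.g. z-scores)."""
    preds = []
    for x in unknown_x:
        # 1. distance from the current point to every known point
        dists = np.sqrt(((known_x - x) ** 2).sum(axis=1))
        # 2-3. sort by increasing distance and take the k nearest
        nearest = np.argsort(dists)[:k]
        # 4. count how often each category appears among the k neighbours
        labels, counts = np.unique(known_y[nearest], return_counts=True)
        # 5. the most frequent category is the prediction
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# toy example: two well-separated clusters
known_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
known_y = np.array(["a", "a", "b", "b"])
print(knn_predict(known_x, known_y, np.array([[0.05, 0.1], [5.1, 5.0]]), k=3))  # ['a' 'b']
```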

The R script:

#!/usr/bin/Rscript
# 1. Normalize the iris data
iris_s <- data.frame(scale(iris[, 1:4]))
iris_s <- cbind(iris_s, iris[, 5])
names(iris_s)[5] <- "Species"

# 2. Randomly select 100 records of the iris data set as the known-classification sample set
sample.list <- sample(1:150, size = 100)
iris.known <- iris_s[sample.list, ]

# 3. The remaining 50 records form the unknown-classification sample set (test set)
iris.unknown <- iris_s[-sample.list, ]

# 4. For each test-set sample, compute its distance to every known sample;
#    since the data are already normalized, Euclidean distance is used directly
length.known <- nrow(iris.known)
length.unknown <- nrow(iris.unknown)

# 5. Classification
for (i in 1:length.unknown) {
  dis_to_known <- data.frame(dis = rep(0, length.known))
  for (j in 1:length.known) {
    dis_to_known[j, 1] <- dist(rbind(iris.unknown[i, 1:4], iris.known[j, 1:4]),
                               method = "euclidean")
    dis_to_known[j, 2] <- as.character(iris.known[j, 5])
    names(dis_to_known)[2] <- "Species"
  }
  dis_to_known <- dis_to_known[order(dis_to_known$dis), ]
  k <- 5  # number of nearest neighbours
  type_freq <- as.data.frame(table(dis_to_known[1:k, ]$Species))
  type_freq <- type_freq[order(-type_freq$Freq), ]
  iris.unknown[i, 6] <- as.character(type_freq[1, 1])
}
names(iris.unknown)[6] <- "Species.pre"

# 6. Output classification results
iris.unknown[, 5:6]

The results are omitted here. In the result set, Species is the sample's actual classification and Species.pre is the classification produced by the KNN algorithm; the accuracy is above 90%.

KNN is a supervised learning algorithm with the following characteristics:
1. High accuracy, and not sensitive to outliers.
2. Can only handle numeric attributes.
3. High computational complexity (if there are n known-classification samples, n distances must be computed for each unknown-classification point).
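Point 3 is the main cost driver: every unknown point requires one distance computation per known point. As an illustration (in Python/NumPy rather than the article's R, and not part of the original script), the full n*m pairwise distance matrix can at least be computed in one vectorized step via broadcasting:

```python
import numpy as np

rng = np.random.default_rng(0)
known = rng.random((100, 4))    # n = 100 known-classification points
unknown = rng.random((50, 4))   # m = 50 unknown-classification points

# Broadcasting: (50, 1, 4) - (1, 100, 4) -> (50, 100, 4), reduced over the last axis,
# yields all n*m Euclidean distances without an explicit double loop.
diff = unknown[:, np.newaxis, :] - known[np.newaxis, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=2))

print(dist_matrix.shape)  # (50, 100): one row of distances per unknown point
```

This removes the interpreter overhead of nested loops, but the n*m work (and memory) is still done, which is exactly the complexity concern raised above.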

Problems with the KNN algorithm:
1. Choosing the value of K is difficult.
2. If, among the K nearest known-classification samples, several categories tie for the highest frequency, how should the unknown sample be classified? In the script above the choice is effectively random.
3. With n unknown-classification samples and m known-classification samples, n*m distances must be computed, so the computational cost is large; the entire dataset must also be stored, so the space complexity is large as well.
4. Could samples whose category has just been predicted be added to the known-category set and then used to classify the remaining unknown samples?
5. Normalization is performed before all other processing, which requires the full sample set (known classification + unknown classification) before the classifier can be built; in practice the unknown-classification samples are not necessarily available beforehand, so how should normalization be done?
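One common answer to problem 5 is to fit the normalization parameters (per-attribute mean and standard deviation) on the known samples only, then reuse those same parameters for each unknown sample as it arrives, so the classifier can be built before any unknown samples exist. A minimal sketch, in Python/NumPy for illustration (the values are made up; `ddof=1` matches the sample standard deviation that R's scale() uses):

```python
import numpy as np

known = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.9, 3.0]])

# Fit normalization on the known samples only
mu = known.mean(axis=0)
sigma = known.std(axis=0, ddof=1)   # sample standard deviation, as in R's scale()

known_scaled = (known - mu) / sigma

# A previously unseen sample is scaled with the *training* parameters,
# not with statistics that would require the unknown samples in advance.
new_sample = np.array([5.0, 3.2])
new_scaled = (new_sample - mu) / sigma
print(new_scaled)
```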

