Data Analysis and Mining with R: The KNN Algorithm

Source: Internet
Author: User

A simple example!
Environment: CentOS 6.5 with a Hadoop cluster, Hive, R, and RHive; detailed installation and debugging steps can be found in the blog's documentation.

KNN algorithm steps:
All sample points (known classification + unknown classification) must first be normalized. Then, for each sample point in the unknown-classification dataset:
1. Calculate the distance from the current point (unknown category) to every point in the known-category dataset.
2. Sort the distances in increasing order.
3. Select the K points closest to the current point.
4. Determine the frequency of each category among those K points.
5. Take the category with the highest frequency among the K points as the predicted category of the current point.
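The steps above can be sketched as follows. This is an illustration in Python/NumPy rather than the article's R (the function name `knn_predict` and the toy data are made up for this example); it assumes the inputs are already normalized:

```python
import numpy as np

def knn_predict(known_x, known_y, unknown_x, k=5):
    """Classify each unknown point by majority vote among its k nearest known points.
    Assumes known_x and unknown_x are already normalized (e.g. z-scores)."""
    preds = []
    for x in unknown_x:
        # 1. distance from the current point to every known point
        dists = np.sqrt(((known_x - x) ** 2).sum(axis=1))
        # 2-3. sort by increasing distance and take the k nearest
        nearest = np.argsort(dists)[:k]
        # 4. count how often each category appears among the k neighbours
        labels, counts = np.unique(known_y[nearest], return_counts=True)
        # 5. the most frequent category is the prediction
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# toy example: two well-separated clusters
known_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
known_y = np.array(["a", "a", "b", "b"])
print(knn_predict(known_x, known_y, np.array([[0.05, 0.1], [5.1, 5.0]]), k=3))  # ['a' 'b']
```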

The R script:

#!/usr/bin/Rscript
# 1. Normalize the iris data
iris_s <- data.frame(scale(iris[, 1:4]))
iris_s <- cbind(iris_s, iris[, 5])
names(iris_s)[5] <- "Species"

# 2. Randomly select 100 records of the iris data set as the known-classification sample set
sample.list <- sample(1:150, size = 100)
iris.known <- iris_s[sample.list, ]

# 3. The remaining 50 records form the unknown-classification sample set (test set)
iris.unknown <- iris_s[-sample.list, ]

# 4. For each test-set sample, compute its distance to every known sample;
#    since the data are already normalized, Euclidean distance is used directly
length.known <- nrow(iris.known)
length.unknown <- nrow(iris.unknown)

# 5. Classification
for (i in 1:length.unknown) {
  dis_to_known <- data.frame(dis = rep(0, length.known))
  for (j in 1:length.known) {
    dis_to_known[j, 1] <- dist(rbind(iris.unknown[i, 1:4], iris.known[j, 1:4]),
                               method = "euclidean")
    dis_to_known[j, 2] <- as.character(iris.known[j, 5])
    names(dis_to_known)[2] <- "Species"
  }
  dis_to_known <- dis_to_known[order(dis_to_known$dis), ]
  k <- 5  # number of nearest neighbours
  type_freq <- as.data.frame(table(dis_to_known[1:k, ]$Species))
  type_freq <- type_freq[order(-type_freq$Freq), ]
  iris.unknown[i, 6] <- as.character(type_freq[1, 1])
}
names(iris.unknown)[6] <- "Species.pre"

# 6. Output classification results
iris.unknown[, 5:6]

The results are omitted here. In the result set, Species is the sample's actual classification and Species.pre is the classification produced by the KNN algorithm; the accuracy is above 90%.

KNN is a supervised learning algorithm with the following characteristics:
1. High accuracy, and not sensitive to outliers.
2. Can only handle numeric attributes.
3. High computational complexity (if there are n known-classification samples, n distances must be computed for each unknown-classification point).
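Point 3 is the main cost driver: every unknown point requires one distance computation per known point. As an illustration (in Python/NumPy rather than the article's R, and not part of the original script), the full n*m pairwise distance matrix can at least be computed in one vectorized step via broadcasting:

```python
import numpy as np

rng = np.random.default_rng(0)
known = rng.random((100, 4))    # n = 100 known-classification points
unknown = rng.random((50, 4))   # m = 50 unknown-classification points

# Broadcasting: (50, 1, 4) - (1, 100, 4) -> (50, 100, 4), reduced over the last axis,
# yields all n*m Euclidean distances without an explicit double loop.
diff = unknown[:, np.newaxis, :] - known[np.newaxis, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=2))

print(dist_matrix.shape)  # (50, 100): one row of distances per unknown point
```

This removes the interpreter overhead of nested loops, but the n*m work (and memory) is still done, which is exactly the complexity concern raised above.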

Problems with the KNN algorithm:
1. Choosing the value of K is difficult.
2. If, among the K nearest known-classification samples, several categories tie for the highest frequency, how should the unknown sample be classified? In the script above the choice is effectively random.
3. With n unknown-classification samples and m known-classification samples, n*m distances must be computed, so the computational cost is large; the entire dataset must also be stored, so the space complexity is large as well.
4. Could samples whose category has just been predicted be added to the known-category set and then used to classify the remaining unknown samples?
5. Normalization is performed before all other processing, which requires the full sample set (known classification + unknown classification) before the classifier can be built; in practice the unknown-classification samples are not necessarily available beforehand, so how should normalization be done?
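One common answer to problem 5 is to fit the normalization parameters (per-attribute mean and standard deviation) on the known samples only, then reuse those same parameters for each unknown sample as it arrives, so the classifier can be built before any unknown samples exist. A minimal sketch, in Python/NumPy for illustration (the values are made up; `ddof=1` matches the sample standard deviation that R's scale() uses):

```python
import numpy as np

known = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.9, 3.0]])

# Fit normalization on the known samples only
mu = known.mean(axis=0)
sigma = known.std(axis=0, ddof=1)   # sample standard deviation, as in R's scale()

known_scaled = (known - mu) / sigma

# A previously unseen sample is scaled with the *training* parameters,
# not with statistics that would require the unknown samples in advance.
new_sample = np.array([5.0, 3.2])
new_scaled = (new_sample - mu) / sigma
print(new_scaled)
```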

