KNN Proximity Classification algorithm


The K-Nearest Neighbor (KNN) classification algorithm is one of the simplest machine learning algorithms. It classifies samples by measuring the distance between feature vectors. The idea is simple: compute the distance between a point A and all other points, take the K points nearest to A, count which category appears most often among those K points, and assign A to that category.

Here is an example to illustrate:

Movie Name                    Number of Fights    Number of Kisses    Movie Type
California Man                3                   104                 Romance
He's Not Really into Dudes    2                   100                 Romance
Beautiful Woman               1                   81                  Romance
Kevin Longblade               101                 10                  Action
Robo Slayer 3000              99                  5                   Action
Amped II                      98                  2                   Action

In plain terms, this data uses the number of fights and the number of kisses to define a film's type: mostly kisses means a romance, mostly fights means an action movie. Now suppose there is a movie of unknown type (the name is withheld so it does not give the type away) with 18 fights and 90 kisses: which type of movie does it belong to?

What the KNN algorithm does is first use the number of fights and the number of kisses as each film's coordinates, then compute the distance between the unknown movie and each of the six known movies, take the K movies with the smallest distances, and count which type occurs most often among those K movies. If, say, action occurs most often, the unknown movie is classified as an action movie.
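As a minimal sketch of this movie example (assuming plain Euclidean distance and K = 3; the class and variable names are illustrative only):

import java.util.Arrays;
import java.util.Comparator;

public class MovieExample {
    public static void main(String[] args) {
        // Each row: {fights, kisses}, with the parallel label array below.
        double[][] movies = {
            {3, 104}, {2, 100}, {1, 81},   // romance movies
            {101, 10}, {99, 5}, {98, 2}    // action movies
        };
        String[] labels = {"Romance", "Romance", "Romance", "Action", "Action", "Action"};
        double[] unknown = {18, 90};       // 18 fights, 90 kisses
        int k = 3;

        // Sort the indices of the known movies by Euclidean distance to the unknown one.
        Integer[] order = {0, 1, 2, 3, 4, 5};
        Arrays.sort(order, Comparator.comparingDouble(i ->
                Math.hypot(movies[i][0] - unknown[0], movies[i][1] - unknown[1])));

        // Majority vote among the k nearest movies.
        int romance = 0, action = 0;
        for (int i = 0; i < k; i++) {
            if (labels[order[i]].equals("Romance")) romance++; else action++;
        }
        System.out.println(romance > action ? "Romance" : "Action"); // prints "Romance"
    }
}

The three nearest movies all turn out to be romances, so the unknown movie is classified as a romance.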

In practice, several issues deserve attention. How should the value of K be chosen? What is the best way to compute the distance between two points? What can be done if the amount of computation is too large? And what if the class distribution in the sample is very uneven, for example 200 action movies but only 20 romance movies, so that even for a movie that is not an action film, the K nearest neighbors contain many action movies simply because action samples are so numerous?

There is no universal algorithm, only the algorithm that is optimal in a particular setting.

1.1 Algorithm Guideline

The guiding idea of the KNN algorithm is the proverb "he who stays near vermilion turns red, he who stays near ink turns black": a sample's category is inferred from the categories of its neighbors.

Compute the distance between the sample to be classified and the training samples of known classes, find the K nearest neighbors of the sample to be classified, and then judge its category according to the categories those neighbors belong to.

1.2 Similarity measurement

Similarity is measured by the distance between two points in feature space: the greater the distance, the less similar the two points. There are many choices of distance [13]; the relatively simple Euclidean distance is usually used.

Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^{T} S^{-1} (x - y)}$, where $S$ is the covariance matrix of the data. The Mahalanobis distance can alleviate the distance distortion caused by linear combinations of attributes.

Manhattan distance: $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$

Chebyshev distance: $d(x, y) = \max_{i} |x_i - y_i|$

Minkowski distance: $d(x, y) = \left(\sum_{i=1}^{n} |x_i - y_i|^r\right)^{1/r}$; when $r = 1$ it reduces to the Manhattan distance, and when $r = 2$ to the Euclidean distance.

Other options include the average distance, the chord distance, and the geodesic distance.
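As a small sketch of the most common of these metrics (the method names are illustrative, not from any particular library):

public class Distances {
    // Euclidean distance: square root of the sum of squared coordinate differences.
    public static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) sum += (x[i] - y[i]) * (x[i] - y[i]);
        return Math.sqrt(sum);
    }

    // Manhattan distance: sum of absolute coordinate differences.
    public static double manhattan(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) sum += Math.abs(x[i] - y[i]);
        return sum;
    }

    // Chebyshev distance: maximum absolute coordinate difference.
    public static double chebyshev(double[] x, double[] y) {
        double max = 0.0;
        for (int i = 0; i < x.length; i++) max = Math.max(max, Math.abs(x[i] - y[i]));
        return max;
    }

    // Minkowski distance of order r: r = 1 gives Manhattan, r = 2 gives Euclidean.
    public static double minkowski(double[] x, double[] y, double r) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) sum += Math.pow(Math.abs(x[i] - y[i]), r);
        return Math.pow(sum, 1.0 / r);
    }
}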

1.3 Category Determination

Voting method: the minority obeys the majority; the sample is assigned to the class that the largest number of its nearest neighbors belong to.

Weighted voting method: the nearest neighbors' votes are weighted by distance, with closer neighbors getting larger weights (e.g. the weight is the inverse of the squared distance).
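A minimal sketch of this weighted voting, using the inverse-squared-distance weight mentioned above (the class and method names are illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedVote {
    /** One already-found neighbor: its class label and its distance to the test sample. */
    public static class Neighbor {
        final String label;
        final double distance;
        Neighbor(String label, double distance) { this.label = label; this.distance = distance; }
    }

    /** Returns the label whose accumulated weight (1 / d^2) is largest. */
    public static String weightedVote(List<Neighbor> neighbors) {
        Map<String, Double> votes = new HashMap<>();
        for (Neighbor n : neighbors) {
            // Add a tiny constant so a zero distance does not divide by zero.
            double weight = 1.0 / (n.distance * n.distance + 1e-9);
            votes.merge(n.label, weight, Double::sum);
        }
        String best = null;
        double bestWeight = -1.0;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() > bestWeight) { best = e.getKey(); bestWeight = e.getValue(); }
        }
        return best;
    }
}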

1.4 Pros and Cons

1.4.1 Advantages
    1. Simple, easy to understand and implement; no parameters to estimate and no training phase required;
    2. Suitable for the classification of rare events;
    3. Especially suitable for multi-class problems (multi-modal data, where an object carries multiple category labels), where KNN often performs better than SVM.

1.4.2 Disadvantages
    1. It is a lazy algorithm: classifying a test sample requires a large amount of computation and memory overhead, so scoring is slow;
    2. When the samples are unbalanced, for example when one class has a very large sample size and another a very small one, a new sample's K nearest neighbors may be dominated by the bulk class;
    3. Unlike decision trees, it cannot produce interpretable rules.

1.5 FAQ

1.5.1 K value setting

If the K value is too small, there are too few neighbors, which reduces classification accuracy and amplifies the interference of noisy data. If the K value is too large and the class of the sample to be classified has few samples in the training set, then the selected K neighbors will include data that is actually not similar, which also increases noise and degrades the classification.

How to choose a proper K value has become a research hotspot for KNN. The K value is usually determined by cross-validation (with K = 1 as the baseline).

Rule of thumb: K is generally lower than the square root of the number of training samples.
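As a minimal, self-contained sketch of selecting K by leave-one-out cross-validation (plain arrays, majority voting, Euclidean distance; all names are illustrative and independent of the implementation later in this article):

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class ChooseK {
    /** Classifies one sample by majority vote among its k nearest training samples. */
    static String classify(double[][] train, String[] labels, double[] sample, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> euclidean(train[i], sample)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < train.length; i++)
            votes.merge(labels[order[i]], 1, Integer::sum);
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Leave-one-out error rate: each sample is classified using all the others. */
    static double looError(double[][] data, String[] labels, int k) {
        int errors = 0;
        for (int i = 0; i < data.length; i++) {
            double[][] train = new double[data.length - 1][];
            String[] trainLabels = new String[data.length - 1];
            for (int j = 0, t = 0; j < data.length; j++) {
                if (j == i) continue;
                train[t] = data[j];
                trainLabels[t++] = labels[j];
            }
            if (!classify(train, trainLabels, data[i], k).equals(labels[i])) errors++;
        }
        return (double) errors / data.length;
    }

    /** Tries odd K values up to sqrt(n) and returns the one with the lowest error rate. */
    static int bestK(double[][] data, String[] labels) {
        int best = 1;
        double bestErr = Double.MAX_VALUE;
        for (int k = 1; k * k <= data.length; k += 2) {
            double err = looError(data, labels, k);
            if (err < bestErr) { bestErr = err; best = k; }
        }
        return best;
    }
}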

1.5.2 Category determination

The simple voting method does not take into account how close each neighbor is, although the closest neighbors arguably deserve more influence on the final classification, so the weighted voting method is more appropriate.

1.5.3 Choice of distance measure

The impact of high dimensionality on distance measures: it is well known that the more variables there are, the less discriminating the Euclidean distance becomes.

The effect of variable range on distance: variables with larger ranges tend to dominate the distance calculation, so the variables should be normalized first.
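A minimal sketch of min-max normalization, which rescales every variable to the range [0, 1] (the method name is illustrative):

public class Normalize {
    /** Scales each column of the data matrix to the range [0, 1], in place. */
    public static void minMaxNormalize(double[][] data) {
        if (data.length == 0) return;
        int cols = data[0].length;
        for (int j = 0; j < cols; j++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] row : data) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
            }
            double range = max - min;
            for (double[] row : data) {
                // If a column is constant, leave it at 0 instead of dividing by zero.
                row[j] = range == 0 ? 0.0 : (row[j] - min) / range;
            }
        }
    }
}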

1.5.4 Selection of training samples

To reduce computation, researchers have studied how to select training samples; these methods can be broadly divided into two categories. The first category reduces the size of the training set: the sample data stored by the KNN algorithm contains a large amount of redundancy, which increases both storage overhead and computational cost. The training samples can be reduced by deleting samples from the original set that are irrelevant to classification and using the remaining samples as the new training set, by selecting some representative samples from the original set as the new training set, or by using the cluster centers produced by a clustering step as the new training samples.
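As a minimal sketch of the clustering idea (here each class is simply collapsed to a single mean prototype; a real reducer would usually keep several cluster centers per class, and the names are illustrative):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceTrainingSet {
    /** Replaces all samples of each class with that class's mean vector (one prototype per class). */
    public static Map<String, double[]> classCenters(List<double[]> samples, List<String> labels) {
        Map<String, double[]> sums = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < samples.size(); i++) {
            double[] x = samples.get(i);
            String c = labels.get(i);
            double[] sum = sums.computeIfAbsent(c, k -> new double[x.length]);
            for (int j = 0; j < x.length; j++) sum[j] += x[j];
            counts.merge(c, 1, Integer::sum);
        }
        // Divide each accumulated sum by the class count to get the class mean.
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            int n = counts.get(e.getKey());
            for (int j = 0; j < e.getValue().length; j++) e.getValue()[j] /= n;
        }
        return sums;
    }
}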

In addition, some samples in the training set may be more trustworthy than others. Different weights can be applied to the samples to increase the influence of reliable samples and reduce the impact of unreliable ones.

1.5.5 Performance issues

KNN is a lazy algorithm, and laziness has its price: building the model is trivial, but classifying a test sample is expensive for the system, because all training samples must be scanned and their distances computed.

There are a number of ways to improve the efficiency of calculations, such as compressing training samples.

1.6 Algorithm Flow
    1. Prepare the data and preprocess it.
    2. Choose an appropriate data structure to store the training data and the test tuple.
    3. Set the parameters, such as K.
    4. Maintain a priority queue of size K, ordered from largest to smallest distance, to store the nearest-neighbor training tuples. Randomly select K tuples from the training set as the initial nearest neighbors, compute the distance from the test tuple to each of them, and put the training tuple labels and distances into the priority queue.
    5. Traverse the training tuple set: compute the distance L between the current training tuple and the test tuple, and obtain the maximum distance Lmax in the priority queue.
    6. Compare the two. If L >= Lmax, discard the tuple and continue with the next one; if L < Lmax, remove the tuple with the maximum distance from the priority queue.
    7. Insert the current training tuple into the priority queue.
    8. After the traversal, compute the majority class of the K tuples in the priority queue and use it as the category of the test tuple.
    9. After testing on the test tuple set, compute the error rate, keep retraining with different K values, and finally take the K value with the smallest error rate.

Java Code Implementation

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class KNN {
    /**
     * Comparator for the priority queue: the greater the distance, the higher the priority,
     * so the head of the queue is always the current farthest of the K neighbors.
     */
    private Comparator<KNNNode> comparator = new Comparator<KNNNode>() {
        public int compare(KNNNode o1, KNNNode o2) {
            if (o1.getDistance() >= o2.getDistance())
                return -1;
            else
                return 1;
        }
    };

    /**
     * Get k distinct random numbers.
     * @param k   number of random numbers
     * @param max exclusive upper bound for the random numbers
     * @return the generated list of random indices
     */
    public List<Integer> getRandKNum(int k, int max) {
        List<Integer> rand = new ArrayList<Integer>(k);
        for (int i = 0; i < k; i++) {
            int temp = (int) (Math.random() * max);
            if (!rand.contains(temp))
                rand.add(temp);
            else
                i--;
        }
        return rand;
    }

    /**
     * Calculate the distance between the test tuple and a training tuple
     * (the squared Euclidean distance, which is sufficient for comparison).
     * @param d1 test tuple
     * @param d2 training tuple
     * @return distance value
     */
    public double calDistance(List<Double> d1, List<Double> d2) {
        double distance = 0.00;
        for (int i = 0; i < d1.size(); i++)
            distance += (d1.get(i) - d2.get(i)) * (d1.get(i) - d2.get(i));
        return distance;
    }

    /**
     * Perform the KNN algorithm to get the category of the test tuple.
     * @param datas    training data set; the last element of each tuple is its class label
     * @param testData test tuple
     * @param k        the K value
     * @return category of the test tuple
     */
    public String knn(List<List<Double>> datas, List<Double> testData, int k) {
        PriorityQueue<KNNNode> pq = new PriorityQueue<KNNNode>(k, comparator);
        List<Integer> randNum = getRandKNum(k, datas.size());
        // Initialize the queue with k randomly chosen training tuples.
        for (int i = 0; i < k; i++) {
            int index = randNum.get(i);
            List<Double> currData = datas.get(index);
            String c = currData.get(currData.size() - 1).toString();
            KNNNode node = new KNNNode(index, calDistance(testData, currData), c);
            pq.add(node);
        }
        // Traverse the training set; whenever a tuple is closer than the current
        // farthest neighbor (the head of the queue), it replaces that neighbor.
        for (int i = 0; i < datas.size(); i++) {
            if (randNum.contains(i))
                continue; // already in the queue from the initialization step
            List<Double> t = datas.get(i);
            double distance = calDistance(testData, t);
            KNNNode top = pq.peek();
            if (top.getDistance() > distance) {
                pq.remove();
                pq.add(new KNNNode(i, distance, t.get(t.size() - 1).toString()));
            }
        }
        return getMostClass(pq);
    }

    /**
     * Get the majority class of the K nearest neighbor tuples.
     * @param pq priority queue storing the K nearest neighbor tuples
     * @return name of the majority class
     */
    private String getMostClass(PriorityQueue<KNNNode> pq) {
        Map<String, Integer> classCount = new HashMap<String, Integer>();
        int pqSize = pq.size();
        for (int i = 0; i < pqSize; i++) {
            KNNNode node = pq.remove();
            String c = node.getC();
            if (classCount.containsKey(c))
                classCount.put(c, classCount.get(c) + 1);
            else
                classCount.put(c, 1);
        }
        int maxIndex = -1;
        int maxCount = 0;
        Object[] classes = classCount.keySet().toArray();
        for (int i = 0; i < classes.length; i++) {
            if (classCount.get(classes[i]) > maxCount) {
                maxIndex = i;
                maxCount = classCount.get(classes[i]);
            }
        }
        return classes[maxIndex].toString();
    }
}
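The KNNNode class used above is not included in the original listing; a minimal version consistent with the calls made by the KNN class (an index, a distance, a class label, and their getters) could look like this:

public class KNNNode {
    private int index;        // index of the training tuple in the data set
    private double distance;  // distance from the training tuple to the test tuple
    private String c;         // class label of the training tuple

    public KNNNode(int index, double distance, String c) {
        this.index = index;
        this.distance = distance;
        this.c = c;
    }

    public int getIndex() { return index; }
    public void setIndex(int index) { this.index = index; }

    public double getDistance() { return distance; }
    public void setDistance(double distance) { this.distance = distance; }

    public String getC() { return c; }
    public void setC(String c) { this.c = c; }
}

With this helper in place, the knn() method above can be called on data such as the movie table at the start of this article, with the class label encoded as a number in the last element of each training tuple.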
