1. Principle
To classify a tuple, compute the distance from the tuple to every sample in the training set, take the K training samples nearest to it, and assign the tuple to whichever class holds the majority among those K samples.
Training samples are described by n-dimensional numeric attributes, so each sample is a point in an n-dimensional space, and all training samples are stored in that n-dimensional pattern space. Given an unknown sample, the classifier searches the pattern space for the K training samples nearest to it.
2. Required Information
- Training set
- A distance metric
- The number of nearest neighbors, K
- Calculating the distance between two points
- For example, the Euclidean distance can be used: d = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
- Determining the classification result from the nearest-neighbor list
- Method one: take the class label held by the majority of the K nearest neighbors
- Method two: weight each neighbor's vote by its distance, e.g. with the weight factor w = 1/d^2
- Selection of the value of K
If K is too small, the classifier is overly sensitive to noise in the data;
If K is too large, the neighborhood may contain points of other classes;
An empirical rule of thumb is K ≤ sqrt(q), where q is the number of training tuples. Commercial algorithms typically use 10 as the default value.
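The distance metric and the two voting weights above can be sketched in Java. This is a minimal illustration, not code from the referenced implementation; the class and method names are my own.

```java
// Sketch of the Euclidean distance and the distance-based vote weight w = 1/d^2.
// Assumes feature vectors are plain double arrays of equal length.
public class KnnMath {
    // Euclidean distance between two n-dimensional points:
    // d = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
    static double euclidean(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // "Method two" vote weight: closer neighbors count more, w = 1 / d^2.
    static double weight(double d) {
        return 1.0 / (d * d);
    }

    public static void main(String[] args) {
        double[] a = {0.0, 0.0};
        double[] b = {3.0, 4.0};
        System.out.println(euclidean(a, b));          // prints 5.0
        System.out.println(weight(euclidean(a, b)));  // prints 0.04
    }
}
```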
3. General description
Algorithm: K-nearest-neighbor classification.
Input: training set T; number of nearest neighbors K; tuple t to be classified.
Output: category c.
(1) N = ∅; // the neighbor set
(2) FOR each d ∈ T DO BEGIN
(3)   IF |N| ≤ K THEN // keep the size of N at K
(4)     N = N ∪ {d};
(5)   ELSE
(6)   IF ∃ u ∈ N such that sim(t, u) < sim(t, d) THEN BEGIN
      /* If N contains a sample u whose similarity to t is lower than d's similarity to t
         (the strict inequality guarantees u != d), then d can replace u in N:
         remove u from N, and d becomes a new member of N. */
(7)     N = N - {u}; // remove u
(8)     N = N ∪ {d}; // add d
(9)   END
(10) END
(11) c = the class to which the most u ∈ N belong.
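The pseudocode above can be sketched in Java as follows. This is an assumed illustration, not the referenced implementation: the `Sample` record and method names are my own, and Euclidean distance stands in for the similarity function sim (smaller distance = higher similarity, so the comparison direction flips relative to the pseudocode).

```java
import java.util.*;

public class Knn {
    // Assumed data shape: a feature vector plus a class label.
    record Sample(double[] x, String label) {}

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    // Classify t by majority vote among the k nearest training samples.
    static String classify(List<Sample> training, double[] t, int k) {
        // Max-heap by distance: peek() is the farthest of the current neighbors,
        // playing the role of u in steps (6)-(8) of the pseudocode.
        PriorityQueue<Sample> neighbors = new PriorityQueue<>(
            Comparator.comparingDouble((Sample s) -> dist(t, s.x())).reversed());
        for (Sample d : training) {
            if (neighbors.size() < k) {
                neighbors.add(d);                                  // N = N ∪ {d}
            } else if (dist(t, d.x()) < dist(t, neighbors.peek().x())) {
                neighbors.poll();                                  // N = N - {u}
                neighbors.add(d);                                  // N = N ∪ {d}
            }
        }
        // c = the class to which the most u ∈ N belong (unweighted majority vote).
        Map<String, Integer> votes = new HashMap<>();
        for (Sample s : neighbors) votes.merge(s.label(), 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Sample> train = List.of(
            new Sample(new double[]{1, 1}, "A"),
            new Sample(new double[]{1, 2}, "A"),
            new Sample(new double[]{8, 8}, "B"),
            new Sample(new double[]{9, 8}, "B"));
        System.out.println(classify(train, new double[]{1.5, 1.5}, 3)); // prints A
    }
}
```

The priority queue keeps the neighbor set at size K in O(log K) per update, which matches the intent of steps (3)-(9) without rescanning N on every insertion.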
4. KNN advantages and disadvantages
Advantages: the principle is simple and the implementation is straightforward; it supports incremental learning; and it can model complex, hyperpolygonal decision regions.
Disadvantages: high computational overhead; it requires efficient storage techniques and benefits from parallel hardware support.
5. Java implementation
Refer to: Java implementation of the K-nearest-neighbor (KNN) algorithm
20151014_KNN distance-based classification algorithm