# 20151014_KNN: a distance-based classification algorithm

1. Principle

KNN computes the distance from every training sample to the tuple to be classified and takes the K nearest training samples. The tuple is assigned to whichever class holds the majority among those K samples.

Training samples are described by n-dimensional numeric attributes, so each sample represents a point in an n-dimensional space. All training samples are stored in this n-dimensional pattern space. Given an unknown sample, the K-nearest-neighbor classifier searches the pattern space for the K training samples closest to it.

2. Required Information

• A training set
• A distance metric
• The number of nearest neighbors, K
1. Calculate the distance between two points
   1. For example, the Euclidean distance can be used: d = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
2. Determine the classification result from the nearest-neighbor list
   1. Method one: select the class label held by the majority of the K nearest neighbors
   2. Method two: weight each vote by distance, using the weight factor w = 1/d^2
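
The distance calculation and the two voting methods above can be sketched in Java as follows. This is only a minimal illustration under stated assumptions: the class `KnnVoting`, the `Neighbor` type, and the small epsilon that guards against division by zero when d = 0 are hypothetical names and choices introduced here, not part of the original article.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; all names here are illustrative.
final class KnnVoting {

    /** One of the K nearest neighbors: its distance to the query and its class label. */
    static final class Neighbor {
        final double distance;
        final String label;

        Neighbor(double distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    /** Euclidean distance between two points of the same dimension. */
    static double euclidean(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Method one: every neighbor casts one vote; the most frequent label wins. */
    static String majorityVote(List<Neighbor> neighbors) {
        Map<String, Double> votes = new HashMap<>();
        for (Neighbor n : neighbors) {
            votes.merge(n.label, 1.0, Double::sum);
        }
        return argMax(votes);
    }

    /** Method two: each vote is weighted by w = 1 / d^2. */
    static String weightedVote(List<Neighbor> neighbors) {
        final double eps = 1e-12; // assumption: avoids division by zero when d == 0
        Map<String, Double> votes = new HashMap<>();
        for (Neighbor n : neighbors) {
            double w = 1.0 / (n.distance * n.distance + eps);
            votes.merge(n.label, w, Double::sum);
        }
        return argMax(votes);
    }

    /** Returns the label with the largest accumulated vote (ties broken arbitrarily). */
    private static String argMax(Map<String, Double> votes) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() > bestScore) {
                bestScore = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

The two methods can disagree. With one class-A neighbor at distance 0.5 and two class-B neighbors at distance 2.0, method one picks B (two votes to one), while method two picks A (a weight of about 4 against 0.25 + 0.25).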
3. Selection of K values

If K is too small, the classifier is too sensitive to noise in the data.

If K is too large, the neighborhood may include points from other classes.

An empirical rule of thumb is K ≤ √q, where q is the number of training tuples. Commercial algorithms typically use 10 as the default value.
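
As a small illustration of the rule of thumb, the sketch below assumes the bound reads k ≤ √q with q the number of training tuples. The class `KSelection` and both method names are hypothetical, and capping the default of 10 by that bound is an assumption made here rather than something the text prescribes.

```java
// Hypothetical helper; names and the capping policy are illustrative assumptions.
final class KSelection {

    /** Largest K allowed by the rule of thumb k <= sqrt(q), never below 1. */
    static int maxK(int trainingTuples) {
        if (trainingTuples < 1) {
            throw new IllegalArgumentException("need at least one training tuple");
        }
        return Math.max(1, (int) Math.floor(Math.sqrt(trainingTuples)));
    }

    /** Starts from the common default of 10 and caps it by the rule of thumb. */
    static int defaultK(int trainingTuples) {
        return Math.min(10, maxK(trainingTuples));
    }
}
```

For example, with 50 training tuples the bound gives K ≤ 7, so the default of 10 would be reduced to 7; with 10,000 tuples the default of 10 is kept.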

4. General description

Algorithm: K-nearest-neighbor classification

    Input:  training data T; number of nearest neighbors K; tuple t to be classified.
    Output: class c.

    (1)  N = ∅;                        // initialize the neighbor set
    (2)  FOR each d ∈ T DO BEGIN
    (3)    IF |N| < K THEN             // the size of N is kept at K
    (4)      N = N ∪ {d};
    (5)    ELSE
    (6)      IF ∃ u ∈ N such that sim(t, u) < sim(t, d) THEN BEGIN
               /* If N contains some u whose similarity to t is lower than
                  that of d (the strict inequality guarantees u ≠ d), remove
                  u from N and add d as a new member of N. */
    (7)        N = N - {u};            // remove u
    (8)        N = N ∪ {d};            // add d
    (9)      END
    (10) END
    (11) c = the class to which the most u ∈ N belong.
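
The pseudocode above can be sketched in Java roughly as follows. This is an illustrative sketch, not the implementation referenced later in this post. It makes several assumptions: similarity is taken to be inversely related to squared Euclidean distance, the neighbor to evict is always the least similar one (the pseudocode allows any less similar u), and the names `KnnClassifier`, `Sample`, and `Scored` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch of the pseudocode; names are illustrative.
final class KnnClassifier {

    /** A labeled training sample. */
    static final class Sample {
        final double[] features;
        final String label;

        Sample(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** A candidate neighbor: its distance to the query tuple and its label. */
    private static final class Scored {
        final double distance;
        final String label;

        Scored(double distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    /** Squared Euclidean distance; a smaller value means a higher similarity. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    static String classify(List<Sample> training, double[] t, int k) {
        if (k < 1 || training.isEmpty()) {
            throw new IllegalArgumentException("need k >= 1 and a non-empty training set");
        }
        // N: a max-heap on distance, so the least similar neighbor sits on top.
        PriorityQueue<Scored> neighbors =
                new PriorityQueue<>((a, b) -> Double.compare(b.distance, a.distance));
        for (Sample d : training) {
            double dist = squaredDistance(d.features, t);
            if (neighbors.size() < k) {
                neighbors.add(new Scored(dist, d.label));      // step (4)
            } else if (neighbors.peek().distance > dist) {     // step (6)
                neighbors.poll();                              // step (7): remove u
                neighbors.add(new Scored(dist, d.label));      // step (8): add d
            }
        }
        // Step (11): the class held by the most members of N.
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (Scored u : neighbors) {
            int c = counts.merge(u.label, 1, Integer::sum);
            if (c > bestCount) {
                bestCount = c;
                best = u.label;
            }
        }
        return best;
    }
}
```

Keeping the neighbor set in a heap of size K means each training sample costs at most O(log K) to process, instead of scanning all of N to find a less similar member.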

Advantages: The principle is simple and the algorithm is easy to implement. It supports incremental learning. It can model complex decision spaces with hyper-polygonal boundaries.

Disadvantages: The computational overhead is high, so it requires efficient storage techniques and support from parallel hardware.

5. Java implementations

See: Java implementation of the K-nearest-neighbor (KNN) algorithm

