Introduction
The full name of the KNN algorithm is K-Nearest Neighbor, meaning the K nearest neighbors. KNN is also a classification algorithm, but compared with the decision tree algorithm covered earlier, it is the simplest one. The main steps of the algorithm are:
1. Start with a training set in which every sample has already been assigned a class.
2. Take a test sample A, compute the Euclidean distance from A to every sample in the training set, and sort the distances.
3. Select the K training samples closest to A.
4. Among those K training samples, find the class that occurs most often; that majority class is the final classification of test sample A. (A compact sketch of these four steps follows this list.)
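To make the four steps concrete before the full program below, here is a minimal self-contained sketch; the class KnnSketch and its tiny 2-dimensional data set are made up purely for illustration and are not part of the implementation that follows:

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KnnSketch {
    public static void main(String[] args) {
        // step 1: a labeled training set (class name plus a 2-dimensional feature vector)
        String[] labels = { "A", "A", "B", "B" };
        int[][] trainSet = { { 1, 2 }, { 2, 1 }, { 8, 9 }, { 9, 8 } };
        int[] testPoint = { 2, 2 };
        int k = 3;

        // step 2: squared Euclidean distance from the test point to every training sample
        int[] dist = new int[trainSet.length];
        Integer[] order = new Integer[trainSet.length];
        for (int i = 0; i < trainSet.length; i++) {
            order[i] = i;
            for (int j = 0; j < testPoint.length; j++) {
                int d = trainSet[i][j] - testPoint[j];
                dist[i] += d * d;
            }
        }
        // sort the sample indices by distance
        Arrays.sort(order, Comparator.comparingInt(i -> dist[i]));

        // steps 3 and 4: count the classes of the k nearest samples and take the majority
        Map<String, Integer> count = new HashMap<>();
        for (int i = 0; i < k; i++) {
            count.merge(labels[order[i]], 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : count.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        System.out.println("Predicted class: " + best); // prints A
    }
}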
The following diagram, taken from Baidu Encyclopedia, illustrates the idea (the green point is the test sample to be classified; the other points are the labeled training set):
KNN Algorithm Implementation
The program requires two pieces of input. The first is the training set: data that has already been assigned classes, like the non-green points in the diagram above. The second is the test data, the green point in the diagram; of course, the test data here is not a single point but a group of points. The distance between two samples is computed from their feature vectors, which can be multidimensional: similarity is measured by the Euclidean distance between the two feature vectors. Define the training set data in trainInput.txt:
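The file contents did not survive in this copy of the post, so here is a hypothetical example of the format the parsing code below expects (each line holds the class name followed by space-separated feature values; these particular numbers are invented):

A 1 2 3 4 5
A 2 3 4 3 2
B 8 7 6 7 8
C 3 3 4 4 5
D 0 0 1 0 -1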
The data to be tested goes in testInput.txt and contains only the feature vector values:
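This file is also not reproduced here; judging from the feature vectors echoed in the execution output further below, its lines look like the following (only the first three are shown):

1 2 3 2 4
2 3 4 2 1
8 7 2 3 5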
Here is the main program:
package DataMining_KNN;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * k-Nearest Neighbor algorithm tool class
 *
 * @author lyq
 */
public class KnnTool {
    // weights for the 4 class types; by default all weights are the same
    public int[] classWeightArray = new int[] { 1, 1, 1, 1 };
    // path of the test data file
    private String testDataPath;
    // path of the training set data file
    private String trainDataPath;
    // the different class types
    private ArrayList<String> classTypes;
    // result data (the test samples, which receive a class name)
    private ArrayList<Sample> resultSamples;
    // training set sample container
    private ArrayList<Sample> trainSamples;
    // training set data
    private String[][] trainData;
    // test set data
    private String[][] testData;

    public KnnTool(String trainDataPath, String testDataPath) {
        this.trainDataPath = trainDataPath;
        this.testDataPath = testDataPath;
        readDataFormFile();
    }

    /**
     * Read the test data and the training data set from file
     */
    private void readDataFormFile() {
        ArrayList<String[]> tempArray;

        tempArray = fileDataToArray(trainDataPath);
        trainData = new String[tempArray.size()][];
        tempArray.toArray(trainData);

        classTypes = new ArrayList<>();
        for (String[] s : tempArray) {
            if (!classTypes.contains(s[0])) {
                // add the class type
                classTypes.add(s[0]);
            }
        }

        tempArray = fileDataToArray(testDataPath);
        testData = new String[tempArray.size()][];
        tempArray.toArray(testData);
    }

    /**
     * Read a file and return its contents as a list of token arrays
     *
     * @param filePath
     *            path of the data file
     */
    private ArrayList<String[]> fileDataToArray(String filePath) {
        File file = new File(filePath);
        ArrayList<String[]> dataArray = new ArrayList<String[]>();

        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String str;
            String[] tempArray;
            while ((str = in.readLine()) != null) {
                tempArray = str.split(" ");
                dataArray.add(tempArray);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return dataArray;
    }

    /**
     * Compute the Euclidean distance between the feature vectors of two
     * samples (the square root is omitted, which does not change the ranking)
     *
     * @param s1
     *            first sample to compare
     * @param s2
     *            second sample to compare
     */
    private int computeEuclideanDistance(Sample s1, Sample s2) {
        String[] f1 = s1.getFeatures();
        String[] f2 = s2.getFeatures();
        int distance = 0;

        for (int i = 0; i < f1.length; i++) {
            int subF1 = Integer.parseInt(f1[i]);
            int subF2 = Integer.parseInt(f2[i]);
            distance += (subF1 - subF2) * (subF1 - subF2);
        }

        return distance;
    }

    /**
     * Compute the k nearest neighbors
     *
     * @param k
     *            size of the neighborhood
     */
    public void knnCompute(int k) {
        String className = "";
        String[] tempF = null;
        Sample temp;
        resultSamples = new ArrayList<>();
        trainSamples = new ArrayList<>();
        // per-class vote count
        HashMap<String, Integer> classCount;
        // per-class weight
        HashMap<String, Integer> classWeight = new HashMap<>();

        // first, wrap the test data as result samples
        for (String[] s : testData) {
            temp = new Sample(s);
            resultSamples.add(temp);
        }

        // the first token of a training line is the class name,
        // the remaining tokens are the feature values
        for (String[] s : trainData) {
            className = s[0];
            tempF = new String[s.length - 1];
            System.arraycopy(s, 1, tempF, 0, s.length - 1);
            temp = new Sample(className, tempF);
            trainSamples.add(temp);
        }

        // the k nearest training samples, taken from the sorted list
        ArrayList<Sample> knnSample = new ArrayList<>();

        // find the k nearest training samples for every test sample
        for (Sample s : resultSamples) {
            classCount = new HashMap<>();
            int index = 0;
            for (String type : classTypes) {
                // counts start at 0
                classCount.put(type, 0);
                classWeight.put(type, classWeightArray[index++]);
            }

            for (Sample ts : trainSamples) {
                int dis = computeEuclideanDistance(s, ts);
                ts.setDistance(dis);
            }

            Collections.sort(trainSamples);
            knnSample.clear();
            // take the first k samples as the basis for classification
            for (int i = 0; i < trainSamples.size(); i++) {
                if (i < k) {
                    knnSample.add(trainSamples.get(i));
                } else {
                    break;
                }
            }

            // the majority class among the k training samples decides the
            // result; each neighbor adds its class weight (all 1 by default,
            // which reduces to a plain count)
            for (Sample s1 : knnSample) {
                int num = classCount.get(s1.getClassName());
                num += classWeight.get(s1.getClassName());
                classCount.put(s1.getClassName(), num);
            }

            int maxCount = 0;
            // pick the class with the maximum count among the k samples
            for (Map.Entry<String, Integer> entry : classCount.entrySet()) {
                if (entry.getValue() > maxCount) {
                    maxCount = entry.getValue();
                    s.setClassName(entry.getKey());
                }
            }

            System.out.print("Test data features: ");
            for (String s1 : s.getFeatures()) {
                System.out.print(s1 + " ");
            }
            System.out.println("Category: " + s.getClassName());
        }
    }
}
Sample Data class:
package DataMining_KNN;

/**
 * Sample data class
 *
 * @author lyq
 */
public class Sample implements Comparable<Sample> {
    // classification name of the sample
    private String className;
    // feature vector of the sample
    private String[] features;
    // distance to the test sample, used for sorting; this must be the
    // Integer reference type so that compareTo() can be delegated to it
    private Integer distance;

    public Sample(String[] features) {
        this.features = features;
    }

    public Sample(String className, String[] features) {
        this.className = className;
        this.features = features;
    }

    public String getClassName() {
        return className;
    }

    public void setClassName(String className) {
        this.className = className;
    }

    public String[] getFeatures() {
        return features;
    }

    public void setFeatures(String[] features) {
        this.features = features;
    }

    public Integer getDistance() {
        return distance;
    }

    public void setDistance(int distance) {
        this.distance = distance;
    }

    @Override
    public int compareTo(Sample o) {
        // sort ascending by distance, delegating to Integer.compareTo()
        return this.getDistance().compareTo(o.getDistance());
    }
}
Test Scenario Class:
package DataMining_KNN;

/**
 * k-Nearest Neighbor algorithm scenario class
 *
 * @author lyq
 */
public class Client {
    public static void main(String[] args) {
        String trainDataPath = "C:\\Users\\lyq\\Desktop\\icon\\trainInput.txt";
        String testDataPath = "C:\\Users\\lyq\\Desktop\\icon\\testInput.txt";

        KnnTool tool = new KnnTool(trainDataPath, testDataPath);
        tool.knnCompute(3);
    }
}
The result of the execution is:
Test data features: 1 2 3 2 4 Category: A
Test data features: 2 3 4 2 1 Category: C
Test data features: 8 7 2 3 5 Category: B
Test data features: -3 -2 2 2 4 Category: A
Test data features: 0 -4 -4 -4 -4 Category: D
Test data features: 4 1 3 4 4 Category: A
Test data features: 4 4 3 2 1 Category: B
Test data features: 3 3 3 2 4 Category: C
Test data features: 0 0 1 1 -2 Category: D
The output of the program is shown above; if you doubt it, you can verify it by hand.
Notes on the KNN algorithm:
1. The training set for KNN must be reasonably balanced: the number of samples of each class should be roughly equal. Otherwise, if there are 1000 samples of class A and only 100 of class B, class A will dominate the vote simply by weight of numbers.
2. If KNN decides purely by majority count, it can still be optimized: for example, give nearer samples a larger weight and then compare the total weight accumulated by each class, instead of relying on raw counts alone (see the first sketch after this list).
3. The disadvantage of KNN is its large amount of computation, which is also visible in the program: every test sample has its Euclidean distance computed against the entire training set, so with n training samples and a comparable number of test samples the time complexity reaches O(n²). If n is very large, the cost of the algorithm is indeed considerable, so KNN is not suitable for classifying large-scale data (the second sketch after this list shows one standard way to reduce the per-query cost).
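For point 2, one common way to realize weighted voting is to let each of the k neighbors vote with a weight that shrinks with its distance. This is a minimal sketch, assuming the k nearest Sample objects have already been selected; the 1/(1+distance) weighting is my choice here, not the post's code:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedVote {
    // distance-weighted voting over the k nearest samples: each neighbor
    // contributes 1 / (1 + distance), so near samples carry a large weight
    // and far samples a small one
    public static String classify(List<Sample> kNearest) {
        Map<String, Double> weightSum = new HashMap<>();
        for (Sample s : kNearest) {
            weightSum.merge(s.getClassName(), 1.0 / (1.0 + s.getDistance()), Double::sum);
        }
        String best = null;
        double bestWeight = -1.0;
        for (Map.Entry<String, Double> e : weightSum.entrySet()) {
            if (e.getValue() > bestWeight) {
                bestWeight = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}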
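For point 3, the full sort inside knnCompute can be avoided: a bounded max-heap keeps only the k best candidates seen so far, which lowers the per-query cost from O(n log n) to roughly O(n log k). This is a standard optimization, not part of the post's code; the sketch reuses the Sample class from above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopKSelector {
    // returns the k nearest samples without sorting the whole training list;
    // assumes setDistance() has already been called on every sample
    public static List<Sample> selectK(List<Sample> trainSamples, int k) {
        // max-heap on distance (Sample already compares by distance), so the
        // root is always the farthest of the k candidates kept so far
        PriorityQueue<Sample> heap = new PriorityQueue<>(k, Collections.reverseOrder());
        for (Sample s : trainSamples) {
            if (heap.size() < k) {
                heap.offer(s);
            } else if (s.getDistance() < heap.peek().getDistance()) {
                heap.poll(); // evict the current farthest candidate
                heap.offer(s);
            }
        }
        return new ArrayList<>(heap); // order within the k is unspecified
    }
}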
Difficulties encountered while coding the KNN algorithm:
An algorithm as simple as KNN supposedly should not pose much difficulty, but I was stuck for quite a while on sorting by Euclidean distance. I initially used Collections.sort(list) to sort by distance, and the Sample class implemented the Comparable interface, yet the ordering never changed. It finally worked only after I changed the distance field from the primitive int type to the Integer reference type and called distance.compareTo() inside the overridden compareTo() method. This is the kind of small detail one usually never notices: must a property comparison ultimately call compareTo() on a reference type? This small problem actually cost me quite some time; only after carefully comparing examples online did I finally find it (both variants are sketched below).
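To make the pitfall concrete: a primitive int field has no compareTo() method, so the delegation only compiles once the field is boxed to Integer. Since Java 7 there is also the static Integer.compare(), which lets the field stay primitive. Neither snippet below is from the original post; both are minimal illustrations:

// Variant 1: box the field, as the Sample class above ends up doing.
class BoxedByDistance implements Comparable<BoxedByDistance> {
    private Integer distance; // reference type, so it has a compareTo() method

    @Override
    public int compareTo(BoxedByDistance o) {
        return this.distance.compareTo(o.distance); // delegates to Integer.compareTo
    }
}

// Variant 2: keep the primitive field and use the static helper instead.
class PrimitiveByDistance implements Comparable<PrimitiveByDistance> {
    private int distance; // primitive type: distance.compareTo(...) would not compile

    @Override
    public int compareTo(PrimitiveByDistance o) {
        return Integer.compare(this.distance, o.distance);
    }
}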