"Java implementation" K-Nearest-neighbor (KNN) Classification algorithm

Source: Internet
Author: User
Tags: comparable, pow

The KNN algorithm is a supervised learning algorithm and a very simple method for classification. Simply put, KNN classifies a sample by measuring the distances between different feature vectors. The specific algorithm is as follows:

1) Calculate the distance between every point in the known-category data set and the current point.

2) Sort the points by increasing distance.

3) Select the k points nearest to the current point.

4) Determine the frequency of each category among those k points.

5) Return the most frequent category among the k points as the predicted classification of the current point.
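The five steps above can be sketched as a minimal, self-contained Java program. The 2-D points, labels, and value of k below are invented for illustration; they are not the dating data set used later in this article:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the five KNN steps on 2-D points; data and labels are made up.
public class KnnSketch {

    static int classify(double[] query, List<double[]> points, List<Integer> labels, int k) {
        // 1) distance from every known point to the current (query) point
        List<Integer> idx = new ArrayList<>();
        double[] dist = new double[points.size()];
        for (int i = 0; i < points.size(); i++) {
            double dx = points.get(i)[0] - query[0];
            double dy = points.get(i)[1] - query[1];
            dist[i] = Math.sqrt(dx * dx + dy * dy);
            idx.add(i);
        }
        // 2) sort by increasing distance
        idx.sort(Comparator.comparingDouble((Integer i) -> dist[i]));
        // 3) + 4) count category frequencies among the k nearest points
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(labels.get(idx.get(i)), 1, Integer::sum);
        }
        // 5) return the most frequent category
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }

    public static void main(String[] args) {
        List<double[]> pts = List.of(
                new double[]{1.0, 1.1}, new double[]{1.0, 1.0},
                new double[]{0.0, 0.0}, new double[]{0.0, 0.1});
        List<Integer> labels = List.of(1, 1, 2, 2);
        // The query point sits near the label-2 cluster.
        System.out.println(classify(new double[]{0.1, 0.2}, pts, labels, 3)); // prints 2
    }
}
```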


The data set comes from the dating-site example in the book Machine Learning in Action, which fits this case well. The format is as follows:


The four columns are: frequent flyer miles earned per year, percentage of time spent playing video games, liters of ice cream consumed per week, and the evaluation of the person. To convert the last evaluation into a number, I defined the rule as follows: didntLike is 1, smallDoses is 2, largeDoses is 3. There are 900 training samples in total, plus another 100 test samples in the same format, used at the end to compute the error rate.

Plotting two of the columns of this three-dimensional data set as a scatter plot gives the following:



Different choices of features produce different plots, but both figures show that the experimental data set exhibits clustering, which makes classification easier.

package knn;

/**
 * @author Shenchao
 *
 * Encapsulates a single data sample.
 */
public class Data implements Comparable<Data> {

    /** Frequent flyer miles earned per year */
    private double mile;
    /** Percentage of time spent playing video games */
    private double time;
    /** Liters of ice cream consumed per week */
    private double icecream;
    /** 1 = didntLike, 2 = smallDoses, 3 = largeDoses */
    private int type;
    /** Distance between two samples */
    private double distance;

    public double getMile() { return mile; }
    public void setMile(double mile) { this.mile = mile; }
    public double getTime() { return time; }
    public void setTime(double time) { this.time = time; }
    public double getIcecream() { return icecream; }
    public void setIcecream(double icecream) { this.icecream = icecream; }
    public int getType() { return type; }
    public void setType(int type) { this.type = type; }
    public double getDistance() { return distance; }
    public void setDistance(double distance) { this.distance = distance; }

    /*
     * Sort by ascending distance.
     * @see java.lang.Comparable#compareTo(java.lang.Object)
     */
    @Override
    public int compareTo(Data o) {
        if (this.distance < o.getDistance()) {
            return -1;
        } else if (this.distance > o.getDistance()) {
            return 1;
        }
        return 0;
    }
}
This class encapsulates a single sample. To allow sorting later, it implements the Comparable interface and overrides the compareTo method.

package knn;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KNN {

    private List<Data> dataSet = null;

    public KNN(String fileName) throws IOException {
        dataSet = initDataSet(fileName);
    }

    private List<Data> initDataSet(String fileName) throws IOException {
        List<Data> list = new ArrayList<Data>();
        BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(
                KNN.class.getClassLoader().getResourceAsStream(fileName)));
        String line = null;
        while ((line = bufferedReader.readLine()) != null) {
            Data data = new Data();
            String[] s = line.split("\t");
            data.setMile(Double.parseDouble(s[0]));
            data.setTime(Double.parseDouble(s[1]));
            data.setIcecream(Double.parseDouble(s[2]));
            if (s[3].equals("largeDoses")) {
                data.setType(3);
            } else if (s[3].equals("smallDoses")) {
                data.setType(2);
            } else {
                data.setType(1);
            }
            list.add(data);
        }
        return list;
    }

    /**
     * Algorithm core.
     *
     * @param data    the query sample
     * @param dataSet the training set
     * @param k       the number of neighbors
     */
    public int knn(Data data, List<Data> dataSet, int k) {
        for (Data data2 : dataSet) {
            double distance = calDistance(data, data2);
            data2.setDistance(distance);
        }
        // Sort by ascending distance
        Collections.sort(dataSet);
        // Among the first k samples, find the most frequent category
        int type1 = 0, type2 = 0, type3 = 0;
        for (int i = 0; i < k; i++) {
            Data d = dataSet.get(i);
            if (d.getType() == 1) {
                ++type1;
            } else if (d.getType() == 2) {
                ++type2;
            } else {
                ++type3;
            }
        }
        // System.out.println(type1 + "========" + type2 + "=========" + type3);
        if (type1 > type2) {
            return type1 > type3 ? 1 : 3;
        } else {
            return type2 > type3 ? 2 : 3;
        }
    }

    /**
     * Calculates the Euclidean distance between two sample points.
     */
    private double calDistance(Data data, Data data2) {
        double sum = Math.pow(data.getMile() - data2.getMile(), 2)
                + Math.pow(data.getTime() - data2.getTime(), 2)
                + Math.pow(data.getIcecream() - data2.getIcecream(), 2);
        return Math.sqrt(sum);
    }

    /**
     * Normalizes the data set:
     * newValue = (oldValue - min) / (max - min)
     */
    private List<Data> autoNorm(List<Data> oldDataSet) {
        List<Data> newDataSet = new ArrayList<Data>();
        // Find the max and min of each feature
        Map<String, Double> map = findMaxAndMin(oldDataSet);
        for (Data data : oldDataSet) {
            data.setMile(calNewValue(data.getMile(), map.get("maxDistance"), map.get("minDistance")));
            data.setTime(calNewValue(data.getTime(), map.get("maxTime"), map.get("minTime")));
            data.setIcecream(calNewValue(data.getIcecream(), map.get("maxIcecream"), map.get("minIcecream")));
            newDataSet.add(data);
        }
        return newDataSet;
    }

    /**
     * @return newValue = (oldValue - min) / (max - min)
     */
    private double calNewValue(double oldValue, double maxValue, double minValue) {
        return (oldValue - minValue) / (maxValue - minValue);
    }

    /**
     * Finds the max and min of each feature.
     */
    private Map<String, Double> findMaxAndMin(List<Data> oldDataSet) {
        Map<String, Double> map = new HashMap<String, Double>();
        // Initialize with infinities so any real value replaces them
        double maxDistance = Double.NEGATIVE_INFINITY;
        double minDistance = Double.POSITIVE_INFINITY;
        double maxTime = Double.NEGATIVE_INFINITY;
        double minTime = Double.POSITIVE_INFINITY;
        double maxIcecream = Double.NEGATIVE_INFINITY;
        double minIcecream = Double.POSITIVE_INFINITY;
        for (Data data : oldDataSet) {
            if (data.getMile() > maxDistance) maxDistance = data.getMile();
            if (data.getMile() < minDistance) minDistance = data.getMile();
            if (data.getTime() > maxTime) maxTime = data.getTime();
            if (data.getTime() < minTime) minTime = data.getTime();
            if (data.getIcecream() > maxIcecream) maxIcecream = data.getIcecream();
            if (data.getIcecream() < minIcecream) minIcecream = data.getIcecream();
        }
        map.put("maxDistance", maxDistance);
        map.put("minDistance", minDistance);
        map.put("maxTime", maxTime);
        map.put("minTime", minTime);
        map.put("maxIcecream", maxIcecream);
        map.put("minIcecream", minIcecream);
        return map;
    }

    /**
     * Renders the data set as a scatter plot.
     */
    public void show() {
        new ScatterPlotChart().showChart(dataSet);
    }

    /**
     * Takes 10% of the existing data as test data: 100 samples are used as
     * the test set and the rest as the training set.
     *
     * @throws IOException
     */
    public void test() throws IOException {
        List<Data> testDataSet = initDataSet("test.txt");
        // Normalize the data
        List<Data> newTestDataSet = autoNorm(testDataSet);
        List<Data> newDataSet = autoNorm(dataSet);
        int errorCount = 0;
        for (Data data : newTestDataSet) {
            int type = knn(data, newDataSet, 6);
            if (type != data.getType()) {
                ++errorCount;
            }
        }
        System.out.println("Error rate: " + (double) errorCount / testDataSet.size() * 100 + "%");
    }

    public static void main(String[] args) throws IOException {
        KNN knn = new KNN("datingTestSet.txt");
        knn.test();
    }
}
The distance between two sample points is still calculated as the Euclidean distance, but the important step is normalizing the data set first. A feature whose values span a much larger numeric range has a disproportionate impact on the result; in other words, the annual frequent-flyer miles would dominate the distance calculation far more than the other two features. So each feature of the data set is mapped to [0, 1].
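As a quick, self-contained illustration of the min-max scaling step (the mileage values below are invented, not taken from the actual data set):

```java
// Min-max normalization: newValue = (oldValue - min) / (max - min).
public class NormSketch {

    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            if (v < min) min = v;
            if (v > max) max = v;
        }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - min) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        // Raw frequent-flyer miles span tens of thousands...
        double[] miles = {40920, 14488, 26052, 75136};
        double[] scaled = normalize(miles);
        for (double v : scaled) System.out.printf("%.3f%n", v);
        // ...after scaling they all land in [0, 1], so they no longer swamp
        // small-range features (like game-time percentage) in the distance.
    }
}
```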

Finally, dividing the number of errors on the test set by the total number of test samples gives the error rate. The program reports 8%, which is acceptable.

Shortcomings of the algorithm: KNN is one of the simplest and most effective algorithms for classifying data, but it must keep the entire data set, so a large training set requires a lot of storage. In addition, because the distance to every sample in the data set must be calculated for each query, it can be very time-consuming in practice. Another flaw is that it provides no information about the underlying structure of the data, so we cannot know what an average instance or a typical sample of each class looks like.

If you have any questions, you are welcome to discuss and exchange ideas with me.


