When reprinting, please credit the source: http://blog.csdn.net/xiaojimanman/article/details/51064307
http://www.llwjy.com/blogdetail/f74b497c2ad6261b0ea651454b97a390.html
My personal blog is now online at www.llwjy.com ~ you are all welcome to come and leave your comments ~
-------------------------------------------------------------------------------------------------
Before we start, a small ad: I have created a QQ group, 321903218 ("Lucene Case Development"), mainly for discussing how to use Lucene to build a site's search backend; from time to time I also hold open classes in the group. Anyone interested is welcome to join.
The KNN algorithm, also known as the k-nearest-neighbor algorithm, is a commonly used classification algorithm in data mining. Its core idea: to classify a target sample, find the k known samples nearest to it (or most similar to it); the category that appears most often among those k samples is taken as the target's category. For example, with k = 7, we find the 7 samples from the data that are closest to the target (or have the highest similarity). If the categories of those 7 samples are A, B, C, A, A, A, B, then the target's category is A, because A appears most often among the 7 samples.
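To make that majority vote concrete before diving into the implementation, here is a minimal, self-contained sketch (independent of the classes developed below; the class name MajorityVoteDemo is purely illustrative) that counts the example's seven neighbor categories and picks the winner:

import java.util.HashMap;
import java.util.Map;

public class MajorityVoteDemo {
    public static void main(String[] args) {
        // categories of the 7 nearest samples from the example above
        String[] neighbors = {"A", "B", "C", "A", "A", "A", "B"};
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String c : neighbors) {
            counts.put(c, counts.containsKey(c) ? counts.get(c) + 1 : 1);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        System.out.println(best);  // prints "A" (4 of the 7 neighbors)
    }
}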
Algorithm Implementation
1. Definition of the training data format
Here is a simple introduction to implementing KNN classification in Java. First we need to store the training set (each record's attribute value and its category). Since the attribute type is not known in advance, we use a generic type for it; the category is stored as a String.
/** @Description: storage format for one record in the KNN classification model */
package com.lulei.datamining.knn.bean;

public class KnnValueBean<T> {
    private T value;        // record value
    private String typeId;  // category ID

    public KnnValueBean(T value, String typeId) {
        this.value = value;
        this.typeId = typeId;
    }

    public T getValue() {
        return value;
    }

    public void setValue(T value) {
        this.value = value;
    }

    public String getTypeId() {
        return typeId;
    }

    public void setTypeId(String typeId) {
        this.typeId = typeId;
    }
}
2. Definition of the K-nearest-neighbor category data format
When collecting the K nearest neighbors, we need to record the categories of those k samples together with their similarity scores, so we use the following data format:
/** @Description: category score for the K nearest neighbors */
package com.lulei.datamining.knn.bean;

public class KnnValueSort {
    private String typeId;  // category ID
    private double score;   // the category's similarity score

    public KnnValueSort(String typeId, double score) {
        this.typeId = typeId;
        this.score = score;
    }

    public String getTypeId() {
        return typeId;
    }

    public void setTypeId(String typeId) {
        this.typeId = typeId;
    }

    public double getScore() {
        return score;
    }

    public void setScore(double score) {
        this.score = score;
    }
}
3. Basic attributes of the KNN algorithm
In the KNN algorithm the most important parameter is the value of K, so the base class needs a K property, along with a list to store the known-category training data.
private List<KnnValueBean> dataArray;
private int K = 3;
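K should only be changed through a validated setter. For completeness, these are the getter and setter from the base class (they appear again in the full base-class source in section 8):

public int getK() {
    return K;
}

public void setK(int K) {
    if (K < 1) {
        throw new IllegalArgumentException("K must greater than 0");
    }
    this.K = K;
}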
4. Adding known classified data
Before KNN classification can be used, we need to add data whose categories are already known; that data is then used to predict the category of unknown data.
/**
 * @param value
 * @param typeId
 * @Author: lulei
 * @Description: add a record to the model
 */
public void addRecord(T value, String typeId) {
    if (dataArray == null) {
        dataArray = new ArrayList<KnnValueBean>();
    }
    dataArray.add(new KnnValueBean<T>(value, typeId));
}
5. Similarity (or distance) between two samples
In the KNN algorithm, the most important method is the one that determines the similarity (or distance) between two samples. Because we use generics, there is no universal way to compute the similarity between two arbitrary objects, so we declare it as an abstract method and leave the implementation to subclasses. Here the method is defined in terms of similarity: the larger the return value, the more similar the two samples (and the shorter the distance between them).
/**
 * @param o1
 * @param o2
 * @return
 * @Author: lulei
 * @Description: similarity between o1 and o2
 */
public abstract double similarScore(T o1, T o2);
6. Getting the classifications of the nearest K samples
The core idea of the KNN algorithm is to find the K nearest neighbors, so this step is the core of the whole algorithm. Here we use an array to hold the categories and similarity scores of the K most similar samples found so far: we loop over all the samples, and at every point during the loop the array holds the K samples most similar to the target among those examined so far. The implementation is as follows:
/**
 * @param value
 * @return
 * @Author: lulei
 * @Description: get the classifications of the nearest K samples
 */
private KnnValueSort[] getKType(T value) {
    int k = 0;
    KnnValueSort[] topK = new KnnValueSort[K];
    for (KnnValueBean<T> bean : dataArray) {
        double score = similarScore(bean.getValue(), value);
        if (k == 0) {
            // no records in the array yet, add directly
            topK[k] = new KnnValueSort(bean.getTypeId(), score);
            k++;
        } else if (!(k == K && score < topK[k - 1].getScore())) {
            // skip only when the array is full and the score is below the current minimum
            int i = 0;
            // locate the insertion point
            for (; i < k && score < topK[i].getScore(); i++);
            int j = k - 1;
            if (k < K) {
                j = k;
                k++;
            }
            for (; j > i; j--) {
                topK[j] = topK[j - 1];
            }
            topK[i] = new KnnValueSort(bean.getTypeId(), score);
        }
    }
    return topK;
}
7. Counting the most frequent category among the K samples
This step is a simple tally: the category that appears most often among the K samples is the predicted category for the target data.
/**
 * @param value
 * @return
 * @Author: lulei
 * @Description: KNN classification, judge the category of value
 */
public String getTypeId(T value) {
    KnnValueSort[] array = getKType(value);
    HashMap<String, Integer> map = new HashMap<String, Integer>(K);
    for (KnnValueSort bean : array) {
        if (bean != null) {
            if (map.containsKey(bean.getTypeId())) {
                map.put(bean.getTypeId(), map.get(bean.getTypeId()) + 1);
            } else {
                map.put(bean.getTypeId(), 1);
            }
        }
    }
    String maxTypeId = null;
    int maxCount = 0;
    Iterator<Entry<String, Integer>> iter = map.entrySet().iterator();
    while (iter.hasNext()) {
        Entry<String, Integer> entry = iter.next();
        if (maxCount < entry.getValue()) {
            maxCount = entry.getValue();
            maxTypeId = entry.getKey();
        }
    }
    return maxTypeId;
}
At this point, the abstract base class for KNN classification is complete. Before testing it, a few more remarks. KNN classification picks the category that occurs most often among the K nearest samples, which in some cases is not particularly reasonable: for example, with K = 5, suppose the 5 nearest samples have the categories A, A, B, B, B and the similarity scores 10, 9, 2, 2, 1. With the method above the predicted category is B, but looking at the data, A is clearly the more reasonable prediction. For this situation we suggest the following optimization of the KNN algorithm (only the idea, without full code): after obtaining the K most similar samples and their similarities, compute for each category a function of similarity and occurrence count, such as a weighted sum; the category with the largest function value is the predicted category. A sketch of this idea follows.
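A minimal sketch, assuming we add a hypothetical getTypeIdWeighted method to the base class (this method is not part of the original code) that sums each category's similarity scores instead of counting occurrences. It also assumes non-negative similarity scores; with negative scores, as in the test subclass below, the weighting function would need adjusting:

/** Hypothetical weighted variant of getTypeId: each neighbor votes with its
 *  similarity score rather than a plain count (assumes non-negative scores). */
public String getTypeIdWeighted(T value) {
    KnnValueSort[] array = getKType(value);
    HashMap<String, Double> map = new HashMap<String, Double>(K);
    for (KnnValueSort bean : array) {
        if (bean != null) {
            double old = map.containsKey(bean.getTypeId()) ? map.get(bean.getTypeId()) : 0;
            map.put(bean.getTypeId(), old + bean.getScore());
        }
    }
    String maxTypeId = null;
    double maxScore = Double.NEGATIVE_INFINITY;
    for (Entry<String, Double> entry : map.entrySet()) {
        if (entry.getValue() > maxScore) {
            maxScore = entry.getValue();
            maxTypeId = entry.getKey();
        }
    }
    return maxTypeId;
}

With the K = 5 example above, this gives A a weight of 19 and B a weight of 5, so the prediction becomes A.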
8. Base class source code
/** @Description: KNN classification */
package com.lulei.datamining.knn;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map.Entry;
import com.lulei.datamining.knn.bean.KnnValueBean;
import com.lulei.datamining.knn.bean.KnnValueSort;
import com.lulei.util.JsonUtil;

@SuppressWarnings({"rawtypes"})
public abstract class KnnClassification<T> {
    private List<KnnValueBean> dataArray;
    private int K = 3;

    public int getK() {
        return K;
    }

    public void setK(int K) {
        if (K < 1) {
            throw new IllegalArgumentException("K must greater than 0");
        }
        this.K = K;
    }

    /** @Description: add a record to the model */
    public void addRecord(T value, String typeId) {
        if (dataArray == null) {
            dataArray = new ArrayList<KnnValueBean>();
        }
        dataArray.add(new KnnValueBean<T>(value, typeId));
    }

    /** @Description: KNN classification, judge the category of value */
    public String getTypeId(T value) {
        KnnValueSort[] array = getKType(value);
        System.out.println(JsonUtil.parseJson(array));
        HashMap<String, Integer> map = new HashMap<String, Integer>(K);
        for (KnnValueSort bean : array) {
            if (bean != null) {
                if (map.containsKey(bean.getTypeId())) {
                    map.put(bean.getTypeId(), map.get(bean.getTypeId()) + 1);
                } else {
                    map.put(bean.getTypeId(), 1);
                }
            }
        }
        String maxTypeId = null;
        int maxCount = 0;
        Iterator<Entry<String, Integer>> iter = map.entrySet().iterator();
        while (iter.hasNext()) {
            Entry<String, Integer> entry = iter.next();
            if (maxCount < entry.getValue()) {
                maxCount = entry.getValue();
                maxTypeId = entry.getKey();
            }
        }
        return maxTypeId;
    }

    /** @Description: get the classifications of the nearest K samples */
    private KnnValueSort[] getKType(T value) {
        int k = 0;
        KnnValueSort[] topK = new KnnValueSort[K];
        for (KnnValueBean<T> bean : dataArray) {
            double score = similarScore(bean.getValue(), value);
            if (k == 0) {
                // no records in the array yet, add directly
                topK[k] = new KnnValueSort(bean.getTypeId(), score);
                k++;
            } else if (!(k == K && score < topK[k - 1].getScore())) {
                int i = 0;
                // locate the insertion point
                for (; i < k && score < topK[i].getScore(); i++);
                int j = k - 1;
                if (k < K) {
                    j = k;
                    k++;
                }
                for (; j > i; j--) {
                    topK[j] = topK[j - 1];
                }
                topK[i] = new KnnValueSort(bean.getTypeId(), score);
            }
        }
        return topK;
    }

    /** @Description: similarity between o1 and o2 */
    public abstract double similarScore(T o1, T o2);
}
9. Concrete subclass implementation
To use the abstract base class above for an actual problem, we need to subclass it and implement the abstract similarity method. Here we do a simple implementation.
/** @Description: KNN classification test */
package com.lulei.datamining.knn.test;

import com.lulei.datamining.knn.KnnClassification;
import com.lulei.util.JsonUtil;

public class Test extends KnnClassification<Integer> {

    @Override
    public double similarScore(Integer o1, Integer o2) {
        return -1 * Math.abs(o1 - o2);
    }

    /**
     * @param args
     * @Author: lulei
     * @Description: test the KNN classification
     */
    public static void main(String[] args) {
        Test test = new Test();
        for (int i = 1; i < 10; i++) {
            test.addRecord(i, i > 5 ? "0" : "1");
        }
        System.out.println(JsonUtil.parseJson(test.getTypeId(0)));
    }
}
Here we added the nine data points 1, 2, 3, 4, 5, 6, 7, 8, 9; the first five have the category "1" and the last four the category "0". The similarity between two data points is the negative of the absolute value of their difference. We then predict which category 0 should belong to. The default value of K is 3, so the 3 nearest samples are 1, 2, and 3, whose categories are "1", "1", and "1"; the final predicted category is therefore "1".
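As a quick follow-up sketch (an illustrative snippet using only the methods already defined above), K can be changed through the base class's setter before classifying another value:

// Continuing the test above: switch to 5 neighbors, then classify 10.
Test test = new Test();
for (int i = 1; i < 10; i++) {
    test.addRecord(i, i > 5 ? "0" : "1");
}
test.setK(5);
// The 5 samples nearest to 10 are 9, 8, 7, 6, 5, with the categories
// "0", "0", "0", "0", "1"; the majority vote therefore prints "0".
System.out.println(test.getTypeId(10));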
-------------------------------------------------------------------------------------------------
A small bonus
-------------------------------------------------------------------------------------------------
My course "Lucene Case Development" at Jikexueyuan (Geek College) is now online ~ you are all welcome to check it out and leave your comments ~
Lesson 1: An overview of Lucene
Lesson 2: Introduction to common Lucene functionality
Lesson 3: The web crawler
Lesson 4: The database connection pool
Lesson 5: Crawling the novel website
Lesson 6: Database operations for the novel website
Lesson 7: Implementing the novel website's distributed crawler
Lesson 8: Lucene real-time search