I have just been studying K-means. K-means is a very simple clustering algorithm, but it relies heavily on the k value the user supplies up front. It cannot find clusters of arbitrary shape and size, and it is best suited to finding spherical clusters. Its time complexity is O(tkn), where t is the number of iterations, k the number of clusters, and n the number of data points. The K-means algorithm has two core points: the formula used to calculate distance, and the condition for deciding when the iteration stops. Euclidean distance is the usual choice, but other distance measures can be used as well. The criteria for stopping the iteration can be:
1) The center point of every cluster no longer changes.
2) The sum of squared distances from every point to the center of its cluster (SSE) no longer changes.
3) A maximum number of iterations is set by hand, and the experimental result is observed.
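The three stopping criteria above can be sketched as small predicates. This is only an illustration for 2-D points: the class name, the method names, and the tolerance parameter eps are my own assumptions and are not part of the code further down, which compares SSE values for exact equality.

```java
import java.util.Arrays;
import java.util.List;

public class StopCriteriaSketch {

    /** Criterion 1: no center moved farther than eps between two iterations. */
    public static boolean centersUnchanged(List<float[]> oldCenters, List<float[]> newCenters, float eps) {
        for (int i = 0; i < oldCenters.size(); i++) {
            float dx = oldCenters.get(i)[0] - newCenters.get(i)[0];
            float dy = oldCenters.get(i)[1] - newCenters.get(i)[1];
            if (Math.sqrt(dx * dx + dy * dy) > eps) {
                return false;
            }
        }
        return true;
    }

    /** Criterion 2: the SSE changed by at most eps between two iterations. */
    public static boolean sseUnchanged(float previousSse, float currentSse, float eps) {
        return Math.abs(previousSse - currentSse) <= eps;
    }

    /** Criterion 3: the hand-picked iteration budget has been used up. */
    public static boolean iterationLimitReached(int iteration, int maxIterations) {
        return iteration >= maxIterations;
    }

    public static void main(String[] args) {
        List<float[]> before = Arrays.asList(new float[] { 1f, 2f }, new float[] { 5f, 6f });
        List<float[]> after = Arrays.asList(new float[] { 1f, 2f }, new float[] { 5f, 6.5f });
        // The second center moved by 0.5, so criterion 1 is not met yet.
        System.out.println(centersUnchanged(before, after, 1e-4f));
        System.out.println(sseUnchanged(12.5f, 12.5f, 1e-4f));
        System.out.println(iterationLimitReached(3, 100));
    }
}
```

In practice the criteria are often combined, for example stopping on criterion 1 or 2 with criterion 3 as a safety net.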
The clustering result can be very poor when the initial cluster centers are badly chosen.
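To make the effect of the initial centers concrete, here is a minimal sketch. It assumes 2-D points, a hand-made data set of four points on a wide rectangle, and initial centers passed in explicitly instead of chosen at random. The class and method names are hypothetical and are not part of the code further down.

```java
public class InitialCentersSketch {

    /** Runs plain K-means from the given initial centers and returns the final SSE. */
    public static float finalSse(float[][] points, float[][] initialCenters) {
        int k = initialCenters.length;
        float[][] centers = new float[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        int[] assignment = new int[points.length];
        float previousSse = -1f;
        while (true) {
            // Assignment step: each point goes to its nearest center.
            float sse = 0f;
            for (int p = 0; p < points.length; p++) {
                float best = Float.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    float dx = points[p][0] - centers[c][0];
                    float dy = points[p][1] - centers[c][1];
                    float d = dx * dx + dy * dy;
                    if (d < best) {
                        best = d;
                        assignment[p] = c;
                    }
                }
                sse += best;
            }
            // Stop as soon as the SSE no longer changes.
            if (sse == previousSse) {
                return sse;
            }
            previousSse = sse;
            // Update step: move each center to the mean of its points.
            for (int c = 0; c < k; c++) {
                float sumX = 0f;
                float sumY = 0f;
                int n = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        sumX += points[p][0];
                        sumY += points[p][1];
                        n++;
                    }
                }
                if (n != 0) {
                    centers[c][0] = sumX / n;
                    centers[c][1] = sumY / n;
                }
            }
        }
    }

    public static void main(String[] args) {
        // Two natural clusters: a left pair and a right pair.
        float[][] points = { { 0, 0 }, { 0, 1 }, { 10, 0 }, { 10, 1 } };
        // One initial center per natural cluster.
        float[][] goodStart = { { 0, 0 }, { 10, 0 } };
        // Both initial centers inside the left pair.
        float[][] badStart = { { 0, 0 }, { 0, 1 } };
        System.out.println("good start SSE = " + finalSse(points, goodStart)); // 1.0
        System.out.println("bad start SSE = " + finalSse(points, badStart)); // 100.0
    }
}
```

With the bad start the algorithm converges to a top/bottom split instead of the left/right one, and the SSE stays a hundred times larger. Every iteration is a valid K-means step; only the starting point differs.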
So someone later proposed bisecting K-means (BisectingKmeans). The core idea is: split the initial cluster in two, compute the sum of squared errors (SSE) of each resulting cluster, and bisect the cluster with the largest SSE. Stop when the number of clusters reaches k.
In essence, it keeps picking one cluster and splitting it with a k=2 K-means.
The smaller the SSE of a cluster, the closer its data points are to its centroid, and the better the clustering. So the cluster with the largest SSE is the one that needs to be split again. The larger the SSE, the worse that cluster is clustered, and the more likely it is that several clusters are being treated as one. So that is the cluster we split first.
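The selection rule described above (compute each cluster's SSE, then split the one with the largest value) can be sketched as follows. The data and the helper names centroid, sse, and indexOfLargestSse are made up for illustration; the code further down uses its own CommonUtil.countRule for the same calculation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SseSelectionSketch {

    /** Centroid (coordinate-wise mean) of a non-empty cluster of 2-D points. */
    public static float[] centroid(List<float[]> points) {
        float sumX = 0f;
        float sumY = 0f;
        for (float[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new float[] { sumX / points.size(), sumY / points.size() };
    }

    /** Sum of squared Euclidean distances from each point to the center. */
    public static float sse(List<float[]> points, float[] center) {
        float total = 0f;
        for (float[] p : points) {
            float dx = p[0] - center[0];
            float dy = p[1] - center[1];
            total += dx * dx + dy * dy;
        }
        return total;
    }

    /** Index of the cluster with the largest SSE, i.e. the next one to bisect. */
    public static int indexOfLargestSse(List<List<float[]>> clusters) {
        int bestIndex = 0;
        float bestSse = -1f;
        for (int i = 0; i < clusters.size(); i++) {
            float s = sse(clusters.get(i), centroid(clusters.get(i)));
            if (s > bestSse) {
                bestSse = s;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        List<float[]> tight = Arrays.asList(new float[] { 1, 1 }, new float[] { 1, 2 });
        List<float[]> spread = Arrays.asList(new float[] { 0, 0 }, new float[] { 4, 0 },
                new float[] { 0, 4 }, new float[] { 4, 4 });
        List<List<float[]>> clusters = new ArrayList<List<float[]>>();
        clusters.add(tight);
        clusters.add(spread);
        System.out.println("SSE of tight cluster = " + sse(tight, centroid(tight))); // 0.5
        System.out.println("SSE of spread cluster = " + sse(spread, centroid(spread))); // 32.0
        System.out.println("cluster to split = " + indexOfLargestSse(clusters)); // 1
    }
}
```

The spread-out cluster has by far the larger SSE, so it is the one a bisecting step would split next.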
Here is the code. The original Kmeans code comes from http://blog.csdn.net/cyxlzzs/article/details/7416491; I made a few changes.
package org.algorithm;

import java.util.ArrayList;
import java.util.List;

/**
 * Bisecting K-means. It is really a k=2 K-means split applied several times.
 * After each split, the cluster with the largest SSE is bisected again, until
 * the number of clusters reaches k.
 *
 * A K-means Java implementation written earlier by someone else is used as the
 * base class.
 *
 * @author l0979365428
 */
public class BisectingKmeans {

    private int k; // number of clusters to produce
    private List<float[]> dataSet; // the cluster currently waiting to be split
    private List<ClusterSet> cluster; // the resulting clusters

    public static void main(String[] args) {
        // Create a BisectingKmeans object with k = 5
        BisectingKmeans bkm = new BisectingKmeans(5);
        // Initialize the test set
        ArrayList<float[]> dataSet = new ArrayList<float[]>();
        dataSet.add(new float[] { 1, 2 });
        dataSet.add(new float[] { 3, 3 });
        dataSet.add(new float[] { 3, 4 });
        dataSet.add(new float[] { 5, 6 });
        dataSet.add(new float[] { 8, 9 });
        dataSet.add(new float[] { 4, 5 });
        dataSet.add(new float[] { 6, 4 });
        dataSet.add(new float[] { 3, 9 });
        dataSet.add(new float[] { 5, 9 });
        dataSet.add(new float[] { 4, 2 });
        dataSet.add(new float[] { 1, 9 });
        dataSet.add(new float[] { 7, 8 });
        // Set the raw data set
        bkm.setDataSet(dataSet);
        // Run the algorithm
        bkm.execute();
        // Get the clustering result
        // ArrayList<ArrayList<float[]>> cluster = bkm.getCluster();
        // View the result
        // for (int i = 0; i < cluster.size(); i++) {
        //     bkm.printDataArray(cluster.get(i), "cluster[" + i + "]");
        // }
    }

    public BisectingKmeans(int k) {
        // Splitting into fewer than 2 clusters is meaningless
        if (k < 2) {
            k = 2;
        }
        this.k = k;
    }

    /**
     * Set the original data set to be grouped
     *
     * @param dataSet
     */
    public void setDataSet(ArrayList<float[]> dataSet) {
        this.dataSet = dataSet;
    }

    /**
     * Run the algorithm
     */
    public void execute() {
        long startTime = System.currentTimeMillis();
        System.out.println("BisectingKmeans begins");
        bisectingKmeans();
        long endTime = System.currentTimeMillis();
        System.out.println("BisectingKmeans running time=" + (endTime - startTime) + "ms");
        System.out.println("BisectingKmeans ends");
        System.out.println();
    }

    /**
     * Initialize
     */
    private void init() {
        int dataSetLength = dataSet.size();
        if (k > dataSetLength) {
            k = dataSetLength;
        }
    }

    /**
     * Initialize the cluster collection
     *
     * @return a collection of k empty clusters
     */
    private ArrayList<ArrayList<float[]>> initCluster() {
        ArrayList<ArrayList<float[]>> cluster = new ArrayList<ArrayList<float[]>>();
        for (int i = 0; i < k; i++) {
            cluster.add(new ArrayList<float[]>());
        }
        return cluster;
    }

    /**
     * Core process of the bisecting K-means algorithm
     */
    private void bisectingKmeans() {
        init();
        // Create the result list before it is used (the original created it
        // only after the k < 2 branch had already written to it)
        cluster = new ArrayList<ClusterSet>();
        if (k < 2) {
            // Fewer than 2 clusters: the whole data set is treated as one cluster
            ClusterSet cs = new ClusterSet();
            cs.setClu(dataSet);
            cluster.add(cs);
        }
        // Call K-means to bisect
        while (cluster.size() < k) {
            List<ClusterSet> clu = kmeans(dataSet);
            for (ClusterSet cl : clu) {
                cluster.add(cl);
            }
            if (cluster.size() == k) {
                break;
            } else {
                // Compute the sum of squared errors of each cluster in turn
                float maxErro = 0f;
                int maxClusterSetIndex = 0;
                int i = 0;
                for (ClusterSet tt : cluster) {
                    // Compute the SSE and find the cluster with the largest SSE
                    float erroe = CommonUtil.countRule(tt.getClu(), tt.getCenter());
                    tt.setErro(erroe);
                    if (maxErro < erroe) {
                        maxErro = erroe;
                        maxClusterSetIndex = i;
                    }
                    i++;
                }
                // The cluster with the largest SSE becomes the next one to split
                dataSet = cluster.get(maxClusterSetIndex).getClu();
                cluster.remove(maxClusterSetIndex);
            }
        }
        int i = 0;
        for (ClusterSet sc : cluster) {
            CommonUtil.printDataArray(sc.getClu(), "cluster" + i);
            i++;
        }
    }

    /**
     * Call K-means to get two clusters.
     *
     * @param dataSet
     * @return
     */
    private List<ClusterSet> kmeans(List<float[]> dataSet) {
        Kmeans k = new Kmeans(2);
        // Set the raw data set
        k.setDataSet(dataSet);
        // Run the algorithm
        k.execute();
        // Get the clustering result
        List<List<float[]>> clus = k.getCluster();
        List<ClusterSet> clusterSet = new ArrayList<ClusterSet>();
        int i = 0;
        for (List<float[]> cl : clus) {
            ClusterSet cs = new ClusterSet();
            cs.setClu(cl);
            cs.setCenter(k.getCenter().get(i));
            clusterSet.add(cs);
            i++;
        }
        return clusterSet;
    }

    class ClusterSet {
        private float erro;
        private List<float[]> clu;
        private float[] center;

        public float getErro() {
            return erro;
        }

        public void setErro(float erro) {
            this.erro = erro;
        }

        public List<float[]> getClu() {
            return clu;
        }

        public void setClu(List<float[]> clu) {
            this.clu = clu;
        }

        public float[] getCenter() {
            return center;
        }

        public void setCenter(float[] center) {
            this.center = center;
        }
    }
}
package org.algorithm;

import java.util.List;

/**
 * The formulas for calculating distance and error, pulled out into one place
 *
 * @author l0979365428
 */
public class CommonUtil {

    /**
     * Calculates the distance between two points
     *
     * @param element point 1
     * @param center point 2
     * @return distance
     */
    public static float distance(float[] element, float[] center) {
        float distance = 0.0f;
        float x = element[0] - center[0];
        float y = element[1] - center[1];
        float z = x * x + y * y;
        distance = (float) Math.sqrt(z);
        return distance;
    }

    /**
     * Squared error between two points
     *
     * @param element point 1
     * @param center point 2
     * @return squared error
     */
    public static float errorSquare(float[] element, float[] center) {
        float x = element[0] - center[0];
        float y = element[1] - center[1];
        float errSquare = x * x + y * y;
        return errSquare;
    }

    /**
     * Criterion function: the sum of squared errors of one cluster
     */
    public static float countRule(List<float[]> cluster, float[] center) {
        float jcF = 0;
        for (int j = 0; j < cluster.size(); j++) {
            jcF += CommonUtil.errorSquare(cluster.get(j), center);
        }
        return jcF;
    }

    /**
     * Print data, for testing
     *
     * @param dataArray data set
     * @param dataArrayName data set name
     */
    public static void printDataArray(List<float[]> dataArray, String dataArrayName) {
        for (int i = 0; i < dataArray.size(); i++) {
            System.out.println("print:" + dataArrayName + "[" + i + "]={"
                    + dataArray.get(i)[0] + "," + dataArray.get(i)[1] + "}");
        }
        System.out.println("===================================");
    }
}
package org.algorithm;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * K-means clustering algorithm
 */
public class Kmeans {

    private int k; // number of clusters to produce
    private int m; // number of iterations
    private int dataSetLength; // number of elements in the data set
    private List<float[]> dataSet; // data set list
    private List<float[]> center; // list of centers
    private List<List<float[]>> cluster; // clusters
    private List<Float> jc; // SSE per iteration; the closer k is to dataSetLength, the smaller the error
    private Random random;

    public static void main(String[] args) {
        // Create a Kmeans object with k = 5
        Kmeans k = new Kmeans(5);
        // Initialize the test set
        ArrayList<float[]> dataSet = new ArrayList<float[]>();
        dataSet.add(new float[] { 1, 2 });
        dataSet.add(new float[] { 3, 3 });
        dataSet.add(new float[] { 3, 4 });
        dataSet.add(new float[] { 5, 6 });
        dataSet.add(new float[] { 8, 9 });
        dataSet.add(new float[] { 4, 5 });
        dataSet.add(new float[] { 6, 4 });
        dataSet.add(new float[] { 3, 9 });
        dataSet.add(new float[] { 5, 9 });
        dataSet.add(new float[] { 4, 2 });
        dataSet.add(new float[] { 1, 9 });
        dataSet.add(new float[] { 7, 8 });
        // Set the raw data set
        k.setDataSet(dataSet);
        // Run the algorithm
        k.execute();
        // Get the clustering result
        List<List<float[]>> cluster = k.getCluster();
        // View the result
        for (int i = 0; i < cluster.size(); i++) {
            CommonUtil.printDataArray(cluster.get(i), "cluster[" + i + "]");
        }
    }

    /**
     * Set the original data set to be grouped
     *
     * @param dataSet
     */
    public void setDataSet(List<float[]> dataSet) {
        this.dataSet = dataSet;
    }

    /**
     * Get the grouped result
     *
     * @return result set
     */
    public List<List<float[]>> getCluster() {
        return cluster;
    }

    /**
     * Constructor; takes the number of clusters to produce
     *
     * @param k number of clusters; if k <= 0 it is set to 1, and if k is larger
     *          than the length of the data source it is set to that length
     */
    public Kmeans(int k) {
        if (k <= 0) {
            k = 1;
        }
        this.k = k;
    }

    /**
     * Initialize
     */
    private void init() {
        m = 0;
        random = new Random();
        if (dataSet == null || dataSet.size() == 0) {
            initDataSet();
        }
        dataSetLength = dataSet.size();
        if (k > dataSetLength) {
            k = dataSetLength;
        }
        center = initCenters();
        cluster = initCluster();
        jc = new ArrayList<Float>();
    }

    /**
     * If the caller has not initialized the data set, use an internal test data set
     */
    private void initDataSet() {
        dataSet = new ArrayList<float[]>();
        // {6,3} appears twice, so splitting this 15-element data set into
        // 14 clusters or into 15 clusters both gives an error of 0
        float[][] dataSetArray = new float[][] { { 8, 2 }, { 3, 4 }, { 2, 5 },
                { 4, 2 }, { 7, 3 }, { 6, 2 }, { 4, 7 }, { 6, 3 }, { 5, 3 },
                { 6, 3 }, { 6, 9 }, { 1, 6 }, { 3, 9 }, { 4, 1 }, { 8, 6 } };
        for (int i = 0; i < dataSetArray.length; i++) {
            dataSet.add(dataSetArray[i]);
        }
    }

    /**
     * Initialize the list of centers; there are as many centers as clusters
     *
     * @return set of center points
     */
    private ArrayList<float[]> initCenters() {
        ArrayList<float[]> center = new ArrayList<float[]>();
        int[] randoms = new int[k];
        boolean flag;
        int temp = random.nextInt(dataSetLength);
        randoms[0] = temp;
        for (int i = 1; i < k; i++) {
            flag = true;
            while (flag) {
                temp = random.nextInt(dataSetLength);
                int j = 0;
                while (j < i) {
                    if (temp == randoms[j]) {
                        break;
                    }
                    j++;
                }
                if (j == i) {
                    flag = false;
                }
            }
            randoms[i] = temp;
        }
        for (int i = 0; i < k; i++) {
            center.add(dataSet.get(randoms[i])); // build the initial list of centers
        }
        return center;
    }

    /**
     * Initialize the cluster collection
     *
     * @return a collection of k empty clusters
     */
    private List<List<float[]>> initCluster() {
        List<List<float[]>> cluster = new ArrayList<List<float[]>>();
        for (int i = 0; i < k; i++) {
            cluster.add(new ArrayList<float[]>());
        }
        return cluster;
    }

    /**
     * Gets the position of the minimum distance in the distance array
     *
     * @param distance distance array
     * @return index of the minimum distance in the distance array
     */
    private int minDistance(float[] distance) {
        float minDistance = distance[0];
        int minLocation = 0;
        for (int i = 1; i < distance.length; i++) {
            if (distance[i] < minDistance) {
                minDistance = distance[i];
                minLocation = i;
            } else if (distance[i] == minDistance) {
                // On a tie, return one of the positions at random
                if (random.nextInt(10) < 5) {
                    minLocation = i;
                }
            }
        }
        return minLocation;
    }

    /**
     * Core: put each element into the cluster of its nearest center
     */
    private void clusterSet() {
        float[] distance = new float[k];
        for (int i = 0; i < dataSetLength; i++) {
            for (int j = 0; j < k; j++) {
                distance[j] = CommonUtil.distance(dataSet.get(i), center.get(j));
            }
            int minLocation = minDistance(distance);
            // Core: put the current element into the cluster of its nearest center
            cluster.get(minLocation).add(dataSet.get(i));
        }
    }

    /**
     * Criterion function: the sum of squared errors over all clusters
     */
    private void countRule() {
        float jcF = 0;
        for (int i = 0; i < cluster.size(); i++) {
            for (int j = 0; j < cluster.get(i).size(); j++) {
                jcF += CommonUtil.errorSquare(cluster.get(i).get(j), center.get(i));
            }
        }
        jc.add(jcF);
    }

    /**
     * Set the new cluster centers
     */
    private void setNewCenter() {
        for (int i = 0; i < k; i++) {
            int n = cluster.get(i).size();
            if (n != 0) {
                float[] newCenter = { 0, 0 };
                for (int j = 0; j < n; j++) {
                    newCenter[0] += cluster.get(i).get(j)[0];
                    newCenter[1] += cluster.get(i).get(j)[1];
                }
                // Take the average
                newCenter[0] = newCenter[0] / n;
                newCenter[1] = newCenter[1] / n;
                center.set(i, newCenter);
            }
        }
    }

    public List<float[]> getCenter() {
        return center;
    }

    public void setCenter(List<float[]> center) {
        this.center = center;
    }

    /**
     * Core process of the K-means algorithm
     */
    private void kmeans() {
        init();
        // Keep regrouping until the error no longer changes
        while (true) {
            clusterSet();
            countRule();
            if (m != 0) {
                if (jc.get(m) - jc.get(m - 1) == 0) {
                    break;
                }
            }
            setNewCenter();
            m++;
            cluster.clear();
            cluster = initCluster();
        }
    }

    /**
     * Run the algorithm
     */
    public void execute() {
        long startTime = System.currentTimeMillis();
        System.out.println("Kmeans begins");
        kmeans();
        long endTime = System.currentTimeMillis();
        System.out.println("Kmeans running time=" + (endTime - startTime) + "ms");
        System.out.println("Kmeans ends");
        System.out.println();
    }
}
Running the two clustering algorithms separately with k=5 gives results such as the following:
Kmeans:
print:cluster[0]={5.0,6.0}
print:cluster[1]={4.0,5.0}
print:cluster[2]={6.0,4.0}
===================================
print:cluster[0]={1.0,2.0}
print:cluster[1]={3.0,3.0}
print:cluster[2]={3.0,4.0}
print:cluster[3]={4.0,2.0}
===================================
print:cluster[0]={7.0,8.0}
===================================
print:cluster[0]={8.0,9.0}
===================================
print:cluster[0]={3.0,9.0}
print:cluster[1]={5.0,9.0}
print:cluster[2]={1.0,9.0}
===================================
BisectingKmeans:
print:cluster0[0]={8.0,9.0}
print:cluster0[1]={7.0,8.0}
===================================
print:cluster1[0]={3.0,4.0}
print:cluster1[1]={5.0,6.0}
print:cluster1[2]={4.0,5.0}
print:cluster1[3]={6.0,4.0}
===================================
print:cluster2[0]={1.0,2.0}
print:cluster2[1]={3.0,3.0}
print:cluster2[2]={4.0,2.0}
===================================
print:cluster3[0]={1.0,9.0}
===================================
print:cluster4[0]={3.0,9.0}
print:cluster4[1]={5.0,9.0}
===================================
Please correct me if I have misunderstood anything.
References:
http://blog.csdn.net/zouxy09/article/details/17590137
http://wenku.baidu.com/link?url=e6sXeX_ Txpmnnnyy8w28mp-hsd2lk8cqgbw-4esipqu95r-p4ke2qpehlhfbtoie6agplav6vtvwxlyg-jf_5byhj_ce93arqa6u9rn6xkk
"Machine Learning in Action"
Java implementation of bisecting K-means