Java implementation of bisecting K-means

Source: Internet
Author: User

I have just been studying K-means. K-means is a very simple clustering algorithm, but it depends heavily on the value of k that the user supplies up front. It cannot discover clusters of arbitrary shape and size and is best suited to finding spherical clusters. Its time complexity is O(tkn), where t is the number of iterations, k the number of clusters, and n the number of data points. The algorithm has two core points: the formula used to calculate distance, and the criterion for deciding when to stop iterating. The distance is usually Euclidean, but other distance measures can be used. The stopping criterion can be any of the following:

1) Stop iterating when the center point of each cluster no longer changes.

2) Stop iterating when the sum of squared distances from every point to the center of its cluster (the sum of squared errors, SSE) no longer changes.

3) Set a fixed number of iterations by hand and observe the experimental results.
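
The three criteria above can be sketched as simple checks. This is only a minimal sketch and not part of the code in this post: the class `StoppingCriteria`, its method names, and the tolerance `eps` are assumptions made for illustration (a tolerance is used here because comparing floating-point values for exact equality can be fragile).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper illustrating the three stopping criteria. */
final class StoppingCriteria {

    /** Squared Euclidean distance between two points of equal dimension. */
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        for (int d = 0; d < a.length; d++) {
            float diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Criterion 1: no center moved by more than eps since the previous iteration. */
    static boolean centersUnchanged(List<float[]> oldCenters, List<float[]> newCenters, float eps) {
        for (int i = 0; i < oldCenters.size(); i++) {
            if (squaredDistance(oldCenters.get(i), newCenters.get(i)) > eps * eps) {
                return false;
            }
        }
        return true;
    }

    /** Criterion 2: the SSE changed by no more than eps since the previous iteration. */
    static boolean sseUnchanged(float previousSse, float currentSse, float eps) {
        return Math.abs(previousSse - currentSse) <= eps;
    }

    /** Criterion 3: the hand-picked iteration budget has been used up. */
    static boolean iterationLimitReached(int iteration, int maxIterations) {
        return iteration >= maxIterations;
    }

    public static void main(String[] args) {
        List<float[]> before = Arrays.asList(new float[] {1f, 2f}, new float[] {5f, 6f});
        List<float[]> after = Arrays.asList(new float[] {1f, 2f}, new float[] {5f, 6.5f});
        // The second center moved by 0.5, so criterion 1 is not met yet.
        System.out.println(centersUnchanged(before, after, 1e-4f));
        System.out.println(sseUnchanged(12.5f, 12.5f, 1e-4f));
        System.out.println(iterationLimitReached(100, 100));
    }
}
```

In practice the criteria are often combined, for example stopping when the SSE stops changing or when the iteration limit is reached, whichever comes first.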


Clustering works poorly when the initial cluster centers are badly chosen. For this reason, bisecting K-means (BisectingKmeans) was proposed. The core idea is: split the initial cluster in two, compute the sum of squared errors of each cluster, and bisect the cluster whose value is largest. This repeats until the number of clusters reaches k. In essence, it repeatedly applies K-means with k=2 to the selected cluster.
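
The control flow of this idea can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation given later in this post: the class `BisectingSketch` and all of its method names are hypothetical, the two-way split is passed in as a function, and `splitAtMeanX` is a deliberately crude stand-in for K-means with k=2 (it only cuts at the mean of the first coordinate) so that the example stays short and deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch of the bisecting loop; the 2-way split is injected. */
final class BisectingSketch {

    /** Mean of the points, used as the cluster center. */
    static float[] centroid(List<float[]> points) {
        float[] c = new float[points.get(0).length];
        for (float[] p : points) {
            for (int d = 0; d < c.length; d++) {
                c[d] += p[d] / points.size();
            }
        }
        return c;
    }

    /** Sum of squared distances from each point to the centroid. */
    static float sse(List<float[]> points) {
        float[] c = centroid(points);
        float sum = 0f;
        for (float[] p : points) {
            for (int d = 0; d < c.length; d++) {
                sum += (p[d] - c[d]) * (p[d] - c[d]);
            }
        }
        return sum;
    }

    /** Stand-in for K-means with k=2: cut at the mean of the first coordinate. */
    static List<List<float[]>> splitAtMeanX(List<float[]> points) {
        float meanX = centroid(points)[0];
        List<float[]> left = new ArrayList<>();
        List<float[]> right = new ArrayList<>();
        for (float[] p : points) {
            (p[0] < meanX ? left : right).add(p);
        }
        List<List<float[]>> halves = new ArrayList<>();
        halves.add(left);
        halves.add(right);
        return halves;
    }

    /** Keep bisecting the cluster with the largest SSE until there are k clusters. */
    static List<List<float[]>> bisect(List<float[]> data, int k,
            Function<List<float[]>, List<List<float[]>>> splitInTwo) {
        List<List<float[]>> clusters = new ArrayList<>();
        clusters.add(data);
        while (clusters.size() < k) {
            int worst = -1;
            float worstSse = -1f;
            for (int i = 0; i < clusters.size(); i++) {
                if (clusters.get(i).size() < 2) {
                    continue; // a single point cannot be split
                }
                float e = sse(clusters.get(i));
                if (e > worstSse) {
                    worstSse = e;
                    worst = i;
                }
            }
            if (worst < 0) {
                break; // nothing left that can be split
            }
            List<float[]> target = clusters.remove(worst);
            List<List<float[]>> halves = splitInTwo.apply(target);
            if (halves.size() != 2 || halves.get(0).isEmpty() || halves.get(1).isEmpty()) {
                clusters.add(target); // the split failed, so stop instead of looping forever
                break;
            }
            clusters.addAll(halves);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<float[]> data = Arrays.asList(
                new float[] {0f, 0f}, new float[] {1f, 0f},
                new float[] {10f, 0f}, new float[] {11f, 0f});
        List<List<float[]>> result = bisect(data, 3, BisectingSketch::splitAtMeanX);
        System.out.println("clusters: " + result.size());
    }
}
```

Passing the split in as a function keeps the loop independent of how a cluster is actually divided; a real K-means run with k=2 could be plugged in instead of `splitAtMeanX`.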

A cluster's sum of squared errors measures clustering quality: the smaller the value, the closer the data points are to their centroid and the better the clustering. So we always split the cluster with the largest sum of squared errors, because a large value means that cluster is clustered poorly and may well be several clusters lumped into one. That is the cluster that needs to be partitioned first.
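
Written out, the quantity being compared looks like this. The notation is my own assumption for illustration ($C_j$ for the $j$-th cluster, $\mu_j$ for its centroid, $j^{*}$ for the cluster chosen for the next split), and the two small clusters in the worked example are made up.

```latex
\mathrm{SSE}(C_j) = \sum_{x \in C_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x \in C_j} x,
\qquad
j^{*} = \arg\max_j \, \mathrm{SSE}(C_j)

% Worked example with two made-up clusters:
% A = {(0,0), (2,0)}:   \mu_A = (1,0),  SSE(A) = 1 + 1 = 2
% B = {(0,0), (10,0)}:  \mu_B = (5,0),  SSE(B) = 25 + 25 = 50
% SSE(B) > SSE(A), so B is the cluster that gets bisected first.
```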


Here is the code. The original K-means code comes from http://blog.csdn.net/cyxlzzs/article/details/7416491; I made a few changes.


    package org.algorithm;

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Bisecting K-means. It repeatedly runs K-means with k=2 on one cluster at a
     * time: after each split, the cluster with the largest SSE is bisected next,
     * and the process stops once the number of clusters reaches k.
     *
     * Uses a K-means Java implementation previously written by someone else as
     * the base.
     *
     * @author l0979365428
     */
    public class BisectingKmeans {

        private int k; // number of clusters to produce
        private List<float[]> dataSet; // the cluster currently waiting to be split
        private List<ClusterSet> cluster; // resulting clusters

        public static void main(String[] args) {
            // Create a BisectingKmeans object with k = 5
            BisectingKmeans bkm = new BisectingKmeans(5);
            // Initialize the test set
            ArrayList<float[]> dataSet = new ArrayList<float[]>();
            dataSet.add(new float[] { 1, 2 });
            dataSet.add(new float[] { 3, 3 });
            dataSet.add(new float[] { 3, 4 });
            dataSet.add(new float[] { 5, 6 });
            dataSet.add(new float[] { 8, 9 });
            dataSet.add(new float[] { 4, 5 });
            dataSet.add(new float[] { 6, 4 });
            dataSet.add(new float[] { 3, 9 });
            dataSet.add(new float[] { 5, 9 });
            dataSet.add(new float[] { 4, 2 });
            dataSet.add(new float[] { 1, 9 });
            dataSet.add(new float[] { 7, 8 });
            // Set the original data set
            bkm.setDataSet(dataSet);
            // Run the algorithm; execute() prints the resulting clusters
            bkm.execute();
        }

        public BisectingKmeans(int k) {
            // Splitting into fewer than 2 clusters is meaningless
            if (k < 2) {
                k = 2;
            }
            this.k = k;
        }

        /**
         * Sets the original data set to be clustered.
         *
         * @param dataSet the data set
         */
        public void setDataSet(List<float[]> dataSet) {
            this.dataSet = dataSet;
        }

        /**
         * Runs the algorithm.
         */
        public void execute() {
            long startTime = System.currentTimeMillis();
            System.out.println("BisectingKmeans begins");
            bisectingKmeans();
            long endTime = System.currentTimeMillis();
            System.out.println("BisectingKmeans running time=" + (endTime - startTime) + "ms");
            System.out.println("BisectingKmeans ends");
            System.out.println();
        }

        /**
         * Initialization: k can never exceed the number of data points.
         */
        private void init() {
            int dataSetLength = dataSet.size();
            if (k > dataSetLength) {
                k = dataSetLength;
            }
        }

        /**
         * Initializes the cluster collection.
         *
         * @return a collection of k empty clusters
         */
        private ArrayList<ArrayList<float[]>> initCluster() {
            ArrayList<ArrayList<float[]>> cluster = new ArrayList<ArrayList<float[]>>();
            for (int i = 0; i < k; i++) {
                cluster.add(new ArrayList<float[]>());
            }
            return cluster;
        }

        /**
         * Core of the bisecting K-means algorithm.
         */
        private void bisectingKmeans() {
            init();
            // Create the result list first so that it is never null below
            cluster = new ArrayList<ClusterSet>();
            if (k < 2) {
                // Fewer than 2 clusters: treat the whole data set as a single cluster
                ClusterSet cs = new ClusterSet();
                cs.setClu(dataSet);
                cluster.add(cs);
            }
            // Call K-means (k=2) to bisect until there are k clusters
            while (cluster.size() < k) {
                List<ClusterSet> clu = kmeans(dataSet);
                for (ClusterSet cl : clu) {
                    cluster.add(cl);
                }
                if (cluster.size() == k) {
                    break;
                } else {
                    // Compute the SSE of each cluster and find the largest one
                    float maxErro = 0f;
                    int maxClusterSetIndex = 0;
                    int i = 0;
                    for (ClusterSet tt : cluster) {
                        float erroe = CommonUtil.countRule(tt.getClu(), tt.getCenter());
                        tt.setErro(erroe);
                        if (maxErro < erroe) {
                            maxErro = erroe;
                            maxClusterSetIndex = i;
                        }
                        i++;
                    }
                    // The cluster with the largest SSE is split in the next round
                    dataSet = cluster.get(maxClusterSetIndex).getClu();
                    cluster.remove(maxClusterSetIndex);
                }
            }
            int i = 0;
            for (ClusterSet sc : cluster) {
                CommonUtil.printDataArray(sc.getClu(), "cluster" + i);
                i++;
            }
        }

        /**
         * Calls K-means with k=2 to get two clusters.
         *
         * @param dataSet the cluster to split
         * @return the two resulting clusters with their centers
         */
        private List<ClusterSet> kmeans(List<float[]> dataSet) {
            Kmeans km = new Kmeans(2);
            // Set the original data set
            km.setDataSet(dataSet);
            // Run the algorithm
            km.execute();
            // Get the clustering result
            List<List<float[]>> clus = km.getCluster();
            List<ClusterSet> clusterSet = new ArrayList<ClusterSet>();
            int i = 0;
            for (List<float[]> cl : clus) {
                ClusterSet cs = new ClusterSet();
                cs.setClu(cl);
                cs.setCenter(km.getCenter().get(i));
                clusterSet.add(cs);
                i++;
            }
            return clusterSet;
        }

        /** One cluster together with its center and its SSE. */
        static class ClusterSet {
            private float erro;
            private List<float[]> clu;
            private float[] center;

            public float getErro() {
                return erro;
            }

            public void setErro(float erro) {
                this.erro = erro;
            }

            public List<float[]> getClu() {
                return clu;
            }

            public void setClu(List<float[]> clu) {
                this.clu = clu;
            }

            public float[] getCenter() {
                return center;
            }

            public void setCenter(float[] center) {
                this.center = center;
            }
        }
    }

    package org.algorithm;

    import java.util.List;

    /**
     * The formulas for calculating distances and errors, factored out.
     *
     * @author l0979365428
     */
    public class CommonUtil {

        /**
         * Calculates the Euclidean distance between two points.
         *
         * @param element point 1
         * @param center  point 2
         * @return the distance
         */
        public static float distance(float[] element, float[] center) {
            float x = element[0] - center[0];
            float y = element[1] - center[1];
            return (float) Math.sqrt(x * x + y * y);
        }

        /**
         * Calculates the squared error between two points.
         *
         * @param element point 1
         * @param center  point 2
         * @return the squared error
         */
        public static float errorSquare(float[] element, float[] center) {
            float x = element[0] - center[0];
            float y = element[1] - center[1];
            return x * x + y * y;
        }

        /**
         * Calculates the sum of squared errors (the criterion function) of one cluster.
         */
        public static float countRule(List<float[]> cluster, float[] center) {
            float jcF = 0;
            for (int j = 0; j < cluster.size(); j++) {
                jcF += errorSquare(cluster.get(j), center);
            }
            return jcF;
        }

        /**
         * Prints data, for testing.
         *
         * @param dataArray     the data set
         * @param dataArrayName the name of the data set
         */
        public static void printDataArray(List<float[]> dataArray, String dataArrayName) {
            for (int i = 0; i < dataArray.size(); i++) {
                System.out.println("print:" + dataArrayName + "[" + i + "]={"
                        + dataArray.get(i)[0] + "," + dataArray.get(i)[1] + "}");
            }
            System.out.println("===================================");
        }
    }

    package org.algorithm;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * K-means clustering algorithm
     */
    public class Kmeans {
        private int k; // number of clusters
        private int m; // number of iterations
        private int dataSetLength; // number of elements in the data set
        private List<float[]> dataSet; // the data set
        private List<float[]> center; // the cluster centers
        private List<List<float[]>> cluster; // the clusters
        private List<Float> jc; // SSE of each iteration; the closer k is to dataSetLength, the smaller the error
        private Random random;

        public static void main(String[] args) {
            // Create a Kmeans object with k = 5
            Kmeans k = new Kmeans(5);
            // Initialize the test set
            ArrayList<float[]> dataSet = new ArrayList<float[]>();
            dataSet.add(new float[] { 1, 2 });
            dataSet.add(new float[] { 3, 3 });
            dataSet.add(new float[] { 3, 4 });
            dataSet.add(new float[] { 5, 6 });
            dataSet.add(new float[] { 8, 9 });
            dataSet.add(new float[] { 4, 5 });
            dataSet.add(new float[] { 6, 4 });
            dataSet.add(new float[] { 3, 9 });
            dataSet.add(new float[] { 5, 9 });
            dataSet.add(new float[] { 4, 2 });
            dataSet.add(new float[] { 1, 9 });
            dataSet.add(new float[] { 7, 8 });
            // Set the original data set
            k.setDataSet(dataSet);
            // Run the algorithm
            k.execute();
            // Get the clustering result
            List<List<float[]>> cluster = k.getCluster();
            // View the result
            for (int i = 0; i < cluster.size(); i++) {
                CommonUtil.printDataArray(cluster.get(i), "cluster[" + i + "]");
            }
        }

        /**
         * Sets the original data set to be clustered.
         *
         * @param dataSet the data set
         */
        public void setDataSet(List<float[]> dataSet) {
            this.dataSet = dataSet;
        }

        /**
         * Gets the clustering result.
         *
         * @return the result set
         */
        public List<List<float[]>> getCluster() {
            return cluster;
        }

        /**
         * Constructor; takes the number of clusters to produce.
         *
         * @param k number of clusters; if k<=0 it is set to 1, and if k is larger
         *          than the data set it is later reduced to the data set length
         */
        public Kmeans(int k) {
            if (k <= 0) {
                k = 1;
            }
            this.k = k;
        }

        /**
         * Initialization
         */
        private void init() {
            m = 0;
            random = new Random();
            if (dataSet == null || dataSet.size() == 0) {
                initDataSet();
            }
            dataSetLength = dataSet.size();
            if (k > dataSetLength) {
                k = dataSetLength;
            }
            center = initCenters();
            cluster = initCluster();
            jc = new ArrayList<Float>();
        }

        /**
         * If the caller did not supply a data set, an internal test data set is used.
         */
        private void initDataSet() {
            dataSet = new ArrayList<float[]>();
            // {6,3} appears twice, so for this data set of length 15 the error is 0
            // both when it is divided into 14 clusters and into 15 clusters
            float[][] dataSetArray = new float[][] { { 8, 2 }, { 3, 4 }, { 2, 5 },
                    { 4, 2 }, { 7, 3 }, { 6, 2 }, { 4, 7 }, { 6, 3 }, { 5, 3 },
                    { 6, 3 }, { 6, 9 }, { 1, 6 }, { 3, 9 }, { 4, 1 }, { 8, 6 } };
            for (int i = 0; i < dataSetArray.length; i++) {
                dataSet.add(dataSetArray[i]);
            }
        }

        /**
         * Initializes the list of centers: one center per cluster, chosen as k
         * distinct random points of the data set.
         *
         * @return the center points
         */
        private ArrayList<float[]> initCenters() {
            ArrayList<float[]> center = new ArrayList<float[]>();
            int[] randoms = new int[k];
            boolean flag;
            int temp = random.nextInt(dataSetLength);
            randoms[0] = temp;
            for (int i = 1; i < k; i++) {
                flag = true;
                while (flag) {
                    temp = random.nextInt(dataSetLength);
                    int j = 0;
                    // Check whether this index has already been drawn
                    while (j < i) {
                        if (temp == randoms[j]) {
                            break;
                        }
                        j++;
                    }
                    if (j == i) {
                        flag = false;
                    }
                }
                randoms[i] = temp;
            }
            for (int i = 0; i < k; i++) {
                center.add(dataSet.get(randoms[i])); // build the initial center list
            }
            return center;
        }

        /**
         * Initializes the cluster collection.
         *
         * @return a collection of k empty clusters
         */
        private List<List<float[]>> initCluster() {
            List<List<float[]>> cluster = new ArrayList<List<float[]>>();
            for (int i = 0; i < k; i++) {
                cluster.add(new ArrayList<float[]>());
            }
            return cluster;
        }

        /**
         * Gets the position of the minimum distance in the distance array.
         *
         * @param distance the distance array
         * @return the index of the minimum distance in the array
         */
        private int minDistance(float[] distance) {
            float minDistance = distance[0];
            int minLocation = 0;
            for (int i = 1; i < distance.length; i++) {
                if (distance[i] < minDistance) {
                    minDistance = distance[i];
                    minLocation = i;
                } else if (distance[i] == minDistance) {
                    // If equal, pick one of the two positions at random
                    if (random.nextInt(10) < 5) {
                        minLocation = i;
                    }
                }
            }
            return minLocation;
        }

        /**
         * Core step: puts each element into the cluster whose center is nearest.
         */
        private void clusterSet() {
            float[] distance = new float[k];
            for (int i = 0; i < dataSetLength; i++) {
                for (int j = 0; j < k; j++) {
                    distance[j] = CommonUtil.distance(dataSet.get(i), center.get(j));
                }
                int minLocation = minDistance(distance);
                cluster.get(minLocation).add(dataSet.get(i));
            }
        }

        /**
         * Calculates the sum of squared errors (the criterion function) over all clusters.
         */
        private void countRule() {
            float jcF = 0;
            for (int i = 0; i < cluster.size(); i++) {
                for (int j = 0; j < cluster.get(i).size(); j++) {
                    jcF += CommonUtil.errorSquare(cluster.get(i).get(j), center.get(i));
                }
            }
            jc.add(jcF);
        }

        /**
         * Sets the new cluster centers.
         */
        private void setNewCenter() {
            for (int i = 0; i < k; i++) {
                int n = cluster.get(i).size();
                if (n != 0) {
                    float[] newCenter = { 0, 0 };
                    for (int j = 0; j < n; j++) {
                        newCenter[0] += cluster.get(i).get(j)[0];
                        newCenter[1] += cluster.get(i).get(j)[1];
                    }
                    // Take the average
                    newCenter[0] = newCenter[0] / n;
                    newCenter[1] = newCenter[1] / n;
                    center.set(i, newCenter);
                }
            }
        }

        public List<float[]> getCenter() {
            return center;
        }

        public void setCenter(List<float[]> center) {
            this.center = center;
        }

        /**
         * Core of the K-means algorithm.
         */
        private void kmeans() {
            init();
            // Keep regrouping until the error no longer changes
            while (true) {
                clusterSet();
                countRule();
                if (m != 0) {
                    if (jc.get(m) - jc.get(m - 1) == 0) {
                        break;
                    }
                }
                setNewCenter();
                m++;
                cluster.clear();
                cluster = initCluster();
            }
        }

        /**
         * Runs the algorithm.
         */
        public void execute() {
            long startTime = System.currentTimeMillis();
            System.out.println("Kmeans begins");
            kmeans();
            long endTime = System.currentTimeMillis();
            System.out.println("Kmeans running time=" + (endTime - startTime) + "ms");
            System.out.println("Kmeans ends");
            System.out.println();
        }
    }

Running both clustering algorithms with k=5 gives the following results:

Kmeans:

    print:cluster[0]={5.0,6.0}
    print:cluster[1]={4.0,5.0}
    print:cluster[2]={6.0,4.0}
    ===================================
    print:cluster[0]={1.0,2.0}
    print:cluster[1]={3.0,3.0}
    print:cluster[2]={3.0,4.0}
    print:cluster[3]={4.0,2.0}
    ===================================
    print:cluster[0]={7.0,8.0}
    ===================================
    print:cluster[0]={8.0,9.0}
    ===================================
    print:cluster[0]={3.0,9.0}
    print:cluster[1]={5.0,9.0}
    print:cluster[2]={1.0,9.0}
    ===================================

BisectingKmeans:

    print:cluster0[0]={8.0,9.0}
    print:cluster0[1]={7.0,8.0}
    ===================================
    print:cluster1[0]={3.0,4.0}
    print:cluster1[1]={5.0,6.0}
    print:cluster1[2]={4.0,5.0}
    print:cluster1[3]={6.0,4.0}
    ===================================
    print:cluster2[0]={1.0,2.0}
    print:cluster2[1]={3.0,3.0}
    print:cluster2[2]={4.0,2.0}
    ===================================
    print:cluster3[0]={1.0,9.0}
    ===================================
    print:cluster4[0]={3.0,9.0}
    print:cluster4[1]={5.0,9.0}
    ===================================

Please correct me if I have misunderstood anything.




Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
