This example follows chapter 6 of Mahout in Action.
The data file datafile/cluster/simple_k-means.txt contains the following points, one "x y" pair per line:
1 1
2 1
1 2
2 2
3 3
8 8
8 9
9 8
9 9
1. Principle of the k-means clustering algorithm
1. Randomly pick k elements from the data set D as the initial centers of the k clusters.
2. Compute the dissimilarity between each remaining element and each of the k cluster centers, and assign each element to the cluster whose center it is least dissimilar to.
3. Based on the clustering result, recompute the center of each of the k clusters as the arithmetic mean, dimension by dimension, of all elements in that cluster.
4. Re-cluster all elements in D according to the new centers.
5. Repeat steps 3 and 4 until the clustering result no longer changes.
6. Output the result.
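The six steps above can be sketched in a few lines of Java. This is a minimal illustration, not the implementation from section 3: the class name KMeansSketch, its method names, and the maxIterations guard are assumptions made for this example, and step 1 is simplified to taking the first k points instead of drawing them at random.

```java
import java.util.Arrays;

// Hypothetical illustration class; the names here are not from the original listing.
class KMeansSketch {

    /** Euclidean distance between two 2-D points stored as {x, y}. */
    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /** Steps 2 and 4: index of the center closest to p (ties go to the lower index). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(p, centers[c]) < distance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs k-means and returns the final centers. Assumes points.length >= k. */
    static double[][] cluster(double[][] points, int k, int maxIterations) {
        // Step 1 (simplified): use the first k points as the initial centers.
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Steps 2 and 4: assign every point to its nearest center.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centers);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            // Step 5: stop once no point changes cluster.
            if (!changed) {
                break;
            }
            // Step 3: recompute each center as the per-dimension arithmetic mean.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its previous center
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {
            {1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3}, {8, 8}, {8, 9}, {9, 8}, {9, 9}
        };
        double[][] centers = cluster(points, 2, 100);
        // Step 6: output the result.
        for (int c = 0; c < centers.length; c++) {
            System.out.println("C" + c + ": " + centers[c][0] + "," + centers[c][1]);
        }
    }
}
```

On the nine sample points with k = 2, this sketch settles at centers of roughly (1.8, 1.8) and (8.5, 8.5), the plain means of the two groups.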
2. Worked example
2.1 Randomly take k elements from D as the initial centers of the k clusters. The code sets private final static Integer K = 2; that is, two clusters are assumed.
Here the two points 1 1 and 2 1 are chosen:
C0:1 1
C1:2 1
2.2 Compute the dissimilarity between each remaining element and each of the k cluster centers, and assign each element to the cluster whose center it is least dissimilar to. The result is:
C0:1 1
C0: The point is: 1.0,2.0
C1:2 1
C1: The point is: 2.0,2.0
C1: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
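The assignment in 2.2 can be checked with a small sketch. It assumes Euclidean distance as the dissimilarity measure; the class name AssignmentSketch and its method names are hypothetical and used only for this illustration.

```java
// Hypothetical illustration class; not part of the original listing.
class AssignmentSketch {

    /** Euclidean distance between two 2-D points stored as {x, y}. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    /** Returns the index of the center with the smallest distance to p. */
    static int nearestCenter(double[] p, double[][] centers) {
        int index = -1;
        double nearest = Double.MAX_VALUE;
        for (int k = 0; k < centers.length; k++) {
            double current = distance(p, centers[k]);
            if (current < nearest) {
                nearest = current;
                index = k;
            }
        }
        return index;
    }

    public static void main(String[] args) {
        double[][] centers = { {1, 1}, {2, 1} };  // C0 and C1 from 2.1
        double[][] remaining = { {1, 2}, {2, 2}, {3, 3}, {8, 8}, {8, 9}, {9, 8}, {9, 9} };
        for (double[] p : remaining) {
            System.out.println("C" + nearestCenter(p, centers)
                    + ": The point is: " + p[0] + "," + p[1]);
        }
    }
}
```

For example, the point 1 2 lies at distance 1 from C0 and about 1.414 from C1, so it goes to C0, while 2 2 lies at distance 1 from C1 and goes there.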
2.3 Based on the clustering result of 2.2, recompute the center of each of the k clusters as the arithmetic mean, dimension by dimension, of all elements in that cluster.
Distances are measured with the Euclidean distance formula.
The new center of C0 is: 1.0,1.5
The new center of C1 is: 5.857142857142857,5.714285714285714
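The two centers above follow directly from the per-dimension arithmetic mean. A minimal sketch, with the hypothetical class name CentroidSketch, using the cluster membership from 2.2:

```java
// Hypothetical illustration class; not part of the original listing.
class CentroidSketch {

    /** Arithmetic mean of each dimension over all points of one cluster. */
    static double[] centroid(double[][] cluster) {
        double sumX = 0.0;
        double sumY = 0.0;
        for (double[] p : cluster) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] { sumX / cluster.length, sumY / cluster.length };
    }

    public static void main(String[] args) {
        // Cluster membership after the first assignment pass (section 2.2).
        double[][] c0 = { {1, 1}, {1, 2} };
        double[][] c1 = { {2, 1}, {2, 2}, {3, 3}, {8, 8}, {8, 9}, {9, 8}, {9, 9} };
        double[] m0 = centroid(c0);
        double[] m1 = centroid(c1);
        System.out.println("C0: " + m0[0] + "," + m0[1]);
        System.out.println("C1: " + m1[0] + "," + m1[1]);
    }
}
```

For C0 this is (1 + 1) / 2 = 1.0 and (1 + 2) / 2 = 1.5; for C1 it is 41 / 7 and 40 / 7.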
2.4 Re-cluster all elements in D according to the new centers.
Iteration 2
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
2.5 Repeat steps 3 and 4 until the clustering result no longer changes. In this implementation, clustering is considered finished once the distance moved by the cluster centers falls below a threshold, and no further iteration is needed. The threshold here is 0.001: private final static Double converge = 0.001;
------------------------------------------------
C0 cluster center is: 1.6666666666666667,1.75
C1 cluster center is: 7.971428571428572,7.942857142857143
The minimum distance moved by the cluster centers is: move=0.7120003121097943
Iteration 3
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.777777777777778,1.7916666666666667
C1 cluster center is: 8.394285714285715,8.388571428571428
The minimum distance moved by the cluster centers is: move=0.11866671868496578
Iteration 4
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.7962962962962965,1.7986111111111114
C1 cluster center is: 8.478857142857143,8.477714285714285
The minimum distance moved by the cluster centers is: move=0.019777786447494432
Iteration 5
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.799382716049383,1.7997685185185184
C1 cluster center is: 8.495771428571429,8.495542857142857
The minimum distance moved by the cluster centers is: move=0.003296297741248916
Iteration 6
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.7998971193415638,1.7999614197530864
C1 cluster center is: 8.499154285714287,8.499108571428572
The minimum distance moved by the cluster centers is: move=5.49382956874724E-4
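The stopping test in 2.5 compares the smallest distance moved by any cluster center with the 0.001 threshold. The sketch below reproduces that check from the printed centers. The class name ConvergenceSketch and its methods are hypothetical. The maxMove variant is not in the original and is included only for comparison, as a stricter criterion that waits for every center to settle.

```java
// Hypothetical illustration class; not part of the original listing.
class ConvergenceSketch {

    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /** Smallest distance moved by any center (the criterion described in 2.5). */
    static double minMove(double[][] oldCenters, double[][] newCenters) {
        double min = Double.MAX_VALUE;
        for (int i = 0; i < oldCenters.length; i++) {
            min = Math.min(min, distance(oldCenters[i], newCenters[i]));
        }
        return min;
    }

    /** Largest distance moved by any center (stricter variant, an assumption of this sketch). */
    static double maxMove(double[][] oldCenters, double[][] newCenters) {
        double max = 0.0;
        for (int i = 0; i < oldCenters.length; i++) {
            max = Math.max(max, distance(oldCenters[i], newCenters[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        double converge = 0.001;
        // Centers printed after iterations 5 and 6.
        double[][] after5 = { {1.799382716049383, 1.7997685185185184},
                              {8.495771428571429, 8.495542857142857} };
        double[][] after6 = { {1.7998971193415638, 1.7999614197530864},
                              {8.499154285714287, 8.499108571428572} };
        double move = minMove(after5, after6);
        System.out.println("min move = " + move + ", converged: " + (move <= converge));
        System.out.println("max move = " + maxMove(after5, after6));
    }
}
```

With these numbers the smallest movement (C0) is about 0.00055, below the threshold, so the loop stops after iteration 6. The largest movement (C1) is still about 0.0049, so the stricter variant would keep iterating.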
3. Java implementation
package mysequence.machineleaning.clustering.kmeans;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

import mysequence.machineleaning.clustering.canopy.Point;

public class MyKmeans {

    // All points read from the data file.
    static Vector<Point> li = new Vector<Point>();
    // Result of each iteration; one Vector represents one cluster.
    static List<Vector<Point>> list = new ArrayList<Vector<Point>>();
    // K = 2, i.e. two clusters are assumed.
    private final static Integer K = 2;
    // Once the move distance falls below this value, clustering is considered done.
    private final static Double converge = 0.001;

    // Read the data: one "x y" pair per line.
    public static final void readF1() throws IOException {
        String filePath = "datafile/cluster/simple_k-means.txt";
        BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath)));
        for (String line = br.readLine(); line != null; line = br.readLine()) {
            if (line.trim().isEmpty()) continue;
            String[] str = line.trim().split("\\s+");
            Point p0 = new Point();
            p0.setX(Double.valueOf(str[0]));
            p0.setY(Double.valueOf(str[1]));
            li.add(p0);
            System.out.println(line);
        }
        br.close();
    }

    // Euclidean distance, using Math.sqrt(double n).
    // Aside: to take the n-th root of m, use java.lang.StrictMath.pow(m, 1.0 / n).
    public static double distanceMeasure(Point p1, Point p2) {
        double tmp = StrictMath.pow(p2.getX() - p1.getX(), 2)
                + StrictMath.pow(p2.getY() - p1.getY(), 2);
        return Math.sqrt(tmp);
    }

    // Compute the new cluster centers; returns the smallest distance moved by any center.
    public static Double calCentroid() {
        System.out.println("------------------------------------------------");
        Double movedist = Double.MAX_VALUE;
        for (int i = 0; i < list.size(); i++) {
            Vector<Point> subli = list.get(i);
            Point po = new Point();
            double sumX = 0.0;
            double sumY = 0.0;
            double clusterlen = subli.size();
            for (int j = 0; j < subli.size(); j++) {
                Point nextp = subli.get(j);
                sumX = sumX + nextp.getX();
                sumY = sumY + nextp.getY();
            }
            po.setX(sumX / clusterlen);
            po.setY(sumY / clusterlen);
            // Distance between the new center and the old one (element 0).
            Double dist = distanceMeasure(subli.get(0), po);
            // Of all the centers that moved, keep the smallest move distance.
            if (dist < movedist) movedist = dist;
            // Keep only the new center, as element 0. Points assigned in the next
            // pass are appended after it, so the next average also includes it.
            list.get(i).clear();
            list.get(i).add(po);
            System.out.println("C" + i + " cluster center is: " + po.getX() + "," + po.getY());
        }
        return movedist;
    }

    // Distance moved by the cluster centers in the latest iteration.
    private static Double move = Double.MAX_VALUE;

    // Keep iterating until the move distance drops to the convergence threshold.
    public static void recursionKluster() {
        for (int times = 2; move > converge; times++) {
            System.out.println("Iteration " + times);
            // By convention, element 0 of each vector in list is the cluster center.
            for (int i = 0; i < li.size(); i++) {
                assignToNearest(li.get(i));
            }
            // Recompute the centers and get the smallest move distance.
            move = calCentroid();
            System.out.println("The minimum distance moved by the cluster centers is: move=" + move);
        }
    }

    // Add p to the cluster whose center (element 0) is nearest.
    private static void assignToNearest(Point p) {
        int index = -1;
        Double neardist = Double.MAX_VALUE;
        for (int k = 0; k < K; k++) {
            Point centre = list.get(k).get(0);
            Double currentdist = distanceMeasure(p, centre);
            if (currentdist < neardist) {
                neardist = currentdist;
                index = k;
            }
        }
        System.out.println("C" + index + ": The point is: " + p.getX() + "," + p.getY());
        list.get(index).add(p);
    }

    // First iteration: the first K points become the initial centers.
    public static void kluster() {
        for (int k = 0; k < K; k++) {
            Vector<Point> vect = new Vector<Point>();
            vect.add(li.get(k));
            list.add(vect);
        }
        System.out.println("Iteration 1");
        for (int i = K; i < li.size(); i++) {
            assignToNearest(li.get(i));
        }
    }

    public static void main(String[] args) throws IOException {
        // Read the data.
        readF1();
        // First iteration.
        kluster();
        // Compute the cluster centers after the first iteration.
        calCentroid();
        // Keep iterating until convergence.
        recursionKluster();
    }
}
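The listing imports Point from the canopy package, which is not shown in this post. To compile the example on its own, a stand-in along the following lines should be enough. This is an assumption about that class, inferred only from the getters and setters the listing calls; the real class may contain more.

```java
// Hypothetical stand-in for mysequence.machineleaning.clustering.canopy.Point.
// Only the four accessors used by the k-means listing are provided. In a real
// project this file would also declare the matching package.
class Point {

    private double x;
    private double y;

    public double getX() {
        return x;
    }

    public void setX(double x) {
        this.x = x;
    }

    public double getY() {
        return y;
    }

    public void setY(double y) {
        this.y = y;
    }

    @Override
    public String toString() {
        return x + "," + y;
    }
}
```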
4. Execution result
C0:1 1
C1:2 1
Iteration 1
C0: The point is: 1.0,2.0
C1: The point is: 2.0,2.0
C1: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.0,1.5
C1 cluster center is: 5.857142857142857,5.714285714285714
Iteration 2
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.6666666666666667,1.75
C1 cluster center is: 7.971428571428572,7.942857142857143
The minimum distance moved by the cluster centers is: move=0.7120003121097943
Iteration 3
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.777777777777778,1.7916666666666667
C1 cluster center is: 8.394285714285715,8.388571428571428
The minimum distance moved by the cluster centers is: move=0.11866671868496578
Iteration 4
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.7962962962962965,1.7986111111111114
C1 cluster center is: 8.478857142857143,8.477714285714285
The minimum distance moved by the cluster centers is: move=0.019777786447494432
Iteration 5
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.799382716049383,1.7997685185185184
C1 cluster center is: 8.495771428571429,8.495542857142857
The minimum distance moved by the cluster centers is: move=0.003296297741248916
Iteration 6
C0: The point is: 1.0,1.0
C0: The point is: 2.0,1.0
C0: The point is: 1.0,2.0
C0: The point is: 2.0,2.0
C0: The point is: 3.0,3.0
C1: The point is: 8.0,8.0
C1: The point is: 8.0,9.0
C1: The point is: 9.0,8.0
C1: The point is: 9.0,9.0
------------------------------------------------
C0 cluster center is: 1.7998971193415638,1.7999614197530864
C1 cluster center is: 8.499154285714287,8.499108571428572
The minimum distance moved by the cluster centers is: move=5.49382956874724E-4
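From iteration 2 onward the point assignments no longer change, yet the centers keep moving. One reading of the listing in section 3 is that the centroid routine leaves the new center in the cluster as element 0, so the next average covers the five points of C0 plus the previous center. The sketch below assumes that reading is right and reproduces the printed C0 values. The class name CenterDriftSketch is hypothetical.

```java
// Hypothetical illustration class; not part of the original listing.
class CenterDriftSketch {

    /** One update when the previous center is averaged together with the cluster's points. */
    static double next(double previousCenter, double sumOfPoints, int pointCount) {
        return (sumOfPoints + previousCenter) / (pointCount + 1);
    }

    public static void main(String[] args) {
        // C0 holds the points 1 1, 2 1, 1 2, 2 2 and 3 3: both coordinate sums are 9.
        double sumX = 9.0;
        double sumY = 9.0;
        int count = 5;
        // Center of C0 after iteration 1.
        double x = 1.0;
        double y = 1.5;
        for (int iteration = 2; iteration <= 6; iteration++) {
            x = next(x, sumX, count);
            y = next(y, sumY, count);
            System.out.println("Iteration " + iteration + ": C0 center " + x + "," + y);
        }
    }
}
```

Under this recurrence the center approaches the plain mean of the five points, (1.8, 1.8), closing five sixths of the remaining gap at each step. That would explain why the run needs six iterations to fall below the 0.001 threshold.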