K-means clustering algorithm
K-means is a typical distance-based, exclusive partitioning method: given a data set of n objects, it constructs k partitions of the data, where each partition is a cluster and k ≤ n. The partitions must also satisfy two requirements: each cluster contains at least one object, and each object belongs to exactly one cluster.
The basic principle of K-means is as follows: given k, the number of partitions to be constructed,
first create an initial partitioning by randomly selecting k objects, each of which initially represents a cluster center. Each remaining object is assigned to the nearest cluster, according to its distance from each cluster center.
Then an iterative relocation technique is used to improve the partitioning by moving objects between clusters. The relocation technique works as follows: whenever a new object joins a cluster or an existing object leaves one, the mean of that cluster is recomputed, and the objects are then reassigned. The process repeats until no object changes cluster.
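To make this loop concrete, here is a minimal, self-contained sketch of the procedure in plain Java for two-dimensional points. It is illustrative only and does not use the Mahout API (all names here, such as SimpleKMeans, are hypothetical); the Mahout-based example follows in Listing 3.

import java.util.Arrays;
import java.util.Random;

// A minimal sketch of the K-means loop for 2-D points.
public class SimpleKMeans {

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {2, 1}, {1, 2}, {8, 8}, {9, 8}, {8, 9} };
        int k = 2, maxIter = 10;
        Random rnd = new Random(42);

        // Step 1: randomly pick k objects as the initial cluster centers.
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) {
            centers[i] = points[rnd.nextInt(points.length)].clone();
        }

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // Step 2: assign every point to its nearest cluster center.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist(points[p], centers[c]) < dist(points[p], centers[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            // Stop when no object changed cluster, i.e. the partition is stable.
            if (!changed) break;

            // Step 3: relocation -- recompute each center as the mean of its members.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0;
                int n = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) { sx += points[p][0]; sy += points[p][1]; n++; }
                }
                if (n > 0) { centers[c] = new double[] { sx / n, sy / n }; }
            }
        }
        System.out.println("assignment: " + Arrays.toString(assignment));
        System.out.println("centers:    " + Arrays.deepToString(centers));
    }

    // Euclidean distance between two 2-D points.
    private static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
}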
K-means works best when the resulting clusters are dense and the differences between clusters are clear. For processing large data sets, the algorithm is relatively scalable and efficient: its complexity is O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations, and usually k << n and t << n. Note, however, that the algorithm frequently terminates at a local optimum rather than the global one.
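To make the O(nkt) figure concrete with hypothetical numbers: a run over n = 1,000,000 points with k = 10 clusters and t = 20 iterations performs on the order of n × k × t = 2 × 10^8 point-to-center distance computations. Since each iteration computes these distances independently per point, the work also maps naturally onto Hadoop's map/reduce model, as the example below shows.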
The biggest problem with K-means is that the user must supply the number k in advance. The choice of K is generally based on empirical values and the results of repeated experiments; for a new data set, there is no reference K value to fall back on. In addition, K-means is sensitive to "noise" and outlier data: even a small amount of such data can have a significant impact on the mean.
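A trivial one-dimensional illustration of this sensitivity (hypothetical values, not the point set from Listing 1): a single outlier drags a cluster's mean far away from the dense region it is supposed to represent.

import java.util.Arrays;

public class OutlierEffect {
    public static void main(String[] args) {
        double[] cluster = {1.0, 2.0, 3.0, 2.0};            // a dense cluster around 2.0
        double[] withOutlier = {1.0, 2.0, 3.0, 2.0, 100.0}; // the same data plus one outlier
        // (1 + 2 + 3 + 2) / 4 = 2.0 -- the mean sits inside the cluster
        System.out.println(Arrays.stream(cluster).average().getAsDouble());
        // (1 + 2 + 3 + 2 + 100) / 5 = 21.6 -- one outlier moves the center
        // far outside the dense region
        System.out.println(Arrays.stream(withOutlier).average().getAsDouble());
    }
}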
Having covered the theory, here is a simple example of the K-means algorithm based on Mahout. As described earlier, Mahout provides both a memory-based implementation and a Hadoop-based map/reduce implementation, KMeansClusterer and KMeansDriver respectively. The simple example below uses the two-dimensional point set data defined in Listing 1.
Listing 3. K-means clustering algorithm example
// K-means clustering, memory-based implementation
public static void kMeansClusterInMemoryKMeans() {
    // Specify the number of clusters; here we choose 2
    int k = 2;
    // Specify the maximum number of iterations for the K-means algorithm
    int maxIter = 3;
    // Specify the maximum distance (convergence) threshold for the K-means algorithm
    double distanceThreshold = 0.01;
    // Declare a distance measure; here we choose Euclidean distance
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    // Build the vector set, using the two-dimensional point set from Listing 1
    List<Vector> pointVectors = SimpleDataSet.getPointVectors(SimpleDataSet.points);
    // Randomly select k vectors from the point set as the initial cluster centers
    List<Vector> randomPoints = RandomSeedGenerator.chooseRandomPoints(pointVectors, k);
    // Build the clusters from the centers selected above
    List<Cluster> clusters = new ArrayList<Cluster>();
    int clusterId = 0;
    for (Vector v : randomPoints) {
        clusters.add(new Cluster(v, clusterId++, measure));
    }
    // Call KMeansClusterer.clusterPoints to run K-means clustering
    List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(
        pointVectors, clusters, measure, maxIter, distanceThreshold);
    // Print the final clustering result
    for (Cluster cluster : finalClusters.get(finalClusters.size() - 1)) {
        System.out.println("Cluster id: " + cluster.getId()
            + " center: " + cluster.getCenter().asFormatString());
        System.out.println("       Points: " + cluster.getNumPoints());
    }
}

// K-means clustering, Hadoop map/reduce-based implementation
public static void kMeansClusterUsingMapReduce() throws Exception {
    // Declare a distance measure; here we choose Euclidean distance
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    // Specify the input and output paths; as described earlier, the Hadoop-based
    // implementation takes its data source from the input file path
    Path testpoints = new Path("testpoints");
    Path output = new Path("output");
    // Clear any existing data under the input and output paths
    HadoopUtil.overwriteOutput(testpoints);
    HadoopUtil.overwriteOutput(output);
    RandomUtils.useTestSeed();
    // Generate the point set under the input path; unlike the in-memory method,
    // all the vectors need to be written to a file first
    SimpleDataSet.writePointsToFile(testpoints);
    // Specify the number of clusters; here we choose 2
    int k = 2;
    // Specify the maximum number of iterations for the K-means algorithm
    int maxIter = 3;
    // Specify the maximum distance (convergence) threshold for the K-means algorithm
    double distanceThreshold = 0.01;
    // Randomly select k cluster centers
    Path clusters = RandomSeedGenerator.buildRandom(
        testpoints, new Path(output, "clusters-0"), k, measure);
    // Call KMeansDriver.runJob to run the K-means clustering job
    KMeansDriver.runJob(testpoints, clusters, output, measure,
        distanceThreshold, maxIter, 1, true, true);
    // Call ClusterDumper's printClusters method to print the clustering result
    ClusterDumper clusterDumper = new ClusterDumper(
        new Path(output, "clusters-" + (maxIter - 1)),
        new Path(output, "clusteredPoints"));
    clusterDumper.printClusters(null);
}

// SimpleDataSet's writePointsToFile method writes the test point set to a file.
// First, wrap each test point in a VectorWritable so it can be written to the file
public static List<VectorWritable> getPoints(double[][] raw) {
    List<VectorWritable> points = new ArrayList<VectorWritable>();
    for (int i = 0; i < raw.length; i++) {
        double[] fr = raw[i];
        Vector vec = new RandomAccessSparseVector(fr.length);
        vec.assign(fr);
        // Wrap the RandomAccessSparseVector in a VectorWritable before adding it
        points.add(new VectorWritable(vec));
    }
    return points;
}

// Write the VectorWritable point set to a file. This uses some basic Hadoop
// programming elements; see the related content in the reference resources
public static void writePointsToFile(Path output) throws IOException {
    // Call the previous method to generate the point set
    List<VectorWritable> pointVectors = getPoints(points);
    // Set up the basic Hadoop configuration
    Configuration conf = new Configuration();
    // Get the Hadoop file system object
    FileSystem fs = FileSystem.get(output.toUri(), conf);
    // Create a SequenceFile.Writer, which writes the vectors to a sequence file
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, output,
        Text.class, VectorWritable.class);
    try {
        for (VectorWritable vw : pointVectors) {
            writer.append(new Text(), vw);
        }
    } finally {
        writer.close();
    }
}

Execution results:

KMeans Clustering In Memory Result
Cluster id: 0
center: {"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],
 \"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
 \"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},
 \"size\":2,\"lengthSquared\":-1.0}"}
       Points: 5
Cluster id: 1
center: {"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],
 \"values\":[7.142857142857143,7.285714285714286,0.0],
 \"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
 \"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},
 \"size\":2,\"lengthSquared\":-1.0}"}
       Points: 7

KMeans Clustering Using Map/Reduce Result
Weight:  Point:
1.0: [1.000, 1.000]
1.0: [2.000, 1.000]
1.0: [1.000, 2.000]
1.0: [2.000, 2.000]
1.0: [3.000, 3.000]
Weight:  Point:
1.0: [8.000, 8.000]
1.0: [9.000, 8.000]
1.0: [8.000, 9.000]
1.0: [9.000, 9.000]
1.0: [5.000, 5.000]
1.0: [5.000, 6.000]
1.0: [6.000, 6.000]
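To try both implementations from Listing 3, one way (a sketch, assuming both methods live in the same class as in the sample code) is a simple main method:

public static void main(String[] args) throws Exception {
    // In-memory K-means: fast, but the whole point set must fit in memory
    kMeansClusterInMemoryKMeans();
    // Hadoop map/reduce K-means: scales to point sets that do not fit in memory
    kMeansClusterUsingMapReduce();
}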
Having introduced the K-means clustering algorithm, we can see that its greatest advantages are its simple principle and relatively simple implementation, along with good execution efficiency and scalability to large data volumes. Its drawbacks, however, are equally clear. First, it requires the user to fix the number of clusters before clustering begins, something the user cannot know in advance for most problems; generally an optimal K value has to be found through many experiments. Second, the algorithm tolerates noise and outliers poorly, because it initially selects the cluster centers at random. Here, noise means erroneous data among the clustered objects, while outliers are data points that lie far from, and bear little similarity to, the rest of the data. For K-means, once an outlier or a noise point is selected as a cluster center at the very start, it causes problems throughout the clustering process. If we could quickly determine how many clusters to choose and locate their centers, we could greatly improve the efficiency of the K-means algorithm. This brings us to another clustering method: the Canopy clustering algorithm.