Exploring the Secrets of the Recommendation Engine, Part 3: In-Depth Recommendation Engine-Related Algorithms - Clustering (II)


K-Means Clustering Algorithm

K-Means is a typical distance-based partitioning method: given a data set of n objects, it constructs a partition of the data into k clusters, where k <= n, subject to two requirements:

    • Each cluster contains at least one object.

    • Each object belongs to one and only one cluster.

The basic procedure of K-Means, given k, the number of clusters to construct, is:

    1. Create an initial partition: randomly select k objects, each of which initially represents a cluster center. Assign every other object to the nearest cluster according to its distance from each cluster center.

    2. Then use an iterative relocation technique to improve the partitioning by moving objects between clusters. Whenever a new object joins a cluster or an existing object leaves one, the cluster's mean is recalculated and objects are reassigned. This process repeats until no object changes its cluster.
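The two steps above can be sketched in plain Java. This is an illustrative, self-contained implementation for small in-memory data sets; the class and method names (SimpleKMeans, cluster, squaredDistance) are this sketch's own, not from Mahout or any other library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Minimal K-means sketch for n-dimensional points.
public class SimpleKMeans {

    // Runs K-means on the given points and returns the final cluster centers.
    public static double[][] cluster(double[][] points, int k, int maxIter, long seed) {
        // Step 1: randomly select k distinct objects as the initial cluster centers
        List<Integer> indices = new ArrayList<Integer>();
        for (int i = 0; i < points.length; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[indices.get(c)].clone();

        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // Assign each object to the nearest cluster center
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < k; c++) {
                    double d = squaredDistance(points[p], centers[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[p] != best) { assignment[p] = best; changed = true; }
            }
            // Step 2: relocate each center to the mean of its cluster
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int count = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assignment[p] == c) {
                        for (int d = 0; d < sum.length; d++) sum[d] += points[p][d];
                        count++;
                    }
                }
                if (count > 0) {
                    for (int d = 0; d < sum.length; d++) sum[d] /= count;
                    centers[c] = sum;
                }
            }
            // Repeat until no object changes its cluster
            if (!changed) break;
        }
        return centers;
    }

    // Squared Euclidean distance between two points
    static double squaredDistance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```

With two well-separated groups of points and k = 2, the centers converge to the means of the two groups after a few iterations, regardless of which distinct points are drawn as the initial centers.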

K-Means works best when the resulting clusters are dense and clearly separated from one another. For large data sets the algorithm is relatively scalable and efficient: its complexity is O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations; usually k << n and t << n. Note, however, that the algorithm often terminates at a local optimum.

The biggest problem with K-Means is that the user must specify the number k in advance. The choice of k is generally based on empirical values and repeated experiments, and for a new data set there is often no reference value to fall back on. In addition, K-Means is sensitive to "noise" and outlier data: even a small amount of such data can have a significant impact on the cluster means.
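To see concretely why outliers matter, consider the arithmetic mean that K-Means uses as a cluster center. The sketch below is illustrative only (not from the article's code); the sample values are made up:

```java
// Shows how a single outlier shifts a cluster mean, which is why
// K-means is sensitive to noise and outlier data.
public class OutlierEffect {

    // Arithmetic mean of a one-dimensional sample
    public static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        // A dense cluster of values around 1.5
        double[] dense = {1.0, 2.0, 1.0, 2.0, 1.5};
        // The same cluster plus one outlier
        double[] withOutlier = {1.0, 2.0, 1.0, 2.0, 1.5, 100.0};
        System.out.println(mean(dense));        // center stays inside the cluster
        System.out.println(mean(withOutlier));  // center dragged far outside the cluster
    }
}
```

One bad value moves the mean from the middle of the cluster to a point nowhere near any real data, and if such a point is ever chosen as an initial center, the damage propagates through every subsequent iteration.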

After all this theory, here is a simple K-Means example based on Mahout. As described earlier, Mahout provides both a basic memory-based implementation and a Hadoop-based Map/Reduce implementation, KMeansClusterer and KMeansDriver respectively. The example below uses the two-dimensional point set defined in Listing 1.

Listing 3. K-Means clustering algorithm example

    // Memory-based K-means clustering algorithm
    public static void kMeansClusterInMemoryKMeans() {
        // Specify the number of clusters: 2
        int k = 2;
        // Specify the maximum number of iterations of the K-means algorithm
        int maxIter = 3;
        // Specify the maximum distance threshold of the K-means algorithm
        double distanceThreshold = 0.01;
        // Declare the distance measure; Euclidean distance is chosen here
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        // Build the vector set from the two-dimensional points of Listing 1
        List<Vector> pointVectors = SimpleDataSet.getPointVectors(SimpleDataSet.points);
        // Randomly select k vectors from the point set as cluster centers
        List<Vector> randomPoints = RandomSeedGenerator.chooseRandomPoints(pointVectors, k);
        // Build the clusters from the previously selected centers
        List<Cluster> clusters = new ArrayList<Cluster>();
        int clusterId = 0;
        for (Vector v : randomPoints) {
            clusters.add(new Cluster(v, clusterId++, measure));
        }
        // Call KMeansClusterer.clusterPoints to perform K-means clustering
        List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(pointVectors,
                clusters, measure, maxIter, distanceThreshold);
        // Print the final clustering result
        for (Cluster cluster : finalClusters.get(finalClusters.size() - 1)) {
            System.out.println("Cluster id: " + cluster.getId()
                    + " center: " + cluster.getCenter().asFormatString());
            System.out.println("       Points: " + cluster.getNumPoints());
        }
    }

    // Hadoop-based K-means clustering algorithm implementation
    public static void kMeansClusterUsingMapReduce() throws Exception {
        // Declare the distance measure; Euclidean distance is chosen here
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        // Specify the input path; as described earlier, the Hadoop-based
        // implementation takes its data source from input and output file paths
        Path testPoints = new Path("testpoints");
        Path output = new Path("output");
        // Clear the input and output paths
        HadoopUtil.overwriteOutput(testPoints);
        HadoopUtil.overwriteOutput(output);
        RandomUtils.useTestSeed();
        // Generate the point set under the input path; unlike the in-memory
        // method, all vectors must first be written to a file (see below)
        SimpleDataSet.writePointsToFile(testPoints);
        // Specify the number of clusters: 2
        int k = 2;
        // Specify the maximum number of iterations of the K-means algorithm
        int maxIter = 3;
        // Specify the maximum distance threshold of the K-means algorithm
        double distanceThreshold = 0.01;
        // Randomly select k cluster centers
        Path clusters = RandomSeedGenerator.buildRandom(testPoints,
                new Path(output, "clusters-0"), k, measure);
        // Call KMeansDriver.runJob to run the K-means clustering algorithm
        KMeansDriver.runJob(testPoints, clusters, output, measure,
                distanceThreshold, maxIter, 1, true, true);
        // Call ClusterDumper's printClusters method to print the clustering result
        ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
                "clusters-" + (maxIter - 1)), new Path(output, "clusteredPoints"));
        clusterDumper.printClusters(null);
    }

    // SimpleDataSet's writePointsToFile method writes the test point set to a
    // file. First wrap the test points as VectorWritable so they can be written
    public static List<VectorWritable> getPoints(double[][] raw) {
        List<VectorWritable> points = new ArrayList<VectorWritable>();
        for (int i = 0; i < raw.length; i++) {
            double[] fr = raw[i];
            Vector vec = new RandomAccessSparseVector(fr.length);
            vec.assign(fr);
            // Before adding to the point set, wrap the
            // RandomAccessSparseVector in a VectorWritable
            points.add(new VectorWritable(vec));
        }
        return points;
    }

    // Write the VectorWritable point set to a file; the basic Hadoop
    // programming elements used here are covered in the reference resources
    public static void writePointsToFile(Path output) throws IOException {
        // Call the method above to generate the point set
        List<VectorWritable> pointVectors = getPoints(points);
        // Set up the basic Hadoop configuration
        Configuration conf = new Configuration();
        // Create the Hadoop file system object
        FileSystem fs = FileSystem.get(output.toUri(), conf);
        // Create a SequenceFile.Writer, which writes the vectors to a SequenceFile
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, output,
                Text.class, VectorWritable.class);
        try {
            for (VectorWritable vw : pointVectors) {
                writer.append(new Text(), vw);
            }
        } finally {
            writer.close();
        }
    }

Execution results

KMeans clustering in memory result:

    Cluster id: 0 center: {"class":"org.apache.mahout.math.RandomAccessSparseVector","vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],\"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 5
    Cluster id: 1 center: {"class":"org.apache.mahout.math.RandomAccessSparseVector","vector":"{\"values\":{\"table\":[0,1,0],\"values\":[7.142857142857143,7.285714285714286,0.0],\"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"}
           Points: 7

KMeans clustering using Map/Reduce result:

    Weight:  Point:
    1.0: [1.000, 1.000]
    1.0: [2.000, 1.000]
    1.0: [1.000, 2.000]
    1.0: [2.000, 2.000]
    1.0: [3.000, 3.000]
    Weight:  Point:
    1.0: [8.000, 8.000]
    1.0: [9.000, 8.000]
    1.0: [8.000, 9.000]
    1.0: [9.000, 9.000]
    1.0: [5.000, 5.000]
    1.0: [5.000, 6.000]
    1.0: [6.000, 6.000]

Having introduced the K-Means clustering algorithm, we can see that its greatest advantages are a simple principle and a relatively simple implementation, together with good execution efficiency and scalability to large data volumes. However, its drawbacks are equally clear. First, it requires the user to fix the number of clusters before clustering begins, which for most problems cannot be known in advance; an optimal k generally has to be found through many experiments. Second, the algorithm is not very tolerant of noise and outliers, because it initially selects the cluster centers at random. Noise refers to erroneous data among the clustered objects, while outliers are data points that lie far from, and bear little similarity to, the rest of the data. For K-Means, once an outlier or a noisy point is chosen as a cluster center at the very beginning, it causes problems throughout the entire clustering process. If we could quickly determine how many clusters to use and find good cluster centers, we could greatly improve the efficiency of the K-Means algorithm. This leads us to another clustering method: the Canopy clustering algorithm.
