K-means is a typical distance-based clustering algorithm: it uses distance as the measure of similarity, so the closer two objects are, the more similar they are considered to be. K-means treats a cluster as a group of objects that lie close together, and its ultimate goal is to produce compact, well-separated clusters.
K-means clustering requires the user to supply the number of clusters, k, as an input. The choice of the k initial cluster centers also has a large effect on the clustering result. To achieve high-quality clustering with K-means, k must be estimated well; one simple way is to derive it from the expected average cluster size.
For example, with 1 million articles and an average of 500 articles per category, k can be set to 2,000 (1,000,000 / 500).
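This estimate is just integer division; a minimal sketch (the class and method names here are illustrative, not from any library):

```java
public class EstimateK {
    /** k is roughly the total number of documents divided by the desired average cluster size. */
    public static int estimateK(int numDocuments, int avgDocsPerCluster) {
        return numDocuments / avgDocsPerCluster;
    }

    public static void main(String[] args) {
        // 1,000,000 articles, ~500 per category
        System.out.println(estimateK(1_000_000, 500));
    }
}
```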
Algorithm steps
1) Randomly select K objects as the initial cluster centers, each initially representing one cluster;
2) Calculate the distance from each point to every centroid and assign the point to the class of the nearest centroid;
3) Recalculate the centroid of each resulting class;
4) Repeat steps 2 and 3 until the distance between each new centroid and the previous one is at or below a specified threshold; the algorithm then terminates.
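The steps above can be sketched in plain Java. This is a minimal illustration under assumed simplifications (2-D points, Euclidean distance, random initialization), not the actual library implementation:

```java
import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    /** Runs the K-means loop and returns the final centroids. */
    static double[][] kmeans(double[][] points, int k, double threshold) {
        Random rnd = new Random(42);
        int dim = points[0].length;

        // Step 1: randomly select k objects as the initial cluster centers.
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++)
            centroids[i] = points[rnd.nextInt(points.length)].clone();

        int[] assignment = new int[points.length];
        double shift = Double.MAX_VALUE;

        while (shift > threshold) {
            // Step 2: assign each point to the class of the nearest centroid.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (distance(points[p], centroids[c]) < distance(points[p], centroids[best]))
                        best = c;
                assignment[p] = best;
            }

            // Step 3: recalculate each centroid as the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) sums[assignment[p]][d] += points[p][d];
            }

            // Step 4: measure how far the centroids moved; stop at or below the threshold.
            shift = 0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its old centroid
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) updated[d] = sums[c][d] / counts[c];
                shift = Math.max(shift, distance(centroids[c], updated));
                centroids[c] = updated;
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {1, 2}, {8, 8}, {9, 8}, {8.5, 9} };
        for (double[] c : kmeans(points, 2, 1e-6))
            System.out.println(Arrays.toString(c));
    }
}
```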
A typical example of EM
This two-step procedure is a typical instance of the expectation-maximization (EM) algorithm. The first step, expectation (E), uses the current parameter estimates to compute the expected values of the hidden variables (in K-means, the cluster assignments). The second step, maximization (M), re-estimates the parameters by maximizing the likelihood given the values computed in the E step. The parameter estimates found in the M step are then used in the next E step, and the two steps alternate until convergence.
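In standard EM notation (with observed data X, hidden variables Z, and parameters theta), the two alternating steps can be written as:

```latex
% E-step: expected complete-data log-likelihood under current parameters \theta^{(t)}
Q\bigl(\theta \mid \theta^{(t)}\bigr)
  = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\!\left[\log L(\theta;\, X, Z)\right]

% M-step: choose the new parameters that maximize that expectation
\theta^{(t+1)} = \operatorname*{arg\,max}_{\theta}\; Q\bigl(\theta \mid \theta^{(t)}\bigr)
```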
Run K-means Clustering
K-means clustering is implemented by the KMeansClusterer and KMeansDriver classes: the former clusters points in memory, while the latter runs the clustering as a MapReduce job.
Both approaches can read and write data on a local disk, just like a normal Java program, or perform the clustering on Hadoop by reading and writing data through a distributed file system. To test-run KMeansExample, first use a random-point generator to create some points in vector format; the points follow a normal distribution around a center. You can then cluster these points with the KMeansClusterer method.
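The point-generation step can be sketched in plain Java. This is a hypothetical generator (not the library's actual utility), using Random.nextGaussian for the normal distribution:

```java
import java.util.Random;

public class PointGenerator {
    /** Generates `num` 2-D points normally distributed around the given center. */
    public static double[][] generate(int num, double[] center, double stdDev, long seed) {
        Random rnd = new Random(seed);
        double[][] points = new double[num][2];
        for (int i = 0; i < num; i++) {
            // nextGaussian() samples from N(0, 1); scale and shift to N(center, stdDev^2)
            points[i][0] = center[0] + stdDev * rnd.nextGaussian();
            points[i][1] = center[1] + stdDev * rnd.nextGaussian();
        }
        return points;
    }

    public static void main(String[] args) {
        double[][] pts = generate(100, new double[] {3, 3}, 0.5, 7);
        System.out.println(pts.length + " points around (3, 3)");
    }
}
```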
Two kinds of output directories from K-means clustering
clusters-* directories: one is generated at the end of each iteration. The clusters-0 directory is produced after the 1st iteration, clusters-1 after the 2nd, and so on. Each contains information about the clusters, such as their centers and standard deviations.
clusteredPoints directory: contains the final mapping from cluster IDs to document IDs; it is generated from the output of the last MapReduce operation.
Why is K-means a distance-based clustering algorithm?