Mahout Series: K-means Clustering


K-means is one of the most classical clustering algorithms, and it is widely used for its simplicity, speed, and efficiency.

K-means Algorithm Description

Input: the number of clusters K and a dataset D containing N objects.

Output: a set of K clusters.

Method:

Arbitrarily select K objects from D as the initial cluster centers;

Repeat:

Assign each object to the most similar cluster, based on the mean value of the objects in each cluster;

Update the cluster means, that is, recompute the mean value of the objects in each cluster;

Compute the criterion function;

Until the criterion function no longer changes.
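The procedure above can be sketched in Python as follows. This is a minimal NumPy illustration, not Mahout's implementation; the function name, the random initialization, and the convergence check on the centers are my own choices.

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal K-means sketch: returns (centers, labels).

    D: (n, m) array of n objects with m attributes.
    k: number of clusters, supplied by the user in advance.
    """
    rng = np.random.default_rng(seed)
    # Arbitrarily select k objects from D as the initial cluster centers.
    centers = D[rng.choice(len(D), size=k, replace=False)]
    labels = np.zeros(len(D), dtype=int)
    for _ in range(max_iter):
        # Assign each object to the nearest (most similar) cluster center.
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each center to the mean of the objects assigned to it;
        # keep the old center if a cluster received no objects.
        new_centers = np.array(
            [D[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
             for j in range(k)])
        # Stop when the centers (and hence the criterion) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Because the criterion only ever decreases, the loop is guaranteed to terminate, though typically at a local optimum, as noted below.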

Advantages and disadvantages of the K-means algorithm:

1) Advantages

(1) The K-means algorithm is a classical algorithm for the clustering problem, and it is simple and fast.

(2) For large data sets, the algorithm is relatively scalable and efficient, because its complexity is roughly O(nkt), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Usually k << n. The algorithm often terminates at a local optimum.

(3) The algorithm tries to find the partition into K clusters that minimizes the squared-error function. When the clusters are dense and roughly spherical, and the differences between clusters are obvious, the clustering results are very good.

2) Disadvantages

(1) The K-means method can only be used when the mean of a cluster is defined; it does not apply to some applications, such as data involving categorical attributes.

(2) It requires the user to specify the number of clusters K in advance.

(3) It is sensitive to initial values; different initial values may lead to different clustering results.

(4) It is not suitable for discovering clusters with non-convex shapes, or clusters that differ greatly in size.

(5) for "noise" and isolated point data sensitivity, a small amount of such data can have a significant impact on the average.

To address these problems, several improvements to the K-means algorithm have been proposed: first, preprocessing the data; second, selecting the initial cluster centers; and third, selecting the cluster seeds during the iterative process.

First, the sample data is normalized, which prevents attributes with large value ranges from dominating the distance computation. Given a data set containing n samples, each with m attributes, compute the mean and standard deviation of each attribute and use them to standardize each sample.
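A minimal sketch of this per-attribute (z-score) standardization; the function name is illustrative:

```python
import numpy as np

def standardize(D):
    """Z-score each attribute: subtract the column mean and divide by the
    column standard deviation, so that no large-valued attribute dominates
    the distance computation."""
    mean = D.mean(axis=0)
    std = D.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant attributes
    return (D - mean) / std
```

After this step, every attribute contributes to the Euclidean distance on an equal footing.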

Second, the selection of the initial cluster centers has a great effect on the final clustering result. The original K-means algorithm chooses the K cluster centers randomly, so the result varies from run to run; the initial centers should therefore be chosen to be as far apart as possible. Based on definitions of distance and outliers, one approach screens out the isolated points first and then finds the initial cluster centers in the remaining data using maximum pairwise distances. For real data, however, the number of isolated points is often unpredictable, so the isolated points can instead be kept in the statistical range when selecting the initial centers: compute the pairwise distances between the objects in the sample, choose the two points with the largest distance as the centers of two different clusters, and then, from the remaining objects, repeatedly pick the point whose total distance to all already-selected centers is largest as another cluster center, until K centers have been selected. This reduces the effect of the sample input order on the initial center selection.
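The maximum-distance selection just described can be sketched as follows. This is an O(n²) illustration under my reading of the heuristic; the function name is my own.

```python
import numpy as np

def max_distance_centers(D, k):
    """Pick the two points with the largest pairwise distance as the first
    two centers; then repeatedly add the point whose summed distance to all
    chosen centers is largest, until k centers are selected."""
    # Full pairwise distance matrix, shape (n, n).
    dists = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)
    i, j = np.unravel_index(dists.argmax(), dists.shape)
    chosen = [i, j]
    while len(chosen) < k:
        remaining = [p for p in range(len(D)) if p not in chosen]
        # Point maximizing the sum of distances to all chosen centers.
        nxt = max(remaining, key=lambda p: dists[p, chosen].sum())
        chosen.append(nxt)
    return D[chosen]
```

Unlike random initialization, this selection is deterministic for a given data set, which is why it is insensitive to the sample input order.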

Once the cluster centers are selected, the iterative computation begins. The K-means algorithm uses the cluster mean (the geometric center of all the data in the class) as the new cluster seed for the next round, but this new seed may deviate from the truly dense region of the data, causing a bias; the limitation is especially severe when isolated points exist, because they are counted in the mean. Therefore, when computing the cluster seed, an algorithm should be used to exclude outliers from the mean computation. Here, each class's seed is computed from only a subset of its data: the data sufficiently similar to the current cluster seed. Concretely, for each class obtained in round k-1, compute the average distance S between the class center and all the data in the class; the data whose distance to the seed is within 2S forms a subset of the class, and the mean of this subset is used as the cluster seed for round k. This is equivalent to excluding the isolated points from the cluster-center computation, so the center is not pulled away from the dense region of the data, while the 2S threshold still allows most of the data to participate in the computation, whether or not a distinct isolated point exists.
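The thresholded seed update can be sketched per cluster as follows. The 2S cutoff comes from the text above; the function name is illustrative.

```python
import numpy as np

def robust_seed(cluster, center):
    """Update one cluster's seed while excluding outliers: compute the
    average distance S from the cluster's points to the current center,
    then return the mean of only the points within 2*S of the center."""
    d = np.linalg.norm(cluster - center, axis=1)
    S = d.mean()
    inliers = cluster[d <= 2 * S]
    return inliers.mean(axis=0)
```

Inside the main loop, this function would replace the plain `mean(axis=0)` update for each cluster, so that an isolated point far from the center no longer drags the seed toward it.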
