When we were talking about Kmeans (5)



This series will be shared over the long term, and its content may be revised or deleted.
If you repost it, please be sure to keep the source address. Thank you very much!
Blog Park: http://www.cnblogs.com/data-miner/ (formulas temporarily do not display correctly)
Others: under construction ...

When we're talking about Kmeans: a summary overview

From the K-means papers I have read so far, a rough outline of how the K-means algorithm has developed can be traced. I have read only part of the literature, so there are certainly other directions not listed here; additions are welcome (please give specific examples).

    1. The K-means algorithm is proposed.
    2. Papers analyzing the properties of the K-means algorithm are published one after another.
    3. The idea behind the K-means algorithm is extended:
      • The "Maximum Entropy" algorithm is proposed, and K-means is shown to be a special case of it
      • Later, the "Mean Shift" algorithm is proposed, and "Maximum Entropy" is in turn said to be a special case of it
    4. K-means is modified to address its shortcomings (typically for one specific scenario only):
      • Online K-means
      • K-means for non-convex data sets
      • K-means implemented on FPGAs
      • K-means with automatic feature weighting
      • Intelligent K-means, which borrows ideas from anomaly detection for clustering
    5. The K-means algorithm is optimized:
      • K-means accelerated with KD trees
      • K-means accelerated with SVD
      • k-means++, an algorithm for choosing the initial cluster centers
    6. K-means is combined with newer ideas:
      • Ensemble methods combined with K-means

Existing problems with K-means

K-means is widely used in data preprocessing, data analysis, and similar tasks because it is simple and effective. In practice, however, a number of problems with the algorithm have gradually become apparent, such as:

    1. High computational cost
    2. The number of clusters K must be set in advance, and it affects the clustering result
    3. The cluster centers must be initialized manually, and the initialization affects the clustering result
    4. Outliers can distort the clustering result
    5. The algorithm can only converge to a local optimum

Researchers have analyzed each of these problems and tried to propose solutions:

    1. High computational cost
      • K-means accelerated with KD trees
    2. The number of clusters K must be set in advance, and it affects the clustering result
      • Various methods for estimating K
    3. The cluster centers must be initialized manually, and the initialization affects the clustering result
      • The k-means++ method
      • Other methods of initializing the cluster centers
    4. Outliers can distort the clustering result
      • Data preprocessing
    5. The algorithm can only converge to a local optimum
      • Unknown
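
Problems 3 and 5 can be seen on a very small example. The sketch below is not taken from any of the papers; it is a plain implementation of the standard K-means iteration (assign each point to its nearest center, then move each center to the mean of its points), and the four-point data set and the two sets of starting centers are made up for illustration. Both runs converge, but to different solutions with very different sums of squared errors.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def lloyd(points, centers, max_iter=100):
    """Run plain K-means from the given initial centers.

    Returns the final centers and the sum of squared errors (SSE).
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Update step: an empty cluster simply keeps its old center.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break  # nothing moved, so the iteration has converged
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse


# Hypothetical toy data: the corners of a 4 x 1 rectangle, K = 2.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Starting on the left and right sides finds the natural split.
good_centers, good_sse = lloyd(points, [(0, 0), (4, 0)])

# Starting on the bottom and top edges gets stuck in a worse solution.
bad_centers, bad_sse = lloyd(points, [(2, 0), (2, 1)])

print(good_centers, good_sse)
print(bad_centers, bad_sse)
```

The second run is already at a fixed point of the iteration, so no further step improves it even though its error is much larger than that of the first run.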

Two of these topics, estimating the number of clusters and initializing the cluster centers, are covered in more detail below.

Estimating the number of clusters

There is not yet a generally accepted method for estimating the number of clusters. Some of the common approaches are described here:

    1. Prior knowledge of the data, or a simple analysis of the data, may be enough to determine K.

    2. Change-based algorithms: define a function that reaches an extremum at the correct K.

    3. Structure-based algorithms: compare intra-cluster distances with inter-cluster distances to determine K.

    4. Consistency-matrix-based algorithms: at the correct K, the results of different clustering runs should be more similar to one another, and K is determined from that.

    5. Hierarchical-clustering-based algorithms: merge or split clusters, and stop under some condition to obtain K.

    6. Sampling-based algorithms: draw subsamples, cluster each one, and determine K from the similarity of the results.
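
As one possible instance of approach 3, the sketch below scores each candidate K with the mean silhouette coefficient, which compares a point's average distance to its own cluster (a) with its average distance to the nearest other cluster (b). The choice of the silhouette is an assumption made for illustration, as are the toy data and the deterministic farthest-first seeding, which is used only to keep the example reproducible.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_first(points, k):
    """Deterministic seeding (an assumption, chosen for reproducibility):
    start at the first point, then keep adding the point that is
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans_labels(points, k, max_iter=100):
    """Plain K-means; returns one cluster index per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all points.
    Points in singleton clusters contribute 0. Needs at least 2 clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # dist(p, p) is 0, so dividing by len(own) - 1 averages over the others.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: three tight, well-separated groups of four points.
offsets = [(0, 0), (0, 1), (1, 0), (1, 1)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in offsets]

scores = {k: mean_silhouette(points, kmeans_labels(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On this toy data the score peaks at K = 3, the number of groups that were generated. On real data the peak is usually less clear-cut, and the score is better treated as one piece of evidence than as a definitive answer.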

Initializing the cluster centers

Next are a few methods for initializing the cluster centers that I have come across. It should be emphasized that no single method is right for every scenario. Ideally, you should choose or design a method that suits the characteristics of your data.

    1. K-means++, which has proven to be a simple, easy-to-use method
    2. Compute the center of the whole sample, then the distance from each sample point to that center, and sample uniformly from nearest to farthest to obtain the initial cluster centers
    3. Divide the data into K regions, and use the center of each region as an initial cluster center
    4. Compute a "density" for each point, on the assumption that high-density points make good cluster centers. Pick the point with the highest density as the first cluster center; then, among the remaining points, pick the one with the highest density that is more than a given distance from all existing cluster centers as the next center, and repeat until K centers are selected
    5. Use the overall mean as the first cluster center. Then scan the remaining points; whenever a point is more than a given distance from all existing cluster centers, take it as the next center, until K centers are selected
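
Method 1 can be sketched in a few lines. The version below follows the usual description of k-means++ seeding: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen. The function name `kmeans_pp_init`, the `rng` parameter, and the toy data are assumptions made for this illustration, not part of any library.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers from `points` with k-means++ style seeding.

    `rng` is an optional random.Random instance, so that a run can be
    repeated with the same seed.
    """
    rng = rng or random.Random()
    # First center: uniform over all points.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Far-away points are more likely to be picked; points that are
            # already centers have weight 0 and are not picked again.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        # Only the newest center can shrink a point's nearest distance.
        d2 = [min(old, sq_dist(p, centers[-1])) for p, old in zip(points, d2)]
    return centers


# Hypothetical toy data: three groups of four points each.
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in [(0, 0), (0, 1), (1, 0), (1, 1)]]

centers = kmeans_pp_init(points, 3, random.Random(0))
print(centers)
```

The returned centers are then used as the starting point of the ordinary K-means iteration. Because the seeding is random, different seeds can give different starting centers, so it is still common to run the whole procedure several times and keep the best result.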

Summary of other clustering algorithms

In progress ...
