When we were talking about Kmeans (5)

Last Update:2017-01-16 Source: Internet

Author: User

This series is intended for long-term sharing, and its content may be revised or deleted over time.
If you reproduce it, please be sure to keep the source address. Thank you very much!
Blog Park: http://www.cnblogs.com/data-miner/ (formulas temporarily do not display correctly)
Others: under construction ...

When we were talking about Kmeans: a summary

From the K-means papers read earlier in this series, we can roughly trace how the K-means algorithm developed. I have read only part of the literature, so this outline is certainly incomplete; additions are welcome (please give specific examples).

The K-means algorithm is proposed.
Articles analyzing the properties of the K-means algorithm are published one after another.
The idea of the K-means algorithm is extended:
- The "Maximum Entropy" algorithm is proposed, with K-means shown to be a special case of it
- Later, the "Mean Shift" algorithm is proposed, with "Maximum Entropy" shown to be a special case of it in turn
K-means is modified to address its defects (typically for one specific scenario only):
- Online K-means
- K-means for non-convex data sets
- K-means implemented on FPGAs
- K-means with automatic feature weighting
- Intelligent K-means, which uses the idea of anomaly detection clustering
K-means is optimized:
- K-means accelerated with KD trees
- K-means accelerated with SVD
- k-means++, an algorithm for choosing the initial cluster centers
K-means is merged with new ideas:
- Combining ensembling with K-means

Existing problems with K-means

K-means is widely used in data preprocessing, data analysis, and similar tasks because it is simple and effective. In practical use, however, a number of problems with it have gradually come to light, such as:

High computational cost
The number of clusters K must be set in advance, and it affects the clustering result
The cluster centers must be initialized by the user, and the initialization affects the clustering result
Outliers can affect the clustering result
It can only converge to a local optimum
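
To make the points about K, initialization, and local optima concrete, here is a minimal sketch of the standard K-means iteration (often called Lloyd's algorithm) in plain Python. The function names `sq_dist`, `assign` and `kmeans`, the four-point data set, and both sets of starting centers are illustrative assumptions and are not taken from the papers discussed in this series.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point (ties go to the lower index)."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, init_centers, max_iter=100):
    """Standard K-means iterations from caller-supplied initial centers.

    Returns (centers, labels, sse), where sse is the sum of squared
    distances from each point to its assigned center.
    """
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            # An empty cluster keeps its previous center (one of several conventions).
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, sse


# Four corners of a 4 x 1 rectangle, clustered with K = 2 from two different starts.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, _, sse_left_right = kmeans(points, [(0, 0), (4, 1)])
_, _, sse_top_bottom = kmeans(points, [(2, 0), (2, 1)])
print(sse_left_right, sse_top_bottom)
```

Both runs converge, but to different answers: the first start groups the points into a left pair and a right pair (sum of squared distances 1.0), while the second gets stuck with a top row and a bottom row (sum of squared distances 16). Nothing in the iteration can move the second run out of that state, which is what "local optimum" means here.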

Researchers have analyzed each of these problems and tried to propose solutions:

High computational cost
- Accelerating K-means with KD trees
The number of clusters K must be set in advance, and it affects the clustering result
- A variety of methods for estimating K
The cluster centers must be initialized by the user, and the initialization affects the clustering result
- The k-means++ method
- Other methods of initializing the cluster centers
Outliers can affect the clustering result
- Data preprocessing
It can only converge to a local optimum
- Unknown

Two of these topics ("Estimating the number of clusters" and "Initializing the cluster center") are covered in more detail below.

Estimating the number of clusters

There is not yet a general-purpose method for estimating the number of clusters. Some commonly used approaches are:

Prior knowledge: use what is already known about the data, or a simple analysis of the data.
Change-based algorithms: define a function that reaches an extremum at the correct K.
Structure-based algorithms: compare intra-cluster distances with inter-cluster distances to determine K.
Consistency-matrix-based algorithms: assume that at the correct K the results of different clustering runs are more similar to one another, and use this to determine K.
Hierarchical-clustering-based algorithms: merge or split clusters step by step, and take K at the point where a stopping condition is met.
Sampling-based algorithms: draw several samples of the data, cluster each one, and determine K from the similarity of the results.
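
As one possible illustration of the structure-based idea, the sketch below scores each candidate K with the silhouette coefficient, which compares a point's mean distance to its own cluster with its mean distance to the nearest other cluster, and keeps the K with the highest average score. The silhouette coefficient is my choice of measure here, not something prescribed by this article; the function names, the deterministic farthest-first starting centers, and the toy data are all assumptions made to keep the example small and repeatable.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(points, k):
    """Deterministic start: first point, then repeatedly the point farthest
    from all centers chosen so far (an assumption, used only for repeatability)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans_labels(points, k, max_iter=100):
    """Cluster label of every point after standard K-means iterations."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters
    and no cluster made up solely of identical points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # intra-cluster
        b = min(sum(dist(p, q) for q in other) / len(other)  # nearest other cluster
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def estimate_k(points, k_values):
    """Return the K with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(points, kmeans_labels(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
best_k, scores = estimate_k(points, range(2, 6))
print(best_k, scores)
```

On data this clean the score peaks at the true number of groups. On real data the curve is usually flatter, so a measure like this is better treated as one piece of evidence than as a final answer.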

Initializing the cluster center

Next, here are a few methods I have come across for initializing the cluster centers. It should be emphasized that no method is right for every scenario. Ideally, you should choose or design a method that suits the characteristics of your data.

k-means++, which has proven to be a simple, easy-to-use approach
Compute the center of the whole sample, then the distance from each sample point to that center, and sample uniformly from near to far to obtain the initial cluster centers
Partition the data into K regions and use the center of each region as an initial cluster center
Compute the "density" of each point, on the assumption that points of high "density" are cluster centers. Pick the point with the largest "density" as the first cluster center; then, from the remaining points, pick the one with the largest density that is farther than a given distance from all existing cluster centers as the next center, and repeat until K centers are selected
Use the overall mean as the first cluster center. Then scan the remaining points, and whenever a point is farther than a given distance from all existing cluster centers, take it as the next cluster center, until K centers are selected
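
As an illustration of the first method in this list, here is a minimal sketch of k-means++ seeding in plain Python: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen. The function name `kmeans_pp_init`, the `rng` parameter, and the toy data are assumptions made for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng=None):
    """Choose k initial centers from `points` by k-means++ seeding.

    Assumes `points` contains at least k distinct points; otherwise all
    weights eventually become zero and `rng.choices` raises ValueError.
    """
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Weight of each point: squared distance to its nearest chosen center.
        # Points already chosen get weight 0, so they cannot be picked again.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]
centers = kmeans_pp_init(points, 3, rng=random.Random(0))
print(centers)
```

Because far-away points carry much larger weights, the chosen centers tend to land in different groups, but this is a tendency and not a guarantee. The returned centers are then used as the starting point for the usual K-means iterations.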

Summary of other clustering algorithms

In progress ...

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
