K-Means Algorithm Overview

Source: Internet
Author: User

1. Algorithm Flow

Input: the number of clusters k and a data set containing n data objects. Output: k clusters that satisfy the minimum-variance criterion.
(1) Select k of the n data objects as the initial cluster centers.
(2) Compute the distance from each object to every cluster center, and assign each object to the cluster whose center is nearest.
(3) Recompute the mean of each cluster and use it as the new cluster center.
(4) Repeat steps (2) and (3) until the cluster assignments no longer change.
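
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names (squared_distance, mean_point, kmeans) and the seed parameter are assumptions made for this example, points are assumed to be distinct tuples of numbers, and an empty cluster simply keeps its previous center.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, seed=0):
    """Run steps (1)-(4) and return (centers, labels)."""
    rng = random.Random(seed)
    # Step (1): pick k of the n objects as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    while True:
        # Step (2): assign every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step (4): stop once the assignments no longer change.
        if new_labels == labels:
            return centers, labels
        labels = new_labels
        # Step (3): move each center to the mean of its cluster.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
```

For example, kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2) should put the first three points in one cluster and the last three in the other.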

2. Algorithm Analysis

The K-Means optimization objective can be written as:

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \lVert x_n - \mu_k \rVert^2$$

Here x_n is the n-th data object, μ_k is the center of cluster k, and r_nk is 1 when data point n is assigned to cluster k and 0 otherwise.

The algorithm iterates to find the r_nk and μ_k that minimize J.
Step (2) of the algorithm flow fixes μ_k and updates r_nk: each data object is placed in the cluster of its nearest center, which by construction minimizes J for the fixed μ_k.
Step (3) of the algorithm flow fixes r_nk and updates μ_k. Here we take the derivative of J with respect to μ_k (in practice, with respect to each center separately) and set the result equal to zero:

$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0 \quad\Longrightarrow\quad \mu_k = \frac{\sum_{n=1}^{N} r_{nk} x_n}{\sum_{n=1}^{N} r_{nk}}$$

That is, when each new center is the mean of the points assigned to its cluster, the sum of squared distances within that cluster is as small as it can be. J is the sum of these within-cluster squared distances over all clusters, so this choice of centers minimizes J for the fixed r_nk.
Neither of the two steps can increase J, so J is non-increasing over the iterations and converges to a minimum, which in general is a local minimum rather than the global one.
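
To make the argument concrete, the sketch below records J after every assignment step and every update step, so the non-increasing behaviour can be checked numerically. The helper names (sq_dist, objective, assign, update, objective_trace) and the iterations parameter are assumptions for this illustration only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def objective(points, centers, labels):
    """J: sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))


def assign(points, centers):
    """Fix the centers, update the assignments (step 2)."""
    return [
        min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Fix the assignments, move each center to its cluster mean (step 3)."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
            )
        else:
            new_centers.append(old)  # assumption: empty cluster keeps its center
    return new_centers


def objective_trace(points, centers, iterations=10):
    """Return the values of J after each half-step of each iteration."""
    trace = []
    for _ in range(iterations):
        labels = assign(points, centers)
        trace.append(objective(points, centers, labels))
        centers = update(points, labels, centers)
        trace.append(objective(points, centers, labels))
    return trace
```

Checking all(a >= b for a, b in zip(trace, trace[1:])) on the returned list (allowing for floating-point rounding) is a quick way to confirm that J does not increase from one half-step to the next.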

3. End Condition

K-means iteration can stop on any of the following conditions:
· The members of each cluster no longer change, which is the ideal situation.
· The difference in J between two consecutive iterations is smaller than a threshold value.
· The number of iterations exceeds a preset limit.
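
The three conditions can be combined in a single loop, as in the following sketch. The function name kmeans_with_stopping, the parameters tol and max_iter, and their default values are hypothetical choices for this example; the function returns the reason it stopped so the three cases can be told apart.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_with_stopping(points, centers, tol=1e-6, max_iter=100):
    """Iterate k-means from the given centers.

    Returns (centers, labels, reason, iterations_used).
    """
    centers = list(centers)
    labels, prev_j = None, None
    for iteration in range(1, max_iter + 1):
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Condition 1: cluster membership did not change.
        if new_labels == labels:
            return centers, labels, "assignments unchanged", iteration
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
        j_value = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
        # Condition 2: J changed by less than the threshold.
        if prev_j is not None and prev_j - j_value < tol:
            return centers, labels, "objective change below tol", iteration
        prev_j = j_value
    # Condition 3: the iteration limit was reached.
    return centers, labels, "max iterations reached", max_iter
```

In practice the threshold and the iteration limit act as safety nets: on small, well-separated data the loop usually ends because the assignments stop changing.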

4. Disadvantages

· The value of K is hard to choose in advance. If the data actually falls into 10 categories and K is set to 20, the result may be poor; if K is set to 10, the result is likely to be good.
· Once K is fixed, choosing the initial centers is also a problem. Once the K initial centers are selected, the clustering result is determined: good initial centers give a good clustering, and poor ones give a poor clustering.
I personally think these are the main disadvantages. There are also improved methods, which are not covered here. For details, refer to the Baidu Baike entry on k-means.
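
The dependence on the initial centers can be seen on a tiny example. The sketch below is an illustration under assumed names (sq_dist, run_kmeans) and a made-up data set of four points at the corners of a 4-by-1 rectangle; it runs the same iteration from two different pairs of initial centers and compares the final value of J.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def run_kmeans(points, centers):
    """Iterate until the assignments stop changing; return (centers, labels, J)."""
    centers = list(centers)
    labels = None
    while True:
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            j_value = sum(
                sq_dist(p, centers[l]) for p, l in zip(points, labels)
            )
            return centers, labels, j_value
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )


# Four points at the corners of a rectangle that is 4 wide and 1 high.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start A: one initial center on each short side.
good = run_kmeans(points, [(0.0, 0.0), (4.0, 0.0)])
# Start B: both initial centers on the same short side.
bad = run_kmeans(points, [(0.0, 0.0), (0.0, 1.0)])

print(good[2], bad[2])
```

Start A groups the points by their left and right sides and ends with J = 1.0, while start B gets stuck grouping them by bottom and top rows with J = 16.0, even though K and the data are identical.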

5. Key Points

There are two main points in this article:
the three end conditions of K-means (assignments unchanged, J changing only slightly, iteration limit reached) and the two disadvantages (choosing K, choosing the K initial centers).

6. References

K-MEANS, Baidu Baike, http://baike.baidu.com/view/31854.htm
k-means, Baidu Baike, http://baike.baidu.com/view/3066906.htm
Talking about Clustering (1): k-means, http://blog.pluskid.org/?p=17#comments
