K-means Clustering Algorithm: C++ Implementation


Original: http://www.cnblogs.com/luxiaoxun/archive/2013/05/09/3069594.html

Clustering means grouping similar things together. It differs from classification. A classifier usually needs labeled examples that tell it "this item belongs to class XXX"; ideally, it learns from this training data and can then classify unknown data. Because training data is provided, this process is called supervised learning. In clustering, we do not care what each class represents; the goal is simply to put similar things together. A clustering algorithm therefore only needs to know how to compute similarity before it can start, and it usually needs no training data. In machine learning this is called unsupervised learning.

In data mining, K-means is a cluster analysis algorithm and a very simple distance-based one: each cluster (class) is made up of similar points, similarity is measured by distance, and points in different clusters should be as dissimilar as possible. Each cluster has a "center of gravity". K-means is also an exclusive algorithm, that is, every point belongs to exactly one cluster.

The algorithm is simple to implement, as the following figure illustrates:

In the figure, A, B, C, D, and E are five data points. The gray points are the seed points, which serve as the "centers of gravity" used to find the clusters. There are two seed points, so K = 2.

K-means algorithm steps:

K-means is an iterative algorithm. A typical version runs as follows:

(1) Build an initial partition from the K value given in advance to obtain K clusters; for example, randomly choose K points as the centers of gravity of the K clusters;

(2) Compute the distance from each point to each cluster's center of gravity and assign the point to the nearest cluster;

(3) Recompute the center of gravity of each cluster;

(4) Repeat steps (2) and (3) until no cluster's center of gravity moves by more than a given tolerance, or until the maximum number of iterations is reached.
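
The steps above can be sketched in C++ as follows. This is a minimal illustration, not the code from the linked repository: the names `Point`, `squaredDistance`, `KMeansResult`, and `kMeans` are hypothetical, points are assumed to be 2-D, distance is assumed to be Euclidean, and the initial centers from step (1) are passed in by the caller. A cluster that ends up empty simply keeps its previous center, which is one possible choice among several.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// A 2-D point; the dimension is fixed only to keep the sketch short.
struct Point {
    double x;
    double y;
};

// Squared Euclidean distance; the square root is not needed for comparisons.
double squaredDistance(const Point& a, const Point& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}

struct KMeansResult {
    std::vector<Point> centers;       // final center of gravity of each cluster
    std::vector<std::size_t> labels;  // labels[i] = cluster index of points[i]
    int iterations;                   // number of iterations actually run
};

// Runs steps (2)-(4). 'centers' holds the K initial centers from step (1).
KMeansResult kMeans(const std::vector<Point>& points,
                    std::vector<Point> centers,
                    int maxIterations = 100,
                    double tolerance = 1e-9) {
    assert(!points.empty() && !centers.empty());
    const std::size_t k = centers.size();
    std::vector<std::size_t> labels(points.size(), 0);
    int iterations = 0;

    while (iterations < maxIterations) {
        ++iterations;

        // Step (2): assign every point to its nearest center.
        for (std::size_t i = 0; i < points.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = squaredDistance(points[i], centers[c]);
                if (d < best) {
                    best = d;
                    labels[i] = c;
                }
            }
        }

        // Step (3): recompute each center as the mean of its points.
        std::vector<Point> sums(k, Point{0.0, 0.0});
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < points.size(); ++i) {
            sums[labels[i]].x += points[i].x;
            sums[labels[i]].y += points[i].y;
            ++counts[labels[i]];
        }
        double maxShift = 0.0;  // largest squared movement of any center
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // empty cluster keeps its old center
            const double n = static_cast<double>(counts[c]);
            const Point updated{sums[c].x / n, sums[c].y / n};
            maxShift = std::max(maxShift, squaredDistance(centers[c], updated));
            centers[c] = updated;
        }

        // Step (4): stop once no center moved by more than the tolerance.
        if (maxShift <= tolerance * tolerance) break;
    }
    return {centers, labels, iterations};
}
```

Calling `kMeans(points, initialCenters)` returns the final centers together with one label per point, so the exclusive assignment described earlier is visible directly in `labels`.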

Although the algorithm is simple, many more complex algorithms may perform worse than it in practice. It also has good locality and is easy to parallelize, which matters a great deal for large data sets. The time complexity is O(NKT), where N is the number of data points, K is the number of clusters, and T is the number of iterations.

The K-means algorithm has two major defects, both related to its initial values:

    • K must be given in advance, and a suitable value is very hard to estimate. Often there is no prior knowledge of how many categories a given data set should be divided into. (The ISODATA algorithm arrives at a more reasonable number of clusters K by automatically merging and splitting clusters.)
    • K-means needs initial seed points, which are typically chosen at random. This choice matters a great deal: different seed points can lead to completely different results. (The k-means++ algorithm can be used to address this problem; it selects the initial points more effectively.)
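
As an illustration of the second point, here is a minimal sketch of k-means++ style seeding. The first center is drawn uniformly at random; each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far, so far-away points are favored. The names `Point`, `squaredDistance`, and `seedCenters` are hypothetical, 2-D points and Euclidean distance are assumptions, and this is not taken from the linked repository. `Point` and `squaredDistance` are defined here only so that the block stands alone.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// A 2-D point; the dimension is fixed only to keep the sketch short.
struct Point {
    double x;
    double y;
};

double squaredDistance(const Point& a, const Point& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Chooses k initial centers from 'points' in the k-means++ manner.
std::vector<Point> seedCenters(const std::vector<Point>& points,
                               std::size_t k,
                               std::mt19937& rng) {
    assert(k > 0 && k <= points.size());
    std::vector<Point> centers;
    centers.reserve(k);

    // First center: uniform random choice among all points.
    std::uniform_int_distribution<std::size_t> pickUniform(0, points.size() - 1);
    centers.push_back(points[pickUniform(rng)]);

    // nearest[i] = squared distance from points[i] to its closest chosen center.
    std::vector<double> nearest(points.size(),
                                std::numeric_limits<double>::max());

    while (centers.size() < k) {
        // Only the most recently added center can shrink a distance.
        double total = 0.0;
        for (std::size_t i = 0; i < points.size(); ++i) {
            nearest[i] = std::min(nearest[i],
                                  squaredDistance(points[i], centers.back()));
            total += nearest[i];
        }
        if (total <= 0.0) {
            // Every point coincides with a chosen center; fall back to uniform.
            centers.push_back(points[pickUniform(rng)]);
            continue;
        }
        // Draw the next center with probability proportional to nearest[i].
        std::discrete_distribution<std::size_t> pickWeighted(nearest.begin(),
                                                             nearest.end());
        centers.push_back(points[pickWeighted(rng)]);
    }
    return centers;
}
```

The returned centers can then be used as the initial centers for the iteration. Passing the random engine in by reference lets the caller fix the seed to make runs reproducible.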

K-means algorithm C++ implementation: K-means.rar

GitHub code: https://github.com/luxiaoxun/k-means

The code was found online; I modified it slightly and ran a simple test.
