Original: http://www.cnblogs.com/luxiaoxun/archive/2013/05/09/3069594.html
Clustering means, put simply, grouping similar things together. It contrasts with classification: a classifier usually has to be told "this item belongs to class XXX" through labeled examples. Ideally, the classifier learns from the training data it receives and thereby gains the ability to classify unseen data. This process of providing training data is usually called supervised learning. With clustering, we do not care what any particular class is; the goal is simply to bring similar things together. A clustering algorithm therefore usually only needs to know how to compute similarity to get started, and it generally does not need training data. In machine learning this is called unsupervised learning.
In data mining, K-means is a cluster analysis algorithm, and a very simple distance-based one: each cluster is made up of similar points, with similarity measured by distance, and points in different clusters should be as dissimilar as possible. Each cluster has a "center of gravity" (centroid). It is also an exclusive algorithm: every point must belong to exactly one cluster.
The algorithm is simple to implement, as the following figure shows:
In the figure, A, B, C, D, and E are five data points. The gray points are the seed points, that is, the "centers of gravity" used to find the clusters. There are two seed points, so k = 2.
K-means algorithm steps:
The typical algorithm is iterative and proceeds as follows:
(1) Build an initial partition into K clusters from the given value of K; for example, randomly choose K points as the centers of gravity of the K clusters;
(2) Compute the distance from each point to each cluster's center of gravity, and assign the point to the nearest cluster;
(3) Recompute the center of gravity of each cluster;
(4) Repeat steps (2) and (3) until the centers of gravity change by less than a given tolerance, or the maximum number of iterations is reached.
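The four steps above can be sketched in C++ roughly as follows. This is a minimal illustration, not the code from the linked repository: the names `Point`, `KMeansResult`, `squaredDistance`, and `kMeans`, the default values of `maxIterations` and `tolerance`, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for this example. The caller supplies the initial centroids, which corresponds to step (1).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// A point is a vector of coordinates; all points are assumed to share one dimension.
using Point = std::vector<double>;

// Squared Euclidean distance; the square root is not needed to find the nearest centroid.
double squaredDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

struct KMeansResult {
    std::vector<Point> centroids;     // final centers of gravity, one per cluster
    std::vector<std::size_t> labels;  // labels[i] is the cluster index of points[i]
    int iterations;                   // number of passes actually performed
};

// Hypothetical helper: runs steps (2)-(4) starting from caller-supplied centroids.
// `tolerance` is compared against the largest squared centroid shift in a pass.
KMeansResult kMeans(const std::vector<Point>& points,
                    std::vector<Point> centroids,
                    int maxIterations = 100,
                    double tolerance = 1e-9) {
    assert(!points.empty() && !centroids.empty());
    const std::size_t k = centroids.size();
    const std::size_t dim = points[0].size();
    std::vector<std::size_t> labels(points.size(), 0);
    int iterations = 0;

    while (iterations < maxIterations) {
        ++iterations;

        // Step (2): assign each point to the cluster with the nearest centroid.
        for (std::size_t i = 0; i < points.size(); ++i) {
            double best = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double d = squaredDistance(points[i], centroids[c]);
                if (d < best) {
                    best = d;
                    labels[i] = c;
                }
            }
        }

        // Step (3): recompute each centroid as the mean of its members.
        std::vector<Point> sums(k, Point(dim, 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < points.size(); ++i) {
            ++counts[labels[i]];
            for (std::size_t j = 0; j < dim; ++j) {
                sums[labels[i]][j] += points[i][j];
            }
        }
        double maxShift = 0.0;
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // assumption: an empty cluster keeps its old centroid
            for (std::size_t j = 0; j < dim; ++j) {
                sums[c][j] /= static_cast<double>(counts[c]);
            }
            maxShift = std::max(maxShift, squaredDistance(sums[c], centroids[c]));
            centroids[c] = sums[c];
        }

        // Step (4): stop once no centroid moved by more than the tolerance.
        if (maxShift <= tolerance) break;
    }
    return {centroids, labels, iterations};
}
```

For instance, with the points (0,0), (0,1), (10,10), (10,11) and initial centroids (0,0) and (10,10), this sketch stops on the second pass with centroids (0, 0.5) and (10, 10.5).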
Although the algorithm is simple, many more complex algorithms may perform no better in practice. It also has good locality and is easy to parallelize, which matters for large-scale data sets. The time complexity is O(NKT), where N is the number of data points, K is the number of clusters, and T is the number of iterations.
K-means has two main drawbacks, both related to initial values:
- K must be given in advance, and choosing it is very difficult. Often there is no prior knowledge of how many categories a given data set should be divided into. (The ISODATA algorithm obtains a more reasonable number of classes K by automatically merging and splitting classes.)
- K-means needs initial seed points, usually chosen at random, and they matter a great deal: different random seed points can produce completely different results. (The k-means++ algorithm can be used to address this problem; it selects the initial points effectively.)
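To make the second point more concrete, here is a rough C++ sketch of seeding in the style of k-means++: the first seed is drawn uniformly at random, and each later seed is drawn with probability proportional to its squared distance from the nearest seed already chosen, so far-away points are favored. The names `Point`, `squaredDistance`, and `seedKMeansPlusPlus` are hypothetical, the fallback for the case where every point coincides with a chosen seed is an assumption of this example, and none of this is taken from the linked repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// A point is a vector of coordinates; all points are assumed to share one dimension.
using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double squaredDistance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Hypothetical helper: choose k initial centroids in the style of k-means++.
std::vector<Point> seedKMeansPlusPlus(const std::vector<Point>& points,
                                      std::size_t k,
                                      std::mt19937& rng) {
    assert(k > 0 && k <= points.size());
    std::vector<Point> centroids;
    centroids.reserve(k);

    // First seed: picked uniformly at random from the data.
    std::uniform_int_distribution<std::size_t> uniform(0, points.size() - 1);
    centroids.push_back(points[uniform(rng)]);

    // nearest[i] holds the squared distance from points[i] to its closest seed so far.
    std::vector<double> nearest(points.size(), std::numeric_limits<double>::max());

    while (centroids.size() < k) {
        // Only the most recently added seed can shrink a point's nearest distance.
        double total = 0.0;
        for (std::size_t i = 0; i < points.size(); ++i) {
            nearest[i] = std::min(nearest[i], squaredDistance(points[i], centroids.back()));
            total += nearest[i];
        }
        if (total == 0.0) {
            // Assumption: if every point coincides with a seed, fall back to a uniform pick.
            centroids.push_back(points[uniform(rng)]);
            continue;
        }
        // Next seed: drawn with probability proportional to the squared distance.
        std::discrete_distribution<std::size_t> weighted(nearest.begin(), nearest.end());
        centroids.push_back(points[weighted(rng)]);
    }
    return centroids;
}
```

The returned seeds can then be used as the initial centers of gravity in step (1) of the algorithm. Because points that coincide with an existing seed have weight zero, duplicates are avoided unless the data itself leaves no other choice.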
K-means algorithm C++ implementation: K-means.rar
GitHub code: https://github.com/luxiaoxun/k-means
The code comes from the Internet; I modified it slightly and ran a simple test.
K-means clustering algorithm C++ implementation