Cluster analysis groups data objects into clusters based only on the information found in the data that describes the objects and their relationships. The goal is that objects within a group are similar to one another and objects in different groups are different. The greater the similarity within a group and the greater the difference between groups, the better the clustering.
The different types of clusterings are introduced first, usually in the following categories:
(1) Hierarchical versus partitional: if clusters are allowed to have sub-clusters, we get a hierarchical clustering. A hierarchical clustering is a set of nested clusters organized as a tree. A partitional clustering simply divides the data objects into non-overlapping subsets (clusters) so that each data object is in exactly one subset.
(2) Exclusive, overlapping, and fuzzy: exclusive means that each object is assigned to a single cluster. Overlapping or fuzzy clustering is used to reflect the fact that an object can belong to more than one group at the same time. In fuzzy clustering, every data object belongs to every cluster with a membership weight between 0 and 1, and the membership weights of each object across all clusters often sum to 1.
(3) Complete versus partial: a complete clustering assigns every object to a cluster. In a partial clustering, some objects, such as noise objects, may not belong to any group.
The clusters found by a clustering algorithm also come in different types:
(1) Well-separated: a cluster is a set of objects such that the distance between any two points in different clusters is greater than the distance between any two points within the same cluster.
(2) Prototype-based: a cluster is a set of objects in which each object is closer (or more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is usually its centroid, that is, the mean of all the points in the cluster. Such clusters tend to be globular.
(3) Graph-based: if the data is represented as a graph in which the nodes are objects and the edges represent relationships between objects, then a cluster can be defined as a connected component, i.e., a group of objects that are connected to one another but not connected to objects outside the group. An important example of graph-based clusters is contiguity-based clusters, in which two objects are connected only if the distance between them is within a specified range. That is, each object in such a cluster is closer to some other object in the cluster than to any point in a different cluster.
(4) Density-based: a cluster is a dense region of objects surrounded by a region of low density. A density-based definition of a cluster is often used when the clusters are irregular or intertwined, and when noise and outliers are present.
Here are three common clustering algorithms:
(1) Basic K-means: a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K) among all the data objects.
(2) Agglomerative hierarchical clustering: start with each point as its own cluster, then repeatedly merge the two closest clusters until the specified number of clusters is reached.
(3) DBSCAN: a partitional, density-based clustering algorithm.
In this post, we take the clustering of data points in two-dimensional space as an example and introduce the three clustering algorithms in turn. The source file we use represents the data points of the two-dimensional space, one data point per line, in the format x-coordinate#y-coordinate.
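For illustration only, a minimal sketch of loading such a file might look like the following; the file name points.txt and the function name are placeholders of mine, not taken from the linked implementations:

def load_points(path):
    # Each non-empty line is expected to look like "1.5#2.0" (x#y).
    points = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            x, y = line.split('#')
            points.append((float(x), float(y)))
    return points

# e.g. points = load_points('points.txt')  # 'points.txt' is a placeholder name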
Basic K-means: select K initial centroids, where K is a user-specified parameter, namely the expected number of clusters. In each iteration, every point is assigned to its nearest centroid, and the set of points assigned to the same centroid forms a cluster. Then the centroid of each cluster is updated according to the points assigned to that cluster. The assignment and update steps are repeated until the centroids no longer change significantly.
To define "nearest" between data points in two-dimensional space, we use the squared Euclidean distance: for point A (x1, y1) and point B (x2, y2), dist(A, B) = (x1 - x2)^2 + (y1 - y2)^2. In addition, we use the sum of squared errors (SSE) as the global objective function, that is, we minimize the sum of the squared Euclidean distances from each point to its nearest centroid. Given this SSE objective, it can be proved mathematically that the centroid that minimizes a cluster's SSE is the mean of all the data points in the cluster.
According to the algorithm, the following code is implemented:
https://github.com/intergret/snippet/blob/master/Kmeans.py
or http://www.oschina.net/code/snippet_176897_14731.
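The links above point to the actual implementation; purely as a rough illustration of the loop described here, a minimal K-means sketch using the squared Euclidean distance and the mean-of-cluster update might look as follows (all names are my own, not those in the linked file):

import random

def dist2(a, b):
    # squared Euclidean distance between two 2-D points
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)   # k distinct initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist2(p, centroids[j]))
            clusters[i].append(p)
        # update step: each centroid becomes the mean of the points in its cluster
        new_centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:     # stop when the centroids no longer move
            break
        centroids = new_centroids
    return centroids, clusters

# The SSE objective can then be computed as:
# sse = sum(dist2(p, centroids[i]) for i, c in enumerate(clusters) for p in c)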
The clustering result is shown in the figure: the polylines are the update trajectories of the centroids of the 3 clusters over the iterations, and the black points are the initial centroids:
Looking at the steps of the basic K-means algorithm and the clustering result above, we can see that the algorithm assigns every data point to some cluster and cannot recognize noise points. In addition, selecting suitable initial centroids is the key to basic K-means. In fact, as long as two of the initial centroids fall anywhere within a pair of clusters, an optimal clustering can be obtained, because the centroids will redistribute themselves, one to each cluster, which minimizes the SSE. If a pair of clusters receives only one initial centroid, the basic K-means algorithm cannot redistribute a centroid into the other cluster of the pair, and only a local optimum is reached. In addition, it cannot handle non-spherical clusters, or clusters of different sizes and densities.
For more about the basic K-means algorithm, you can also check Chen Hao's blog: http://coolshell.cn/articles/7779.html
Agglomerative hierarchical clustering: "agglomerative" means that the algorithm starts with each point as its own cluster and at each step merges the two closest clusters. Moreover, even at the end, noise points or outliers often remain as clusters of their own, unless merging is taken too far. For "closest" there are three common definitions (a small sketch of all three follows the list below). My implementation uses MIN: at each merge step, take the currently closest point pair and, if the two points are not already in the same cluster, merge the two clusters that contain them:
(1) Single link (MIN): the proximity of two clusters is defined as the distance between the two closest points in the two different clusters.
(2) Complete link (MAX): the proximity of two clusters is defined as the distance between the two farthest points in the two different clusters.
(3) Group average: the proximity of two clusters is defined as the average of all pairwise proximities between points taken from the two different clusters.
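As a small sketch of these three definitions (illustration only, not the linked code), each can be written as a function over two clusters of 2-D points, given a pairwise distance function such as the dist2 helper above:

def single_link(c1, c2, dist):
    # MIN: distance between the two closest points of the two clusters
    return min(dist(p, q) for p in c1 for q in c2)

def complete_link(c1, c2, dist):
    # MAX: distance between the two farthest points of the two clusters
    return max(dist(p, q) for p in c1 for q in c2)

def group_average(c1, c2, dist):
    # average of all pairwise distances between points of the two clusters
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))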
According to the algorithm, the following code is implemented. It computes the distance of every point pair at the start and merges pairs in order of increasing distance. In addition, to prevent excessive merging, the defined exit condition is that 90% of the clusters have been merged, that is, the current number of clusters is 10% of the initial number:
https://github.com/intergret/snippet/blob/master/HAC.py
or http://www.oschina.net/code/snippet_176897_14732.
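The linked HAC.py is the actual implementation; as a rough sketch only of the MIN-linkage merge loop with the 90% stopping rule described above (names and details are my own assumptions):

def hac_min(points, dist2, stop_ratio=0.1):
    # every point starts as its own cluster; labels[i] is the cluster id of point i
    labels = list(range(len(points)))
    n_clusters = len(points)
    target = max(1, int(n_clusters * stop_ratio))   # stop when only 10% of the clusters remain
    # all point pairs, sorted so that the closest pairs are merged first (MIN linkage)
    pairs = sorted((dist2(points[i], points[j]), i, j)
                   for i in range(len(points)) for j in range(i + 1, len(points)))
    for _, i, j in pairs:
        if n_clusters <= target:
            break
        a, b = labels[i], labels[j]
        if a != b:                                   # the pair spans two different clusters
            labels = [a if x == b else x for x in labels]   # merge cluster b into cluster a
            n_clusters -= 1
    return labels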
The clustering result is shown below; the black points are noise points:
In addition, we can see that agglomerative hierarchical clustering has no global objective function analogous to that of basic K-means, so there is no local-minimum problem and no difficulty in choosing initial points. Merge operations are final: once two clusters have been merged, the decision is never revoked. Of course, the computational and storage costs are expensive.
DBSCAN: a simple, density-based clustering algorithm. In this implementation, DBSCAN uses a center-based approach: the density of each data point is measured by the number of other points inside the square (neighborhood) of side length 2*eps centered on that point. Based on this density, data points fall into three types:
(1) Core point: the number of points within its neighborhood exceeds the given threshold MinPts.
(2) Boundary point: not a core point, but its neighborhood contains at least one core point.
(3) Noise point: neither a core point nor a boundary point.
With the above classification of the data points, clustering can be carried out: each core point is placed in the same cluster as all the core points within its neighborhood, and each boundary point is placed in the same cluster as one of the core points within its neighborhood.
According to the algorithm, the following code is implemented:
https://github.com/intergret/snippet/blob/master/Dbscan.py
or http://www.oschina.net/code/snippet_176897_14734.
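Again, the linked Dbscan.py is the actual code; a rough sketch of the idea, using the square neighborhood of side 2*eps and the MinPts threshold described above (function names and details are assumptions of mine):

def square_neighbors(points, i, eps):
    # indices of the other points inside the square of side 2*eps centered on points[i]
    x, y = points[i]
    return [j for j, (px, py) in enumerate(points)
            if j != i and abs(px - x) <= eps and abs(py - y) <= eps]

def dbscan_like(points, eps, min_pts):
    neighbors = [square_neighbors(points, i, eps) for i in range(len(points))]
    core = {i for i in range(len(points)) if len(neighbors[i]) >= min_pts}
    labels = [None] * len(points)
    cluster = 0
    # each core point joins the same cluster as the core points in its neighborhood
    for i in core:
        if labels[i] is not None:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if q in core and labels[q] is None:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    # each boundary point joins the cluster of some core point in its neighborhood
    for i in range(len(points)):
        if labels[i] is None:
            for q in neighbors[i]:
                if q in core:
                    labels[i] = labels[q]
                    break
    return labels   # points still labeled None are noise points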
The clustering result is shown below; the black points are noise points:
Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can handle clusters of arbitrary shapes and sizes. However, it has trouble when cluster densities vary greatly. For example, with four clusters A, B, C, and D, where the density of A and B is much larger than that of C and D, and the density of the noise near A and B is comparable to the density of clusters C and D: when MinPts is large, clusters C and D are not recognized, and both they and the noise near A and B are treated as noise; when MinPts is small, clusters C and D are recognized, but the noise around A and B is also recognized as clusters. This problem can be addressed by clustering based on shared nearest neighbors (SNN).
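SNN-based clustering is not part of the linked code; just to illustrate the idea, the shared-nearest-neighbor similarity of two points counts how many of their k nearest neighbors they have in common, which depends far less on absolute density than a raw distance does (a hypothetical sketch):

def knn(points, i, k, dist2):
    # indices of the k nearest neighbors of points[i]
    others = [j for j in range(len(points)) if j != i]
    return set(sorted(others, key=lambda j: dist2(points[i], points[j]))[:k])

def snn_similarity(points, i, j, k, dist2):
    # number of neighbors shared by the k-nearest-neighbor sets of points[i] and points[j]
    return len(knn(points, i, k, dist2) & knn(points, j, k, dist2))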
Clustering algorithms: K-means, agglomerative hierarchical clustering, and DBSCAN