Summary of Clustering Algorithms (Reprinted)

Source: Internet
Author: User


Clustering Algorithm Summary:
---------------------------------------------------------
Categories of clustering algorithms:


Partition-based clustering algorithms (partitional clustering):
K-means: a typical partitional clustering algorithm. It represents each cluster by its center, so the representative points chosen during iteration are not necessarily points in the data set, and the algorithm can only process numerical data.
K-modes: an extension of K-means that uses a simple matching measure to compute the similarity of categorical data.
K-prototypes: combines K-means and K-modes to handle mixed-type data.
K-medoids: selects an actual point in each cluster as its representative during iteration; PAM is a typical K-medoids algorithm.
CLARA: builds on PAM and uses sampling techniques to handle large-scale data.
CLARANS: combines the advantages of PAM and CLARA, and was the first clustering algorithm aimed at spatial databases.
Focused CLARANS: uses spatial indexing techniques to improve the efficiency of CLARANS.
PCM: introduces fuzzy set theory into cluster analysis; PCM is the resulting fuzzy clustering algorithm.
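The K-means idea that anchors this family can be sketched in a few lines: alternate between assigning every point to its nearest center and recomputing each center as the mean of its cluster. This is a minimal illustrative sketch, not a production implementation (no convergence test, no empty-cluster handling beyond skipping the update):

```python
# Minimal K-means sketch: each cluster is represented by its centroid, which
# need not be an actual data point (unlike K-medoids).
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)  # initial centers drawn from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters
```

On well-separated numerical data a handful of iterations is enough; on real data, restarts with different seeds are usually needed because the result depends on initialization.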


Hierarchical clustering algorithms:
CURE: draws a random sample from the data set D, partitions the sample, clusters each partition locally, and finally merges the local clusters into a global clustering.
ROCK: uses random sampling and measures the similarity of two objects by taking the influence of surrounding objects (links) into account.
Chameleon: first builds a k-nearest-neighbor graph GK over the data set, then splits GK into many sub-graphs with a graph-partitioning algorithm, each sub-graph representing an initial sub-cluster; finally, an agglomerative hierarchical clustering algorithm repeatedly merges the sub-clusters to find the true result clusters.
SBAC: when computing the similarity between objects, takes into account how important each attribute is to the essence of the object, assigning higher weights to the attributes that capture that essence.
BIRCH: processes the data set with a tree structure; each leaf node stores a cluster represented by its center and radius. Objects are processed sequentially and assigned to the nearest node. The algorithm can also serve as a preprocessing step for other clustering algorithms.
BUBBLE: generalizes BIRCH's center-and-radius concepts to general distance (metric) spaces.
BUBBLE-FM: improves the efficiency of BUBBLE by reducing the number of distance computations.


Density-based clustering algorithms:
DBSCAN: a typical density-based clustering algorithm. It uses spatial indexing to search an object's neighborhood, introduces the concepts of "core object" and "density-reachable", and grows a cluster from each core object by collecting all objects that are density-reachable from it.
GDBSCAN: generalizes the neighborhood concept of DBSCAN to suit the characteristics of spatial objects.
DBLASD:
OPTICS: combines automatic and interactive clustering by producing a cluster ordering of the points; different parameters can then be applied to this ordering to obtain clusterings that satisfy the user.
FDC: partitions the whole data space into rectangular regions by building a k-d tree, which can greatly improve the efficiency of DBSCAN when the dimensionality of the space is small.
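The core-object and density-reachability mechanism of DBSCAN can be sketched directly; this illustrative version replaces the spatial index with a brute-force O(n^2) neighborhood scan:

```python
# Minimal DBSCAN sketch: a point with at least min_pts neighbours within eps
# is a "core object"; a cluster is grown from each core object by collecting
# every point that is density-reachable from it. Unreached points get the
# noise label -1.
def dbscan(points, eps, min_pts):
    def neighbours(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # noise (may become a border point)
            continue
        cluster += 1                       # i is a core object: start a cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                       # expand via density-reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:         # j is itself a core object
                queue.extend(jn)
    return labels
```

The eps and min_pts parameters are the price of not fixing a cluster count: clusters of arbitrary shape fall out, but the result is sensitive to both values, which is exactly the gap OPTICS addresses.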


Grid-based clustering algorithms:
STING: uses grid cells to store statistical summaries of the data for multi-resolution clustering.
WaveCluster: introduces the wavelet transform, a principle from signal processing, into cluster analysis. (Note: wavelet methods also have important applications in signal processing, image processing, and encryption, and are a mature, powerful tool.)
CLIQUE: a clustering algorithm that combines grid-based and density-based ideas.
Optigrid:
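The idea these grid methods share can be sketched with a toy routine (illustrative only, not STING or CLIQUE themselves): bin the points into fixed-size cells, keep only cells holding at least a threshold number of points, and join axis-adjacent dense cells into clusters. Points falling in sparse cells are simply dropped as noise here.

```python
# Grid-based clustering sketch: cluster dense grid cells instead of raw points,
# so the cost depends on the number of cells rather than on n^2 point pairs.
from collections import defaultdict

def grid_cluster(points, cell_size, density_threshold):
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)   # which cell p falls in
        cells[key].append(p)
    dense = {k for k, pts in cells.items() if len(pts) >= density_threshold}

    clusters, seen = [], set()
    for start in dense:                    # flood-fill over adjacent dense cells
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            cell = stack.pop()
            comp.extend(cells[cell])
            for dim in range(len(cell)):   # axis-neighbouring cells
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(comp)
    return clusters
```

The real algorithms refine this skeleton: STING keeps hierarchical statistics per cell for multi-resolution queries, and CLIQUE searches for dense cells subspace by subspace.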


Clustering algorithms based on neural networks:
Self-organizing map (SOM): the basic idea is to feed input samples into an artificial self-organizing map network. Initially, different input samples excite output cells at different positions, but through self-organization, groups of cells form that represent the input samples and reflect their characteristics.
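The training loop behind this description can be sketched for a one-dimensional map; this is a bare-bones illustration (fixed learning rate and neighborhood radius, whereas practical SOMs decay both over time):

```python
# One-dimensional SOM sketch: each sample excites its best-matching unit (BMU),
# and the BMU plus its neighbours on the map are pulled toward the sample, so
# nearby units come to represent similar inputs.
import random

def train_som(samples, n_units, epochs=50, lr=0.3, radius=1, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(epochs):
        for x in samples:
            # best-matching unit: the weight vector closest to the input
            bmu = min(range(n_units),
                      key=lambda u: sum((w - v) ** 2
                                        for w, v in zip(units[u], x)))
            for u in range(n_units):
                if abs(u - bmu) <= radius:   # neighbourhood on the 1-D map
                    units[u] = [w + lr * (v - w) for w, v in zip(units[u], x)]
    return units
```

With radius 0 this reduces to plain competitive learning (each unit acts like a K-means centroid); the neighborhood radius is what gives SOM its topology-preserving character.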





Clustering algorithms based on statistics:
COBWEB: a general conceptual clustering method that represents a hierarchical clustering as a classification tree.
CLASSIT:
AutoClass: based on a probabilistic mixture model; it describes clusters by the probability distributions of the attributes, can handle mixed data, but requires the attributes to be independent.



---------------------------------------------------------
Several commonly used clustering algorithms are evaluated below along six criteria: scalability, suitable data type, ability to handle high-dimensional data, robustness to anomalous data, cluster shape, and algorithm efficiency. The results are shown in Table 1:

Algorithm      Scalability  Data type  High-dim ability  Outlier robustness  Cluster shape  Efficiency
WaveCluster    Very high    Numeric    Very high         High                Arbitrary      Very high
ROCK           Very high    Mixed      Very high         Very high           Arbitrary      Medium
BIRCH          High         Numeric    Low               Low                 Spherical      Very high
CURE           High         Numeric    Medium            Very high           Arbitrary      High
K-prototypes   Medium       Mixed      Low               Low                 Arbitrary      Medium
DENCLUE        Low          Numeric    High              Medium              Arbitrary      High
OptiGrid       Medium       Numeric    High              Medium              Arbitrary      Medium
CLIQUE         High         Numeric    High              High                Arbitrary      Low
DBSCAN         Medium       Numeric    Low               High                Arbitrary      Medium
CLARANS        Low          Numeric    Low               High                Spherical      Low


---------------------------------------------------------
Main directions of current cluster analysis research:


Clustering is a hot research direction in data mining. Because the clustering methods described above all have shortcomings, much recent research in cluster analysis focuses on improving existing methods or proposing new ones. The following is a brief summary of the problems in traditional clustering methods and the efforts made to address them:


1. As the analysis above shows, traditional clustering methods, whether K-means or CURE, require the user to determine the number of clusters before clustering. In real data, however, the number of clusters is unknown, and a suitable value is usually found through repeated experiments in order to obtain good clustering results.
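That repeated experimentation can be made systematic: run the algorithm for several candidate values of k and compare the within-cluster sum of squared errors (SSE), looking for the point of diminishing returns (often called the "elbow"). The tiny K-means below is an illustrative helper for this sketch, not any specific published method:

```python
# Pick k by experiment: compute the best SSE found over a few random restarts
# of K-means for each candidate k, then compare how SSE falls as k grows.
import random

def kmeans_sse(points, k, iters=25, restarts=5, seed=0):
    rng = random.Random(seed)
    best_sse = None
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(p, centers[c])))
                clusters[i].append(p)
            centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                       else centers[i] for i, cl in enumerate(clusters)]
        # SSE: squared distance of every point to its nearest final center
        sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
                  for p in points)
        best_sse = sse if best_sse is None else min(best_sse, sse)
    return best_sse
```

On data with two well-separated groups, SSE collapses when k reaches 2 and improves only marginally afterwards, which is the signal used to choose k.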

2. Traditional clustering methods are generally suited to one particular situation and cannot handle clustering under a variety of conditions. For example, BIRCH clusters spherical clusters well but does poorly on irregular ones, while K-medoids is little affected by outliers but is computationally very expensive. How to solve this has become a research hotspot: some scholars have proposed combining different clustering ideas into new hybrid algorithms, so that the advantages of different algorithms can be exploited; using several clustering methods within a single clustering process can effectively alleviate the problem.

3. With the arrival of the information age, analyzing and processing massive amounts of data is an enormous task, and it raises a computational efficiency problem. One proposed clustering algorithm based on the minimum spanning tree obtains the clustering result by discarding the longest edges one by one; when the length of an edge exceeds a certain threshold, the edge is discarded directly without further computation, which greatly improves efficiency and reduces computational cost.
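The minimum-spanning-tree idea described above can be sketched as follows (an illustrative toy, not the specific algorithm the paragraph refers to): build an MST over the points with Prim's algorithm, discard the longest edges, and read the remaining connected components off as clusters.

```python
# MST-based clustering sketch: long MST edges bridge well-separated groups,
# so removing the (n_clusters - 1) longest edges splits the tree into clusters.
def mst_clusters(points, n_clusters):
    n = len(points)
    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    # Prim's algorithm: grow the tree from point 0
    in_tree = {0}
    edges = []
    best = {i: (dist2(0, i), 0) for i in range(1, n)}
    while len(in_tree) < n:
        i = min(best, key=lambda k: best[k][0])
        d, parent = best.pop(i)
        in_tree.add(i)
        edges.append((d, parent, i))
        for j in best:
            dj = dist2(i, j)
            if dj < best[j][0]:
                best[j] = (dj, i)

    # keep all but the n_clusters - 1 longest edges
    edges.sort()
    keep = edges[: n - n_clusters]

    # connected components of the kept edges via union-find
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, a, b in keep:
        parent[find(a)] = find(b)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(points[i])
    return list(comps.values())
```

The thresholding variant in the text is the same idea streamed: an edge longer than the threshold is never even completed, saving the distance computation entirely.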

4. The ability to handle large-scale and high-dimensional data needs improvement. Many current clustering methods perform well on small, low-dimensional data, but their performance drops sharply as the data size and dimensionality grow. K-medoids, for example, handles small data sets very well, but its efficiency gradually decreases as the data volume increases. In fact, most real-world data sets are large and high-dimensional. PCKA (projected clustering based on the K-means algorithm) was proposed for clustering in high-dimensional spaces: it selects the attribute-relevant dimensions, removes the irrelevant ones, and clusters the high-dimensional data along the relevant dimensions.
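The dimension-selection step of that projected-clustering idea can be sketched; note the scoring rule here (per-dimension variance) is a simplifying stand-in chosen for illustration, not the criterion PCKA itself uses:

```python
# Projected-clustering sketch, step 1: score each dimension, keep the n_keep
# most relevant ones, and project the points onto them before clustering.
def select_dims(points, n_keep):
    n, dims = len(points), len(points[0])
    def variance(d):
        vals = [p[d] for p in points]
        m = sum(vals) / n
        return sum((v - m) ** 2 for v in vals) / n
    # rank dimensions by variance and keep the top n_keep (in original order)
    ranked = sorted(range(dims), key=variance, reverse=True)
    return sorted(ranked[:n_keep])

def project(points, dims):
    return [tuple(p[d] for d in dims) for p in points]
```

Any of the clustering routines above can then run on the projected points; since distances in high dimensions are dominated by irrelevant attributes, dropping them is what makes the subsequent clustering meaningful.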

5. At present, many algorithms exist only in theory and often rest on assumptions such as well-separated clusters and the absence of prominent outliers. Real data are usually complex and very noisy, so how to effectively eliminate the influence of noise and improve the ability to handle real data needs further study.
