Summary of Clustering Algorithms (Reprinted)

Source: Internet
Author: User


Clustering Algorithm Summary:
---------------------------------------------------------
Categories of clustering algorithms:


Partition-based clustering algorithms (partitional clustering):
K-means: a typical partitional clustering algorithm. It represents each cluster by its center, so the representative points chosen during iteration are not necessarily points in the data set, and the algorithm can only process numerical data.
K-modes: an extension of K-means that uses a simple matching measure to compute the similarity of categorical data.
K-prototypes: combines K-means and K-modes to handle mixed-type data.
K-medoids: selects an actual point in each cluster as its representative during iteration; PAM is a typical K-medoids algorithm.
CLARA: builds on PAM and uses sampling techniques to handle large-scale data.
CLARANS: combines the advantages of PAM and CLARA, and was the first clustering algorithm aimed at spatial databases.
Focused CLARANS: uses spatial indexing techniques to improve the efficiency of CLARANS.
PCM: introduces fuzzy set theory into cluster analysis; PCM is the resulting fuzzy clustering algorithm.
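The K-means idea that anchors this family can be sketched in a few lines: alternate between assigning every point to its nearest center and recomputing each center as the mean of its cluster. This is a minimal illustrative sketch, not a production implementation (no convergence test, no empty-cluster handling beyond skipping the update):

```python
# Minimal K-means sketch: each cluster is represented by its centroid, which
# need not be an actual data point (unlike K-medoids).
import random

def kmeans(points, k, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)  # initial centers drawn from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters
```

On well-separated numerical data a handful of iterations is enough; on real data, restarts with different seeds are usually needed because the result depends on initialization.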


Hierarchical clustering algorithms:
CURE: draws a random sample from the data set D, partitions the sample, clusters each partition locally, and finally merges the local clusters into a global clustering.
ROCK: uses random sampling and measures the similarity of two objects by taking the influence of surrounding objects (links) into account.
Chameleon: first builds a k-nearest-neighbor graph GK over the data set, then splits GK into many sub-graphs with a graph-partitioning algorithm, each sub-graph representing an initial sub-cluster; finally, an agglomerative hierarchical clustering algorithm repeatedly merges the sub-clusters to find the true result clusters.
SBAC: when computing the similarity between objects, takes into account how important each attribute is to the essence of the object, assigning higher weights to the attributes that capture that essence.
BIRCH: processes the data set with a tree structure; each leaf node stores a cluster represented by its center and radius. Objects are processed sequentially and assigned to the nearest node. The algorithm can also serve as a preprocessing step for other clustering algorithms.
BUBBLE: generalizes BIRCH's center-and-radius concepts to general distance (metric) spaces.
BUBBLE-FM: improves the efficiency of BUBBLE by reducing the number of distance computations.


Density-based clustering algorithms:
DBSCAN: a typical density-based clustering algorithm. It uses spatial indexing to search an object's neighborhood, introduces the concepts of "core object" and "density-reachable", and grows a cluster from each core object by collecting all objects that are density-reachable from it.
GDBSCAN: generalizes the neighborhood concept of DBSCAN to suit the characteristics of spatial objects.
DBLASD:
OPTICS: combines automatic and interactive clustering by producing a cluster ordering of the points; different parameters can then be applied to this ordering to obtain clusterings that satisfy the user.
FDC: partitions the whole data space into rectangular regions by building a k-d tree, which can greatly improve the efficiency of DBSCAN when the dimensionality of the space is small.
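The core-object and density-reachability mechanism of DBSCAN can be sketched directly; this illustrative version replaces the spatial index with a brute-force O(n^2) neighborhood scan:

```python
# Minimal DBSCAN sketch: a point with at least min_pts neighbours within eps
# is a "core object"; a cluster is grown from each core object by collecting
# every point that is density-reachable from it. Unreached points get the
# noise label -1.
def dbscan(points, eps, min_pts):
    def neighbours(i):
        return [j for j in range(len(points))
                if sum((a - b) ** 2
                       for a, b in zip(points[i], points[j])) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1                 # noise (may become a border point)
            continue
        cluster += 1                       # i is a core object: start a cluster
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                       # expand via density-reachability
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:         # j is itself a core object
                queue.extend(jn)
    return labels
```

The eps and min_pts parameters are the price of not fixing a cluster count: clusters of arbitrary shape fall out, but the result is sensitive to both values, which is exactly the gap OPTICS addresses.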


Grid-based clustering algorithms:
STING: uses grid cells to store statistical summaries of the data for multi-resolution clustering.
WaveCluster: introduces the wavelet transform, a principle from signal processing, into cluster analysis. (Note: wavelet methods also have important applications in signal processing, image processing, and encryption, and are a mature, powerful tool.)
CLIQUE: a clustering algorithm that combines grid-based and density-based ideas.
Optigrid:
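The idea these grid methods share can be sketched with a toy routine (illustrative only, not STING or CLIQUE themselves): bin the points into fixed-size cells, keep only cells holding at least a threshold number of points, and join axis-adjacent dense cells into clusters. Points falling in sparse cells are simply dropped as noise here.

```python
# Grid-based clustering sketch: cluster dense grid cells instead of raw points,
# so the cost depends on the number of cells rather than on n^2 point pairs.
from collections import defaultdict

def grid_cluster(points, cell_size, density_threshold):
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)   # which cell p falls in
        cells[key].append(p)
    dense = {k for k, pts in cells.items() if len(pts) >= density_threshold}

    clusters, seen = [], set()
    for start in dense:                    # flood-fill over adjacent dense cells
        if start in seen:
            continue
        comp, stack = [], [start]
        seen.add(start)
        while stack:
            cell = stack.pop()
            comp.extend(cells[cell])
            for dim in range(len(cell)):   # axis-neighbouring cells
                for step in (-1, 1):
                    nb = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(comp)
    return clusters
```

The real algorithms refine this skeleton: STING keeps hierarchical statistics per cell for multi-resolution queries, and CLIQUE searches for dense cells subspace by subspace.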


Clustering algorithms based on neural networks:
Self-organizing map (SOM): the basic idea is to feed input samples into an artificial self-organizing map network. Initially, different input samples excite output cells at different positions, but through self-organization, groups of cells form that represent the input samples and reflect their characteristics.
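The training loop behind this description can be sketched for a one-dimensional map; this is a bare-bones illustration (fixed learning rate and neighborhood radius, whereas practical SOMs decay both over time):

```python
# One-dimensional SOM sketch: each sample excites its best-matching unit (BMU),
# and the BMU plus its neighbours on the map are pulled toward the sample, so
# nearby units come to represent similar inputs.
import random

def train_som(samples, n_units, epochs=50, lr=0.3, radius=1, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(epochs):
        for x in samples:
            # best-matching unit: the weight vector closest to the input
            bmu = min(range(n_units),
                      key=lambda u: sum((w - v) ** 2
                                        for w, v in zip(units[u], x)))
            for u in range(n_units):
                if abs(u - bmu) <= radius:   # neighbourhood on the 1-D map
                    units[u] = [w + lr * (v - w) for w, v in zip(units[u], x)]
    return units
```

With radius 0 this reduces to plain competitive learning (each unit acts like a K-means centroid); the neighborhood radius is what gives SOM its topology-preserving character.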





Clustering algorithms based on statistics:
COBWEB: a general conceptual clustering method that represents a hierarchical clustering as a classification tree.
CLASSIT:
AutoClass: based on a probabilistic mixture model; it describes clusters by the probability distributions of the attributes, can handle mixed data, but requires the attributes to be independent.



---------------------------------------------------------
Several commonly used clustering algorithms are evaluated below along six criteria: scalability, suitable data type, ability to handle high-dimensional data, robustness to anomalous data, cluster shape, and algorithm efficiency. The results are shown in Table 1:

Algorithm      Scalability  Data type  High-dim ability  Outlier robustness  Cluster shape  Efficiency
WaveCluster    Very high    Numeric    Very high         High                Arbitrary      Very high
ROCK           Very high    Mixed      Very high         Very high           Arbitrary      Medium
BIRCH          High         Numeric    Low               Low                 Spherical      Very high
CURE           High         Numeric    Medium            Very high           Arbitrary      High
K-prototypes   Medium       Mixed      Low               Low                 Arbitrary      Medium
DENCLUE        Low          Numeric    High              Medium              Arbitrary      High
OptiGrid       Medium       Numeric    High              Medium              Arbitrary      Medium
CLIQUE         High         Numeric    High              High                Arbitrary      Low
DBSCAN         Medium       Numeric    Low               High                Arbitrary      Medium
CLARANS        Low          Numeric    Low               High                Spherical      Low


---------------------------------------------------------
Main directions of current cluster analysis research:


Clustering is a hot research direction in data mining. Because the clustering methods described above all have shortcomings, much recent research in cluster analysis focuses on improving existing methods or proposing new ones. The following is a brief summary of the problems in traditional clustering methods and the efforts made to address them:


1. As the analysis above shows, traditional clustering methods, whether K-means or CURE, require the user to determine the number of clusters before clustering. In real data, however, the number of clusters is unknown, and a suitable value is usually found through repeated experiments in order to obtain good clustering results.
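That repeated experimentation can be made systematic: run the algorithm for several candidate values of k and compare the within-cluster sum of squared errors (SSE), looking for the point of diminishing returns (often called the "elbow"). The tiny K-means below is an illustrative helper for this sketch, not any specific published method:

```python
# Pick k by experiment: compute the best SSE found over a few random restarts
# of K-means for each candidate k, then compare how SSE falls as k grows.
import random

def kmeans_sse(points, k, iters=25, restarts=5, seed=0):
    rng = random.Random(seed)
    best_sse = None
    for _ in range(restarts):
        centers = rng.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(p, centers[c])))
                clusters[i].append(p)
            centers = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                       else centers[i] for i, cl in enumerate(clusters)]
        # SSE: squared distance of every point to its nearest final center
        sse = sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
                  for p in points)
        best_sse = sse if best_sse is None else min(best_sse, sse)
    return best_sse
```

On data with two well-separated groups, SSE collapses when k reaches 2 and improves only marginally afterwards, which is the signal used to choose k.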

2. Traditional clustering methods are generally suited to one particular situation and cannot handle clustering under a variety of conditions. For example, BIRCH clusters spherical clusters well but does poorly on irregular ones, while K-medoids is little affected by outliers but is computationally very expensive. How to solve this has become a research hotspot: some scholars have proposed combining different clustering ideas into new hybrid algorithms, so that the advantages of different algorithms can be exploited; using several clustering methods within a single clustering process can effectively alleviate the problem.

3. With the arrival of the information age, analyzing and processing massive amounts of data is an enormous task, and it raises a computational efficiency problem. One proposed clustering algorithm based on the minimum spanning tree obtains the clustering result by discarding the longest edges one by one; when the length of an edge exceeds a certain threshold, the edge is discarded directly without further computation, which greatly improves efficiency and reduces computational cost.
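The minimum-spanning-tree idea described above can be sketched as follows (an illustrative toy, not the specific algorithm the paragraph refers to): build an MST over the points with Prim's algorithm, discard the longest edges, and read the remaining connected components off as clusters.

```python
# MST-based clustering sketch: long MST edges bridge well-separated groups,
# so removing the (n_clusters - 1) longest edges splits the tree into clusters.
def mst_clusters(points, n_clusters):
    n = len(points)
    def dist2(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    # Prim's algorithm: grow the tree from point 0
    in_tree = {0}
    edges = []
    best = {i: (dist2(0, i), 0) for i in range(1, n)}
    while len(in_tree) < n:
        i = min(best, key=lambda k: best[k][0])
        d, parent = best.pop(i)
        in_tree.add(i)
        edges.append((d, parent, i))
        for j in best:
            dj = dist2(i, j)
            if dj < best[j][0]:
                best[j] = (dj, i)

    # keep all but the n_clusters - 1 longest edges
    edges.sort()
    keep = edges[: n - n_clusters]

    # connected components of the kept edges via union-find
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, a, b in keep:
        parent[find(a)] = find(b)
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(points[i])
    return list(comps.values())
```

The thresholding variant in the text is the same idea streamed: an edge longer than the threshold is never even completed, saving the distance computation entirely.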

4. The ability to handle large-scale and high-dimensional data needs improvement. Many current clustering methods perform well on small, low-dimensional data, but their performance drops sharply as the data size and dimensionality grow. K-medoids, for example, handles small data sets very well, but its efficiency gradually decreases as the data volume increases. In fact, most real-world data sets are large and high-dimensional. PCKA (projected clustering based on the K-means algorithm) was proposed for clustering in high-dimensional spaces: it selects the attribute-relevant dimensions, removes the irrelevant ones, and clusters the high-dimensional data along the relevant dimensions.
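The dimension-selection step of that projected-clustering idea can be sketched; note the scoring rule here (per-dimension variance) is a simplifying stand-in chosen for illustration, not the criterion PCKA itself uses:

```python
# Projected-clustering sketch, step 1: score each dimension, keep the n_keep
# most relevant ones, and project the points onto them before clustering.
def select_dims(points, n_keep):
    n, dims = len(points), len(points[0])
    def variance(d):
        vals = [p[d] for p in points]
        m = sum(vals) / n
        return sum((v - m) ** 2 for v in vals) / n
    # rank dimensions by variance and keep the top n_keep (in original order)
    ranked = sorted(range(dims), key=variance, reverse=True)
    return sorted(ranked[:n_keep])

def project(points, dims):
    return [tuple(p[d] for d in dims) for p in points]
```

Any of the clustering routines above can then run on the projected points; since distances in high dimensions are dominated by irrelevant attributes, dropping them is what makes the subsequent clustering meaningful.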

5. At present, many algorithms exist only in theory and often rest on assumptions such as well-separated clusters and the absence of prominent outliers. Real data are usually complex and very noisy, so how to effectively eliminate the influence of noise and improve the ability to handle real data needs further study.
