
Computer Applications and Software, 2003, Vol. 20, No. 2: 5-6

Comparative Research on Clustering Algorithms in Data Mining
Zhang Hongyun, Liu Xiangdong, Xiao Dong, Miao Chuqian, Ma Yuan
(School of Electronic and Information Engineering, Tongji University, Shanghai 200092, China)
(Department of Computer Science, Dalian University for Nationalities, Dalian 116600, China)
(School of Computer Science and Engineering, Anshan University of Science and Technology, Anshan 114002, China)

  Abstract: Clustering algorithms are a core technology of data mining. This paper proposes five criteria for evaluating clustering algorithms. Based on these five criteria, the clustering algorithms commonly used in data mining are compared and analyzed, so that a clustering algorithm suited to a specific problem can be found easily and quickly.
  Keywords: data mining; balanced iterative reducing clustering algorithm (BIRCH); representative-point clustering algorithm (CURE); density-based clustering algorithm (DBSCAN)


1 Introduction
Classifying the objects in a database is a basic operation of data mining. The criterion is to keep the distance between individuals of the same class as small as possible, while keeping the distance between individuals of different classes as large as possible. To find efficient and general clustering methods, many clustering algorithms have been proposed from different perspectives, typically K-means, K-medoids, CLARANS, and BIRCH. Each of these algorithms suits particular problems and users. This paper proposes five criteria for evaluating clustering algorithms and, based on these criteria, compares and analyzes the clustering methods commonly used in data mining, making it easier and faster to find a clustering algorithm suited to a specific problem and user.

2 A Framework for the Comparative Study of Clustering Algorithms in Data Mining
Clustering algorithms generally fall into two types: partitioning and hierarchical. A partitioning clustering algorithm divides the dataset into k parts by optimizing an evaluation function, and requires k as an input parameter. Typical partitioning clustering algorithms include K-means, K-medoids, and CLARANS. A hierarchical clustering consists of partitions at different levels, with a nesting relationship between the partitions of adjacent levels. It needs no input parameter, an obvious advantage over partitioning algorithms; its disadvantage is that a termination condition must be specified. Typical hierarchical clustering algorithms include BIRCH, DBSCAN, and CURE.
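To make the contrast concrete, the partitioning idea can be sketched in a few lines of Python (an illustrative sketch, not code from the paper; all names are ours). Note that the number of classes k must be supplied as an input parameter, which is exactly the trait the text attributes to partitioning methods:

```python
import math
import random

def k_means(points, k, iters=20, seed=0):
    """Minimal K-means sketch: k must be supplied up front,
    the defining trait of partitioning clustering methods."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (a center with an empty group is left where it is).
        for c, g in enumerate(groups):
            if g:
                dims = len(g[0])
                centers[c] = tuple(sum(q[d] for q in g) / len(g)
                                   for d in range(dims))
    return centers
```

A hierarchical method, by contrast, would produce the whole tree of nested partitions and only needs a termination condition.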
The comparative study of each clustering algorithm is based on the following five standards:
① Whether it suits large data volumes, i.e., whether the algorithm's efficiency meets the complexity demands of large datasets;
② Whether it can handle different data types, in particular symbolic (categorical) attributes;
③ Whether it can discover clusters of different shapes;
④ Whether it can cope with dirty or abnormal data;
⑤ Whether it is insensitive to the order of data input.
Next we will analyze and compare each clustering algorithm under this framework.

3 Comparison and Analysis of Common Clustering Algorithms in Data Mining
3.1 BIRCH Algorithm
The BIRCH algorithm is a balanced iterative reducing clustering method. Its core idea is to represent the relevant information of a cluster by a clustering feature (CF) triple, so that a cluster of points can be represented by its clustering feature rather than by the concrete set of points. The algorithm obtains a clustering by building a clustering feature tree that satisfies limits on the branching factor and the cluster diameter. From the clustering features, BIRCH can easily compute a cluster's center, radius, and diameter, as well as the distances between clusters. The clustering feature tree is a height-balanced tree with two parameters: the branching factor B and the cluster diameter T. The branching factor specifies the maximum number of children of each node in the tree, and the cluster diameter bounds how far apart points may be while still being grouped into one cluster. Each non-leaf node stores the largest keywords of its children, so insertion can be indexed by these keywords, and the node summarizes the information of its children.
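The additivity of clustering features, and the way center and radius fall out of the triple (N, LS, SS), can be sketched as follows (an illustrative Python sketch under our own naming, not code from the paper):

```python
import math

def cf_of(points):
    """Build the CF triple (N, LS, SS) of a set of points:
    N = count, LS = per-dimension linear sum, SS = sum of squared norms."""
    n = len(points)
    ls = [sum(p[d] for p in points) for d in range(len(points[0]))]
    ss = sum(x * x for p in points for x in p)
    return (n, ls, ss)

def cf_merge(cf1, cf2):
    """CF triples are additive: merging two subclusters just adds them,
    which is what makes the feature tree cheap to maintain."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2)

def cf_stats(cf):
    """Centroid and radius of a subcluster from its CF triple alone,
    without revisiting the underlying points."""
    n, ls, ss = cf
    centroid = [x / n for x in ls]
    radius = math.sqrt(max(ss / n - sum(c * c for c in centroid), 0.0))
    return centroid, radius
```

This is why a cluster's center, radius, and diameter are cheap to compute: they all follow from the three summary quantities.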
The clustering feature tree can be built dynamically, so it does not require all the data to reside in memory; data items can be read one by one from external storage. A new data item is always inserted into the leaf closest to it in the tree. If, after insertion, the leaf's diameter exceeds the cluster diameter T, the leaf node is split; the other nodes on the path must then be checked for exceeding the branching factor, splitting upward until the data item fits into a leaf without exceeding the cluster diameter and no non-leaf node has more children than the branching factor allows. The algorithm can also control the memory footprint of the feature tree by changing the cluster diameter.
The BIRCH algorithm can produce a good clustering after a single scan, so it is suitable for large data volumes. For a given M MB of memory, the space complexity is O(M) and the time complexity is O(d·N·B·log_B(M/P)), where d is the dimensionality, N is the number of nodes, P is the size of a memory page, and B is the branching factor determined by P. The I/O cost is linear in the data volume. BIRCH is only applicable to clusters with a convex or spherical distribution, and it requires the correct number of clusters and the cluster diameter limit to be supplied, which is not feasible for non-visualizable high-dimensional data.
3.2 CURE algorithm
The CURE algorithm is a clustering method based on representative points. It first treats each data point as a class, then repeatedly merges the closest classes until the required number of classes remains. CURE improves on the traditional ways of representing a class: instead of using all points, or a center and radius, it extracts a fixed number of well-scattered points from each class as the representatives of that class, and multiplies these points by an appropriate shrinking factor to move them closer to the class center. Representing a class by scattered representative points lets the extent of a class take a non-spherical shape, so non-spherical classes can be expressed; in addition, the shrinking factor reduces the effect of noise on the clustering. CURE combines random sampling with partitioning to improve the algorithm's space and time efficiency, and uses heap and k-d tree structures to speed it up further.
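The representative-point step described above can be sketched as follows, assuming farthest-point selection of the representatives and a shrinking factor alpha (the names and the selection details are our assumptions; the paper does not fix them):

```python
import math

def cure_representatives(points, n_rep=4, alpha=0.5):
    """Pick well-scattered representative points for one class and
    shrink them toward the centroid by the factor alpha."""
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    # Farthest-point selection: start from the point farthest from the
    # centroid, then repeatedly take the point farthest from all chosen reps.
    reps = [max(points, key=lambda p: math.dist(p, centroid))]
    while len(reps) < min(n_rep, len(points)):
        reps.append(max(points,
                        key=lambda p: min(math.dist(p, r) for r in reps)))
    # Shrink each representative toward the centroid; this dampens the
    # influence of outliers on the class boundary.
    return [tuple(r[d] + alpha * (centroid[d] - r[d]) for d in range(dims))
            for r in reps]
```

Because the class is described by several scattered (then shrunk) points rather than a single center and radius, elongated and other non-spherical classes can be captured.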
3.3 DBSCAN algorithm
The DBSCAN algorithm is a density-based clustering algorithm. It uses density connectivity between objects to quickly discover classes of arbitrary shape. The basic idea is: for every object in a class, the neighborhood of a given radius must contain no fewer than a given minimum number of objects. In DBSCAN, the discovery of a class rests on the fact that a class is determined by any one of its core objects. To find a class, DBSCAN takes an arbitrary object p from the object set D and retrieves all objects in D that are density-reachable from p with respect to the radius Eps and the minimum object number MinPts. If p is a core object, i.e., the Eps-neighborhood of p contains no fewer than MinPts objects, this procedure yields a class with respect to Eps and MinPts. If p is a border point, i.e., its Eps-neighborhood contains fewer than MinPts objects, p is temporarily marked as a noise point, and DBSCAN proceeds to the next object in D.
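The class-expansion process just described can be sketched with a brute-force region query (a minimal illustrative implementation; the variable names are ours, and a practical DBSCAN would use a spatial index rather than scanning all objects per query):

```python
import math

def region_query(points, i, eps):
    """Return the indices of all objects within eps of points[i]
    (brute-force stand-in for a spatial-index region query)."""
    return [j for j, q in enumerate(points)
            if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: labels are cluster ids, -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE        # border point or noise, decided later
            continue
        labels[i] = cluster          # i is a core object: grow its class
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise absorbed as border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            jn = region_query(points, j, eps)
            if len(jn) >= min_pts:   # j is itself a core object: expand
                seeds.extend(jn)
        cluster += 1
    return labels
```

Points whose label remains -1 at the end are the noise points the text mentions.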
Density-reachable objects are obtained by repeatedly executing region queries. A region query returns all objects within a specified region. To perform region queries efficiently, DBSCAN uses the R*-tree spatial index: before clustering, an R*-tree must be built for all the data. In addition, DBSCAN requires a global parameter Eps (to reduce computation, the parameter MinPts is fixed in advance). To determine Eps, DBSCAN computes the distance from every object to its k-th nearest object, sorts the obtained distances in ascending order, and plots them as the sorted k-dist graph. The abscissa of the k-dist graph represents the distance from a data object to its k-th nearest object; the ordinate represents the number of data objects corresponding to a given k-dist value. Building the R*-tree and the k-dist graph consumes a great deal of time, and to obtain a good clustering a suitable Eps value must be chosen from the k-dist graph. DBSCAN clusters the entire dataset without any preprocessing; when the dataset is very large, a large amount of memory must be available and the I/O cost is also very high. The time complexity is O(N log N) (N is the amount of data), and most of the clustering time is spent on region query operations. DBSCAN is very sensitive to the parameters Eps and MinPts, and these two parameters are difficult to determine.
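The k-dist computation described above can be sketched directly (an illustrative sketch; a production version would avoid the O(N²) brute force):

```python
import math

def k_dist(points, k):
    """Sorted k-dist values: for each object, the distance to its k-th
    nearest neighbour. The 'knee' of this sorted curve is where a
    suitable Eps value is usually read off."""
    out = []
    for i, p in enumerate(points):
        ds = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(ds[k - 1])
    return sorted(out)
```

Objects in dense regions contribute small k-dist values; outliers contribute the large values at the tail of the curve, which is why the bend in the plot suggests an Eps threshold separating the two.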
3.4 K-prototypes Algorithm
The K-prototypes algorithm combines the K-means method with the K-modes method, an improvement of K-means designed to handle symbolic attributes. Compared with K-means, K-prototypes can therefore process symbolic (categorical) attributes as well as numeric ones.
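The mixed dissimilarity measure underlying this combination can be sketched as follows, assuming squared Euclidean distance on the numeric attributes plus a weight gamma times the number of mismatched symbolic attributes (the weighting scheme and names are our assumptions):

```python
def proto_distance(a, b, num_idx, cat_idx, gamma=1.0):
    """Mixed-attribute dissimilarity sketch: squared Euclidean distance
    over the numeric attributes plus gamma times the count of
    mismatching symbolic attributes."""
    num = sum((a[i] - b[i]) ** 2 for i in num_idx)
    cat = sum(a[i] != b[i] for i in cat_idx)
    return num + gamma * cat
```

The parameter gamma balances the influence of the two attribute kinds; plain K-means is recovered when there are no symbolic attributes.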
3.5 CLARANS Algorithm
CLARANS is a clustering algorithm based on randomized search and belongs to the partitioning methods. It first selects a random node as the current node, then randomly examines up to maxneighbor of its neighbors. If a better neighbor is found, it moves to that neighbor; otherwise the current node is taken as a local minimum. It then selects another random node and searches for a further local minimum, until the number of local minima found reaches the user-specified value. The algorithm requires all clustering objects to be loaded into memory in advance and scans the dataset many times, so both its time and space complexity are quite high for large data volumes. Although the R*-tree structure has been introduced to improve its performance and allow it to handle large disk-based databases, building and maintaining the R*-tree is too costly. The algorithm is insensitive to dirty and abnormal data, but sensitive to the input order of the data objects, and it can only handle clusters with convex or spherical boundaries.
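The randomized neighbor search can be sketched as follows (a simplified illustration in which a "node" is a set of medoid indices, a "neighbor" swaps one medoid for a non-medoid, and the cost is the total distance to the nearest medoid; the parameter names numlocal and maxneighbor follow the text, everything else is our assumption):

```python
import math
import random

def cost(points, medoids):
    """Total distance from every point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)

def clarans(points, k, numlocal=2, maxneighbor=20, seed=0):
    """Simplified CLARANS sketch: numlocal random restarts, each one a
    local search that gives up after maxneighbor failed swap attempts."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(range(len(points)), k)   # random starting node
        current_cost = cost(points, current)
        failures = 0
        while failures < maxneighbor:
            # A neighboring node: swap one medoid for a random non-medoid.
            i = rng.randrange(k)
            j = rng.choice([x for x in range(len(points)) if x not in current])
            neighbor = current[:]
            neighbor[i] = j
            c = cost(points, neighbor)
            if c < current_cost:
                current, current_cost, failures = neighbor, c, 0
            else:
                failures += 1
        if current_cost < best_cost:                  # keep best local minimum
            best, best_cost = current, current_cost
    return best
```

Because every cost evaluation scans all points, the sketch makes the text's complexity remark tangible: the data must be memory-resident and is traversed many times.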
3.6 CLIQUE Algorithm
The CLIQUE method is an automatic subspace clustering algorithm. It uses a bottom-up strategy to find the clustered units of each subspace, and is mainly used to find the low-dimensional clusters hidden in a high-dimensional data space. To obtain the clusters of a d-dimensional space, the clusters of all its (d-1)-dimensional subspaces must be combined, which makes the algorithm's space and time efficiency low. Two parameters must be supplied: the equal-width interval used to partition each dimension of the data values, and the density threshold. These two parameters are closely tied to the sample data and are difficult to determine. The CLIQUE algorithm is insensitive to the data input order.
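The first, one-dimensional pass of this bottom-up strategy, finding the dense units of each single dimension given an interval width and a density threshold, can be sketched as follows (illustrative only; the combination of subspaces into higher-dimensional units is omitted, and the names are ours):

```python
from collections import Counter

def dense_units_1d(points, interval, tau):
    """First CLIQUE-style pass: partition every dimension into
    equal-width intervals and keep the units holding at least tau
    points. Higher-dimensional candidates would be built from these."""
    dims = len(points[0])
    units = {}
    for d in range(dims):
        counts = Counter(int(p[d] // interval) for p in points)
        units[d] = {u for u, c in counts.items() if c >= tau}
    return units
```

The two inputs, interval width and density threshold tau, are exactly the two hard-to-choose parameters the text criticizes.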

4 Summary
Based on the above analysis, we obtain the comparison results for the clustering algorithms, shown in Table 1.
Table 1 Comparison of clustering algorithms ("-": not stated in the source)

Algorithm    | Efficiency | Data types           | Cluster shapes found | Dirty/abnormal data | Input order
BIRCH        | high       | numeric              | convex or spherical  | -                   | insensitive
DBSCAN       | -          | numeric              | arbitrary            | sensitive           | -
CURE         | high       | numeric              | arbitrary            | insensitive         | insensitive
K-prototypes | average    | numeric and symbolic | convex or spherical  | sensitive           | -
CLARANS      | lower      | numeric              | convex or spherical  | insensitive         | sensitive
CLIQUE       | lower      | numeric              | convex or spherical  | average             | insensitive


Because each method has its own characteristics and fields of application, an appropriate clustering algorithm should be selected according to the actual needs of the data mining task at hand.

References: [Omitted]
