Comparison of four clustering methods


Cluster analysis is an important human activity [1]. As early as childhood, a person learns to distinguish between cats and dogs, or between animals and plants, by continually refining a subconscious clustering model. Clustering has been widely studied and successfully applied in many fields, such as pattern recognition, data analysis, image processing, market research, customer segmentation, and Web document classification.
Clustering divides a dataset into different classes or clusters according to a specific criterion (such as a distance criterion), so that the similarity among data objects within the same cluster is as large as possible, while the differences between data objects in different clusters are as large as possible. That is, after clustering, data of the same class are gathered together as much as possible, and data of different classes are separated as much as possible.
Clustering technology [2] is developing rapidly. Research areas that have contributed to it include data mining, statistics, machine learning, spatial database technology, biology, and marketing. Various clustering methods are constantly being proposed and improved, and different methods suit different types of data, so comparing different clustering methods and their clustering effects is a subject worth studying.
1. Classification of clustering algorithms
Currently there are a large number of clustering algorithms [3]. For a specific application, the choice of clustering algorithm depends on the type of data and the purpose of the clustering. If cluster analysis is used as a descriptive or exploratory tool, several algorithms can be tried on the same data to discover what the data may reveal.
The main clustering algorithms can be divided into the following types: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods [4-6].
Each category contains widely used algorithms, for example: the k-means clustering algorithm [7] among the partitioning methods, the agglomerative hierarchical clustering algorithm [8] among the hierarchical methods, and the neural-network clustering algorithm among the model-based methods.
Current research on clustering is not limited to the above hard clustering, in which each data object can belong to only one class; fuzzy clustering [10] is also a widely studied branch of cluster analysis. Fuzzy clustering uses a membership function to determine the degree to which each data object belongs to each cluster, rather than forcing a data object into exactly one cluster. Many fuzzy clustering algorithms have been proposed, such as the well-known FCM algorithm.
This article compares and analyzes the clustering effects of the k-means clustering algorithm, the agglomerative hierarchical clustering algorithm, the neural-network clustering algorithm SOM, and the fuzzy clustering algorithm FCM on a common test dataset.
2. Research on four common clustering algorithms
2.1 k-means clustering algorithm

K-means is one of the classic clustering algorithms among the partitioning methods. Because of its high efficiency, the algorithm is widely used for clustering large-scale data, and many algorithms have been extended and improved around it.
The k-means algorithm takes k as a parameter and divides n objects into k clusters, so that similarity within a cluster is high while similarity between clusters is low. The algorithm proceeds as follows: first, k objects are selected at random, each initially representing the mean or center of a cluster; each remaining object is assigned to the nearest cluster according to its distance from each cluster center, and the mean of each cluster is then recalculated. This process repeats until the criterion function converges. The square error criterion is generally used, defined as follows:
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2
Here E is the sum of the square errors of all objects in the database, p is a point in space, and m_i is the mean of cluster C_i [9]. This objective function makes the generated clusters as compact and as independent as possible. Euclidean distance is used as the distance measure, although other distance measures can also be used. The flow of the k-means clustering algorithm is as follows (a minimal code sketch is given after the steps):
Input: a database containing n objects, and the number of clusters k;
Output: k clusters that minimize the square error criterion.
Steps:
(1) Select k objects as the initial cluster centers;
(2) repeat;
(3) (re)assign each object to the most similar cluster, based on the mean of the objects in the cluster;
(4) update the cluster means, that is, calculate the mean of the objects in each cluster;
(5) until the cluster centers no longer change.
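The following is a minimal, self-contained NumPy sketch of this procedure (an illustrative implementation, not the MATLAB code used in the experiments below; the function name kmeans and the parameters max_iter and seed are assumptions made here):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means following steps (1)-(5); returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # (1) select k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):                      # (2) repeat
        # (3) assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (4) update each center to the mean of the objects assigned to it;
        # an empty cluster keeps its old center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):      # (5) centers stop changing
            return labels, new_centers
        centers = new_centers
    return labels, centers

For example, kmeans(X, 3) on the 150-by-4 IRIS feature matrix returns one cluster label per sample; because step (1) is random, different seeds can give different results, which matters for the analysis in Section 3.3.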
2.2 Hierarchical Clustering Algorithm
According to whether the hierarchical decomposition proceeds bottom-up or top-down, hierarchical clustering algorithms are divided into agglomerative hierarchical clustering algorithms and divisive hierarchical clustering algorithms.
The strategy of agglomerative hierarchical clustering is to first treat each object as its own cluster, and then merge these atomic clusters into larger and larger clusters until all objects lie in a single cluster or a termination condition is met. The vast majority of hierarchical clustering algorithms are of the agglomerative type; they differ only in their definition of similarity between clusters. The four widely used inter-cluster distance measures are as follows:
Minimum distance: d_{min}(C_i, C_j) = \min_{p \in C_i, q \in C_j} \lvert p - q \rvert
Maximum distance: d_{max}(C_i, C_j) = \max_{p \in C_i, q \in C_j} \lvert p - q \rvert
Mean distance: d_{mean}(C_i, C_j) = \lvert m_i - m_j \rvert
Average distance: d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} \lvert p - q \rvert

where m_i and n_i denote the mean and the number of objects of cluster C_i.
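For concreteness, here is a small NumPy helper (the name cluster_distances is hypothetical) that evaluates all four measures for two clusters given as arrays of points:

import numpy as np

def cluster_distances(Ci, Cj):
    """Return the four inter-cluster distances between point sets Ci and Cj."""
    # pairwise Euclidean distances between all points of the two clusters
    D = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)
    return {
        "minimum": D.min(),                                   # single linkage
        "maximum": D.max(),                                   # complete linkage
        "mean": np.linalg.norm(Ci.mean(0) - Cj.mean(0)),      # distance of means
        "average": D.mean(),                                  # average of all pairs
    }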
The process of the agglomerative hierarchical clustering algorithm using the minimum distance is as follows (a code sketch is given after the steps):
(1) Treat each object as a class, and compute the distance between every pair of objects;
(2) merge the two classes with the smallest distance into a new class;
(3) recompute the distances between the new class and all existing classes;
(4) repeat (2) and (3) until all classes are merged into a single class.
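A short sketch of this minimum-distance (single-linkage) agglomerative clustering using SciPy's standard linkage and fcluster routines; the toy data matrix X is an assumption made purely for illustration:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).random((20, 4))  # toy data: 20 objects, 4 features

# Steps (1)-(4): linkage() repeatedly merges the two closest clusters;
# method="single" is the minimum-distance (single-linkage) measure.
Z = linkage(X, method="single", metric="euclidean")

# Cut the resulting dendrogram to obtain a flat clustering with 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)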
2.3 SOM Clustering Algorithm
The SOM neural network [11] was proposed by Professor Kohonen, a Finnish neural network expert. The algorithm assumes that some topological structure or order exists among the input objects, and it realizes a dimension-reducing mapping from the input space (n-dimensional) to the output plane (2-dimensional). The mapping preserves topological features and has a strong theoretical connection to actual brain processing.
The SOM network consists of an input layer and an output layer. The input layer corresponds to the high-dimensional input vector, while the output layer consists of a series of ordered nodes on a two-dimensional lattice; input nodes and output nodes are connected by weight vectors.
During learning, the output-layer unit whose weight vector is closest to the input, called the winning unit, is found and updated; at the same time, the weights in its neighborhood are updated so that the output nodes preserve the topological features of the input vectors.
Algorithm flow (a minimal code sketch follows the steps):
(1) initialize the network, assigning an initial value to the weight vector of each output-layer node;
(2) randomly select an input vector from the input samples and find the weight vector with the smallest distance to it;
(3) define the winning unit and adjust the weights in the neighborhood of the winning unit to bring them closer to the input vector;
(4) provide a new sample and continue training;
(5) shrink the neighborhood radius and reduce the learning rate; repeat until they fall below the allowed threshold, then output the clustering result.
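Below is a compact NumPy sketch of this training loop, assuming a small two-dimensional output grid, a Gaussian neighborhood, and linearly decaying learning rate and radius (one common choice of schedule, not the only one; all parameter names here are assumptions):

import numpy as np

def train_som(X, grid=(5, 5), epochs=20, lr0=0.5, seed=0):
    """Train a SOM; returns the weight matrix W (one row per output node)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # (1) initialize the weight vector of every output node randomly
    W = rng.random((rows * cols, X.shape[1]))
    # grid coordinates of the output nodes, used by the neighborhood function
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    sigma0 = max(rows, cols) / 2.0
    total, step = epochs * len(X), 0
    for _ in range(epochs):
        for x in rng.permutation(X):          # (2) present inputs in random order
            frac = step / total
            lr = lr0 * (1.0 - frac)                # (5) decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-2   # (5) shrinking neighborhood radius
            # (2)-(3) winning unit: node whose weight vector is closest to x
            bmu = int(np.argmin(np.linalg.norm(W - x, axis=1)))
            # (3) pull the winner and its grid neighbors toward the input
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            W += lr * h[:, None] * (x - W)
            step += 1
    return W

After training, each sample can be assigned to the cluster of its winning unit, for example labels = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1).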
2.4 FCM clustering algorithm
In 1965, Professor Zadeh of the University of California, Berkeley first proposed the concept of the 'fuzzy set'. After more than a decade of development, fuzzy set theory was gradually applied to many practical problems. To overcome the rigid either/or nature of hard classification, cluster analysis based on fuzzy set theory emerged; cluster analysis carried out with the methods of fuzzy mathematics is called fuzzy cluster analysis [12].
The FCM algorithm determines, through a membership degree, the extent to which each data point belongs to each cluster; it is an improvement on traditional hard clustering algorithms.

Algorithm flow (a minimal code sketch follows the steps):
(1) standardize the data matrix;
(2) build a fuzzy similarity matrix and initialize the membership matrix;
(3) iterate until the objective function converges to a minimum;
(4) based on the result of the iteration, determine the class to which each data point belongs from the final membership matrix, and display the final clustering result.
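A minimal NumPy sketch of the FCM iteration follows, assuming the standard objective J_m = \sum_i \sum_j u_{ij}^m \lvert x_i - c_j \rvert^2 with fuzzifier m = 2 (the text does not state its objective explicitly; this is an illustrative implementation, not the code used in the experiments):

import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy c-means: returns (membership matrix U, cluster centers)."""
    rng = np.random.default_rng(seed)
    # (2) initialize the membership matrix with rows summing to 1
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # update the cluster centers as membership-weighted means
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # distances from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # (3) update memberships: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
        if np.linalg.norm(U_new - U) < tol:  # stop when memberships converge
            return U_new, centers
        U = U_new
    return U, centers

Step (4) then hardens the result, e.g. labels = U.argmax(axis=1) assigns each point to the cluster with the largest membership degree.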
3. Experiments on the four clustering algorithms
3.1 Test Data

In this experiment, we select the IRIS dataset [13] from the public UCI database, which is commonly used to test classification and clustering algorithms. The IRIS dataset contains 150 samples taken from three different species of iris flowers: setosa, versicolor, and virginica. Each sample has four attributes: sepal length, sepal width, petal length, and petal width.
Running the different clustering algorithms on this dataset yields clustering results of different accuracy.
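As a rough illustration of such a run, the sketch below loads IRIS and times two of the four methods with scikit-learn (the original experiments used MATLAB, so timings and error counts will differ; SOM and FCM can be run analogously with the sketches given in Section 2):

import time
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering

X, y = load_iris(return_X_y=True)  # 150 samples, 4 attributes, 3 true classes

for name, model in [
    ("k-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("agglomerative", AgglomerativeClustering(n_clusters=3, linkage="single")),
]:
    t0 = time.time()
    labels = model.fit_predict(X)   # cluster labels for every sample
    print(name, "took", round(time.time() - t0, 4), "s")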
3.2 Test Results
Following the principles and flows of the algorithms above, the experiments were programmed in MATLAB, yielding the clustering results shown in Table 1.
[Table 1: clustering results of the four algorithms on the IRIS dataset; the table itself is not reproduced in the source]
As shown in Table 1, the four clustering algorithms are compared in three respects: (1) number of mis-clustered samples: the total number of samples clustered incorrectly, i.e., the sum of the errors of every kind; (2) running time: the time consumed by the whole clustering process, in seconds; (3) average accuracy: suppose the original dataset has k classes, let c_i denote class i and n_i the number of samples in c_i; if m_i is the number of samples correctly clustered into class i, then m_i/n_i is the accuracy of class i, and the average accuracy is:
avg = \frac{1}{k} \sum_{i=1}^{k} \frac{m_i}{n_i}
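A short sketch of this computation, assuming integer class labels and the common convention of mapping each predicted cluster to the true class that dominates it (the matching rule is not spelled out in the text, so this is one reasonable choice):

import numpy as np

def average_accuracy(y_true, y_pred, k):
    """Mean over classes of m_i / n_i, after mapping clusters to majority classes."""
    # map every predicted cluster to the true class most frequent inside it
    mapping = {c: np.bincount(y_true[y_pred == c]).argmax()
               for c in np.unique(y_pred)}
    y_mapped = np.array([mapping[c] for c in y_pred])
    accs = []
    for i in range(k):
        mask = y_true == i                          # class i has n_i samples
        accs.append(np.mean(y_mapped[mask] == i))   # m_i / n_i
    return float(np.mean(accs))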
3.3 Test Result Analysis

K-means and FCM are superior to the other clustering algorithms in running time and accuracy. However, each algorithm still has inherent drawbacks: the initial points of the k-means clustering algorithm are selected at random, so its clustering results are unstable; although this experiment averaged over multiple runs, how to select good initial points still needs further study.
The hierarchical clustering algorithm does not require the number of classes to be determined in advance, but once a split or merge has been executed it cannot be corrected, which restricts the clustering quality. FCM is sensitive to the initial cluster centers, requires the number of clusters to be determined manually, and easily falls into a local optimum.
SOM has a strong theoretical connection to actual brain processing, but its processing time is long, and further research is needed to adapt it to large databases.
Owing to its successful application in many fields, cluster analysis shows attractive application prospects. Besides the classic clustering algorithms, various new clustering methods are constantly being proposed.
References
[1] Han J W, Kamber M. Data Mining: Concepts and Techniques [M]. Fan Ming, Meng Xiaofeng, trans. Beijing: China Machine Press, 2001.
[2] Yang Xiaobing. Research on several key technologies in cluster analysis [D]. Hangzhou: Zhejiang University, 2005.
[3] Xu R, Wunsch D. Survey of clustering algorithms [J]. IEEE Transactions on Neural Networks, 2005, 16(3): 645-678.
[4] Hong Y, Kwong S. Learning assignment order of instances for the constrained k-means clustering algorithm [J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2009, 39(2): 568-574.
[5] He Ling, Wu Lingda, Cai Yichao. A survey of clustering algorithms in data mining [J]. Application Research of Computers, 2007, 24(1): 10-13.
[6] Sun Jigui, Liu Jie, Zhao Lianyu. Clustering algorithms research [J]. Journal of Software, 2008, 19(1): 48-61.
[7] Kong Yinghui, Yuan Jinsha, Zhang Tiefeng, et al. Research on distribution and load classification based on data stream management technology [C]. China International Conference on Electricity Distribution (CICED), 2006.
[8] Ma Xiaoyan, Tang Yan. Study on hierarchical clustering algorithms [J]. Computer Science, (7): 34-36.
[9] Wang Haibo, Zhang Haichen, Duan Xueli. Study on self-organizing competitive neural network clustering based on MATLAB [J]. Journal of Xingtai Vocational and Technical College, (1): 45-47.
[10] Lu Xiaoyan, Luo Limin, Li Xiangsheng. Improvement of the FCM algorithm and simulation experiments [J]. Computer Engineering and Applications, 144(20): 147.
[11] Li Ge, Shao Fengjing, Zhu Benhao. Study on clustering based on neural networks [J]. Journal of Qingdao University, 2001, 16(4): 21-24.
[12] Ge Guohua, Xiao Haibo, Zhang Min. Data clustering analysis based on FCM and its MATLAB implementation [J]. Fujian Computer, 2007(4): 89-90.
[13] Fisher R A. Iris Plants Database [DB/OL]. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Source: http://www.chinaaet.com/article/index.aspx?id=79936
