Comparison of Various Clustering Algorithms (Reprint)

Source: Internet
Author: User

The goal of clustering is to make objects in the same cluster as similar as possible and objects in different clusters as dissimilar as possible. Many clustering methods exist. According to their basic ideas, clustering algorithms can be divided into five categories: hierarchical clustering algorithms, partitioning (segmentation) clustering algorithms, constraint-based clustering algorithms, machine learning clustering algorithms, and high-dimensional clustering algorithms. The following is a summary of research on clustering analysis in data mining.

1. Hierarchical Clustering Algorithm

1.1 Agglomerative clustering

1.1.1 Similarity based on distance: single-link (nearest distance), complete-link (farthest distance), average-link (average distance)

1.1.2 Most representative algorithms

1) CURE algorithm

Features: A fixed number of representative points jointly represent each cluster

Advantages: Identifies clusters of complex shapes and different sizes; filters out isolated points

2) ROCK algorithm

Features: An improvement on the CURE algorithm

Advantages: Same as above, and also applicable to data with categorical attributes

3) CHAMELEON algorithm

Features: Uses dynamic modeling
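The three distance-based linkage criteria listed in 1.1.1 can be sketched in a few lines. This is a minimal illustration rather than an implementation of CURE, ROCK, or CHAMELEON: it assumes points are numeric tuples compared with Euclidean distance, and the function names (`single_link`, `complete_link`, `average_link`, `agglomerate`) are chosen here for illustration.

```python
import math
from itertools import product


def single_link(a, b):
    """Nearest distance between any point of cluster a and any point of cluster b."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_link(a, b):
    """Farthest distance between any point of cluster a and any point of cluster b."""
    return max(math.dist(p, q) for p, q in product(a, b))


def average_link(a, b):
    """Average distance over all cross-cluster point pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def agglomerate(points, k, linkage=single_link):
    """Naive agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters[j])  # a merge is never undone (no backtracking)
        del clusters[j]
    return clusters


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
    print(agglomerate(pts, k=2))
```

Recomputing every pairwise linkage in each round keeps the sketch short but makes it slow for large inputs, which illustrates the long execution time usually attributed to hierarchical methods.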

1.2 Divisive clustering

1.3 Advantages and disadvantages

Advantages: Applicable to data sets of arbitrary shape and arbitrary attribute types; flexible control of cluster granularity at different levels; strong clustering capability

Disadvantages: Greatly increases the execution time of the algorithm; cannot backtrack

2. Partitioning Clustering Algorithm

2.1 Density-based clustering

2.1.1 Features

Adjacent regions of high density are connected, and anomalous data can be handled effectively. These methods are mainly used for clustering spatial data.

2.1.2 Typical algorithms

1) DBSCAN: Keeps growing regions of sufficiently high density

2) DENCLUE: Clusters based on the density distribution of the data points in the attribute space; combines density-based and grid-based processing

3) OPTICS, DBCLASD, CURD: Improvements on DBSCAN for data whose density varies across the space
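The idea of growing sufficiently dense regions can be sketched as a simplified DBSCAN. This is a minimal version under stated assumptions: Euclidean distance, a brute-force neighbor search, and parameter names (`eps`, `min_pts`) that follow common usage rather than anything given in this article.

```python
import math

NOISE = -1  # label for points that belong to no dense region


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep growing the region
                queue.extend(j_neighbors)
        cluster += 1
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(pts, eps=1.5, min_pts=3))
```

Because a single `eps` is applied everywhere, the sketch also shows why data with varying density is difficult for plain DBSCAN.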

2.2 Grid-based clustering

2.2.1 Features

Using a multidimensional grid data structure over the attribute space, the space is divided into a finite number of cells to form a grid structure.

1) Advantages: Processing time is independent of the number of data objects and of the input order of the data; can handle any type of data

2) Disadvantages: Processing time depends on the number of cells in each dimension, which reduces the quality and accuracy of the clusters to some extent

2.2.2 Typical algorithms

1) STING: Grid-based and multiresolution; divides the space into rectangular cells that correspond to different resolutions

2) STING+: An improved STING for handling dynamically evolving spatial data

3) CLIQUE: Combines the ideas of grid-based and density-based clustering; can handle large-scale high-dimensional data

4) WaveCluster: Based on ideas from signal processing
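The general grid idea (divide the space into cells, then work on cells instead of points) can be sketched as follows. This is not STING, CLIQUE, or WaveCluster; it is a simplified illustration that assumes a single uniform cell size and a hypothetical density threshold `min_count`, and that treats touching dense cells as one cluster.

```python
from collections import defaultdict
from itertools import product


def grid_cluster(points, cell_size, min_count):
    """Map each point that lies in a dense cell to a cluster label."""
    # 1. Assign every point to the grid cell that contains it.
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(c // cell_size) for c in p)
        cells[key].append(p)

    # 2. Keep only cells with enough points.
    dense = {key for key, members in cells.items() if len(members) >= min_count}

    # 3. Flood-fill over adjacent dense cells; each connected group is one cluster.
    cell_label = {}
    cluster = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = cluster
        stack = [start]
        while stack:
            cell = stack.pop()
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbor = tuple(c + o for c, o in zip(cell, offset))
                if neighbor in dense and neighbor not in cell_label:
                    cell_label[neighbor] = cluster
                    stack.append(neighbor)
        cluster += 1

    # Points in sparse cells are left out of the result.
    return {p: cell_label[key] for key in dense for p in cells[key]}


if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.5, 0.5),
           (5.5, 5.5), (5.6, 5.7), (9.0, 9.0)]
    print(grid_cluster(pts, cell_size=1.0, min_count=2))
```

After the single pass that bins the points, all remaining work is done on cells, so the cost of that part depends on the number of occupied cells rather than the number of points. The result also changes with `cell_size`, which reflects the loss of accuracy noted above.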

2.3 Clustering based on graph theory

2.3.1 Features

The clustering problem is converted into a combinatorial optimization problem and solved with graph theory and related heuristic algorithms: construct the minimum spanning tree of the data set, then delete the longest edges one by one.

1) Advantages: No calculation of similarity is required

2.3.2 Main forms of application

1) Hypergraph-based partitioning

2) Spectral graph partitioning
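The minimum-spanning-tree procedure described in 2.3.1 can be sketched as below. The sketch assumes Euclidean distances as edge weights on a complete graph and a target number of clusters `k`; both are assumptions made here for illustration, and the function names are hypothetical.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, u, v) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n  # (distance to the tree, nearest tree vertex)
    if n:
        best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] != -1:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges


def mst_clusters(points, k):
    """Delete the k-1 longest MST edges; the remaining components are the clusters."""
    edges = sorted(mst_edges(points))        # ascending by weight
    kept = edges[:len(edges) - (k - 1)]      # assumes 1 <= k <= len(points)

    parent = list(range(len(points)))        # union-find over the kept edges

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
    print(mst_clusters(pts, k=2))
```

Labels are numbered in order of first appearance, so the first point always receives label 0.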

2.4 Iterative relocation clustering based on squared error

2.4.1 Idea

The clustering result is optimized step by step: the target data set is repeatedly reassigned to the cluster centers to obtain the optimal solution.

2.4.2 Specific algorithms

1) Probabilistic clustering algorithm

Uses expectation maximization; can handle heterogeneous data and records with complex structure; can process data in successive batches; supports online processing; produces clustering results that are easy to interpret

2) Nearest-neighbor clustering algorithm: the shared nearest neighbor (SNN) algorithm

Features: Combines the density-based approach with the idea of ROCK; keeps the K nearest neighbors to simplify the similarity matrix and the number

Disadvantage: Time complexity increases to O(n^2)

3) K-medoids algorithm

Features: Uses an actual point in the cluster to represent the cluster

Advantages: Can handle attributes of any type; not sensitive to anomalous data

4) K-means algorithm

(1) Features: The cluster center is the mean of all data in each cluster

(2) Defects of the original K-means algorithm: The result depends on the selection of the initial cluster centers; it easily falls into a local optimum; there is no criterion to follow for choosing the value of K; it is sensitive to anomalous data; it can only handle numerical attributes; the resulting clusters may be unbalanced

(3) Variants of K-means

Bradley and Fayyad et al.: Reduce the dependence on the initial centers; applicable to large-scale data sets

Dhillon: Adjusts the recomputation method during iteration to improve performance

Zhang et al.: An iterative optimization process with weighted soft assignment

Sarafis: Applies a genetic algorithm to the construction of the objective function

Berkhin et al.: Extends the application to distributed clustering

Others: Use graph-theoretic concepts to balance the clustering results; the objective function of the original algorithm corresponds to an isotropic Gaussian mixture model

5) Advantages and disadvantages

Advantages: The most widely used; converges quickly; can be scaled to large data sets

Disadvantages: Tends to identify clusters that are convex and of similar size and density; the choice of centers and the presence of noise have a large influence on the result
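The iterative reassignment described in 2.4.1 can be sketched with the basic K-means loop (often called Lloyd's algorithm). This is a minimal version under stated assumptions: numeric tuples, Euclidean distance, random initial centers drawn from the data, and a hypothetical `seed` parameter to make runs repeatable. It is not any of the variants listed above.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Return (labels, centers, sse) after alternating assignment and mean update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # the result depends on this initial choice
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # no point changed cluster: converged (possibly to a local optimum)
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    # Squared-error objective that the iteration tries to reduce.
    sse = sum(math.dist(p, centers[label]) ** 2 for p, label in zip(points, labels))
    return labels, centers, sse


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers, sse = kmeans(pts, k=2)
    print(labels, centers, sse)
```

Running the sketch with different `seed` values on less clearly separated data is a simple way to observe the dependence on the initial centers, and because each center is a mean, a single far-away point can pull a center noticeably.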

3. Constraint-based Clustering Algorithm

3.1 Constraints

Constraints on individual objects and constraints on clustering parameters; both come from experience and knowledge of the relevant domain

3.2 Important applications

Clustering data in a two-dimensional space that contains obstacles, for example COD (Clustering with Obstructed Distance), which replaces the usual Euclidean distance with the obstructed distance between two points

3.3 Disadvantages

Typically addresses only specific requirements in a specific application area

4. Clustering Algorithm for High-dimensional Data

4.1 Sources of difficulty

1) Irrelevant attributes cause the data to lose its clustering tendency

2) The boundaries between clusters become blurred

4.2 Solutions

1) Dimensionality reduction of the raw data

2) Subspace clustering

CACTUS: Projects the original space onto two-dimensional planes

CLIQUE: Combines the ideas of density-based and grid-based clustering; uses the Apriori algorithm

3) Co-clustering techniques

Features: Clusters data points and attributes at the same time

Text: An algebraic method based on a bipartite graph and its minimum cut

4.3 Disadvantages: Inevitably causes a loss of original data information and a reduction in clustering accuracy

5. Clustering Algorithms in Machine Learning

5.1 Two methods

1) Artificial neural network methods

Self-organizing map: vector quantization, incremental processing; maps to a two-dimensional plane for visualization

Artificial neural network clustering based on projective adaptive resonance theory

2) Methods based on evolutionary theory

Disadvantages: Depend on the selection of some empirical parameters and have high computational complexity

Simulated annealing: perturbation factor; genetic algorithm (selection, crossover, mutation)

5.2 Advantages and disadvantages

Advantages: Use suitable heuristic algorithms to obtain high-quality clustering results

Disadvantages: High computational complexity; results depend on the selection of some empirical parameters

The following is a personal understanding of the selection and comparison of clustering algorithms:
