Clustering--Introduction
Main contents: a brief introduction to common clustering methods.
The clustering methods covered are: hierarchical clustering, grid-based clustering, density-based clustering, graph-theoretic clustering, distance-based clustering, grey clustering, fuzzy equivalence relation clustering, and Web clustering based on keyword search.
1. Hierarchical Clustering Algorithms
1.1 Agglomerative clustering
1.1.1 Similarity varies according to the distance used (a linkage sketch follows at the end of section 1):
Single-link: nearest distance; Complete-link: farthest distance; Average-link: average distance
1.1.2 Most representative algorithms
1) CURE algorithm
Features: represents each cluster by a fixed number of representative points
Advantages: identifies clusters of complex shapes and different sizes; filters out isolated points (outliers)
2) ROCK algorithm
Features: an improvement on the CURE algorithm
Advantages: same as above, plus support for data with categorical attributes
3) CHAMELEON algorithm
Features: uses dynamic modeling techniques
1.2 Divisive clustering
1.3 Advantages and disadvantages
Advantages: works on data sets of arbitrary shape and arbitrary attribute types; flexibly controls cluster granularity at different levels; strong clustering ability
Disadvantages: greatly lengthens the execution time of the algorithm; merge/split decisions cannot be backtracked
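To make 1.1.1 concrete, here is a minimal sketch of agglomerative clustering under the three linkage criteria, using SciPy; the toy two-blob data and the cut into two clusters are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data: two loose blobs in 2-D.
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

for method in ("single", "complete", "average"):     # the three linkages in 1.1.1
    Z = linkage(X, method=method)                    # merge history, n-1 rows
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

On well-separated blobs all three linkages agree; they differ on elongated or noisy shapes, where single-link tends to chain and complete-link tends to break clusters apart.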
2. Partitional Clustering Algorithms
2.1 Density-based clustering
2.1.1 Features
Connects adjacent regions of high density; handles anomalous data effectively; mainly used for clustering spatial data.
2.1.2 Typical algorithms
1) DBSCAN: grows regions of sufficiently high density into clusters (see the sketch after this list)
2) DENCLUE: clusters according to the density of data points in attribute space; combines density-based and grid-based processing
3) OPTICS, DBCLASD, CURD: improvements on DBSCAN for data whose density varies across the space
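A minimal DBSCAN example with scikit-learn; the toy data and the eps/min_samples values are illustrative assumptions, not tuned recommendations. Points in sparse regions come back labeled -1, which is the outlier handling mentioned in 2.1.1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),    # dense blob 1
               rng.normal(4, 0.3, (50, 2)),    # dense blob 2
               rng.uniform(-2, 6, (10, 2))])   # scattered noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters:", sorted(set(labels) - {-1}),
      "noise points:", int((labels == -1).sum()))
```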
2.2 Grid-based clustering
2.2.1 Features
Uses a multidimensional grid data structure over the attribute space: the space is divided into a finite number of cells that form a grid structure (a toy sketch follows at the end of 2.2.2).
1) Advantages: processing time is independent of the number of data objects and of their input order; can handle any type of data
2) Disadvantages: processing time depends on the number of cells per dimension, and the grid quantization reduces cluster quality and accuracy to some extent
2.2.2 Typical algorithms
1) STING: multi-resolution grid; divides the space into rectangular cells, with different levels corresponding to different resolutions
2) STING+: improved STING for handling dynamically evolving spatial data
3) CLIQUE: combines grid-based and density-based ideas; can handle large-scale, high-dimensional data
4) WaveCluster: based on ideas from signal processing
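A toy grid-based sketch in the spirit of this family (not a faithful STING or CLIQUE implementation): bin the points into cells, keep cells above a density threshold, and merge adjacent dense cells into clusters. Grid size and threshold are made-up illustrative values.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 0.4, (200, 2)), rng.normal(4, 0.4, (200, 2))])

bins = 20                                        # cells per dimension
hist, xe, ye = np.histogram2d(X[:, 0], X[:, 1], bins=bins)
dense = hist >= 5                                # density threshold per cell
cells, n_clusters = ndimage.label(dense)         # merge adjacent dense cells
print("grid clusters:", n_clusters)

# Assign each point the label of its cell (0 = sparse cell, i.e. noise).
ix = np.clip(np.digitize(X[:, 0], xe) - 1, 0, bins - 1)
iy = np.clip(np.digitize(X[:, 1], ye) - 1, 0, bins - 1)
labels = cells[ix, iy]
```

Note how the clustering step works on the bins-by-bins grid rather than on the points themselves, which is the source of the "independent of the number of data objects" advantage in 2.2.1.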
2.3 Graph-theoretic clustering
2.3.1 Features
Transforms clustering into a combinatorial optimization problem solved with graph theory and related heuristic algorithms: construct the minimum spanning tree of the data set, then progressively delete the longest edges (see the sketch after 2.3.2)
1) Advantages: No calculation of similarity is required
2.3.2 Two main forms of application
1) Based on hypergraph partitioning
2) Based on spectral graph partitioning
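A minimal sketch of the MST idea from 2.3.1: build the minimum spanning tree over pairwise distances, cut the k-1 longest tree edges, and read the k clusters off the connected components. The toy data and the hand-picked k are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

D = squareform(pdist(X))                    # full pairwise distance matrix
mst = minimum_spanning_tree(D).toarray()    # n-1 tree edges; zeros elsewhere

k = 2                                       # desired number of clusters
longest = np.sort(mst[mst > 0])[-(k - 1):]  # the k-1 heaviest tree edges
mst[np.isin(mst, longest)] = 0              # "delete the longest edges"

n_comp, labels = connected_components(mst, directed=False)
print(n_comp, np.bincount(labels))          # k components and their sizes
```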
2.4 Iterative relocation clustering based on squared error
2.4.1 Idea
Gradually optimizes the clustering result: the target data set is repeatedly redistributed among the cluster centers to obtain the optimal solution.
2.4.2 Specific Algorithms
1) Probabilistic clustering algorithms
Based on expectation maximization (EM); can handle heterogeneous data and records with complex structure; can process data in successive batches; has online processing ability; produces clustering results that are easy to interpret (a mixture-model sketch follows below)
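A minimal sketch of this family using scikit-learn's Gaussian mixture model, which is fit by EM; the two-blob toy data is an illustrative assumption. The soft responsibilities from predict_proba are what make the results easy to interpret probabilistically.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
hard = gmm.predict(X)           # hard cluster assignments
soft = gmm.predict_proba(X)     # per-point membership probabilities
print(np.bincount(hard), soft[:2].round(3))
```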
2) Nearest-neighbor clustering: the shared nearest neighbor (SNN) algorithm
Features: builds on density-based methods and the ROCK idea; keeps only each point's k nearest neighbors, which sparsifies the similarity matrix and shrinks the problem size (see the sketch below)
Drawback: time complexity rises to O(n^2)
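A toy sketch of the SNN similarity itself (the core idea, not the full clustering pipeline): keep each point's k nearest neighbors and score a pair of points by how many neighbors they share. k and the data are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
k = 10

nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)                  # row i: point i itself + its k NNs
neighbors = [set(row[1:]) for row in idx]  # drop self, keep the k neighbors

def snn_similarity(i, j):
    """Number of k-nearest neighbors shared by points i and j."""
    return len(neighbors[i] & neighbors[j])

print(snn_similarity(0, 1))
```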
3) K-medoids algorithm
Features: uses an actual point within each cluster (the medoid) to represent it
Advantages: can handle any type of attribute; not sensitive to anomalous data
4) K-means algorithm
(1) Features: represents each cluster center by the mean of all data points in that cluster
(2) Defects of the original K-means algorithm: the result depends on the selection of the initial cluster centers; it easily falls into local optima; there is no criterion to follow when choosing K; it is rather sensitive to anomalous data; it can only handle data with numerical attributes; the clustering structure may be unbalanced
(3) Variants of K-means:
Bradley and Fayyad et al.: reduce the reliance on the initial centers; applicable to large-scale data sets
Dhillon et al.: recompute during the iterative adjustment process, improving performance
Zhang et al.: iterative optimization with soft weight-based assignment adjustment
Sarafis: applies genetic algorithms to the construction of the objective function
Berkhin et al.: extends the application to distributed clustering
Others: use graph-theoretic concepts to balance the clustering results; the objective function of the original algorithm corresponds to an isotropic Gaussian mixture model
5) Pros and cons
Advantages: the most widely used; fast convergence; scalable to large data sets (see the example below)
Disadvantages: tends to identify clusters with convex shapes and similar sizes and densities; the choice of initial centers and the presence of noise strongly influence the results
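A short K-means example with scikit-learn; the data is a toy assumption. k-means++ initialization plus multiple restarts (n_init) is the standard mitigation for the initialization sensitivity listed under the defects in (2).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(2))   # each center is the mean of its cluster
print(np.bincount(km.labels_))        # cluster sizes
```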
3. Constraint-based Clustering Algorithms
3.1 Constraints
Constraints on individual objects and constraints on clustering parameters, both derived from experience and knowledge of the relevant domain (a constrained-assignment sketch follows at the end of this section)
3.2 Important Applications
Clustering two-dimensional spatial data in the presence of obstacles, e.g. COD (Clustering with Obstructed Distance): replaces ordinary Euclidean distance with the obstructed distance between two points
3.3 Shortcomings
These methods typically address only specific requirements in a specific application area
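As a sketch of constraints on individual objects, here is a greedy constrained assignment step in the style of COP-KMeans (a known constrained variant of K-means, not an algorithm named in these notes): a point may not join a cluster that would violate a must-link or cannot-link constraint. Data, centers, and constraints are all hypothetical.

```python
import numpy as np

def violates(i, c, labels, must_link, cannot_link):
    """True if assigning point i to cluster c breaks a constraint."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] not in (-1, c):   # partner already elsewhere
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            if labels[other] == c:             # forbidden partner already in c
                return True
    return False

def constrained_assign(X, centers, must_link, cannot_link):
    """Greedy constrained assignment (one half of a COP-KMeans iteration)."""
    labels = -np.ones(len(X), dtype=int)
    for i in range(len(X)):
        # Try clusters from nearest to farthest; take the first legal one.
        for c in np.argsort(np.linalg.norm(X[i] - centers, axis=1)):
            if not violates(i, c, labels, must_link, cannot_link):
                labels[i] = c
                break
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
centers = X[[0, 10]]   # two arbitrary initial centers
print(constrained_assign(X, centers, must_link=[(0, 1)], cannot_link=[(0, 10)]))
```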
4. Clustering Algorithms for High-Dimensional Data
4.1 Sources of difficulty
1) Irrelevant attributes cause the data to lose its clustering tendency
2) The boundaries between clusters become blurred
4.2 Solutions
1) Dimensionality reduction of the raw data (see the sketch at the end of this section)
2) Subspace clustering
CACTUS: works on projections of the original space onto two-dimensional planes
CLIQUE: combines density-based and grid-based clustering ideas and uses the Apriori algorithm
3) Co-clustering techniques
Features: cluster data points and attributes at the same time
For text: an algebraic method based on the bipartite graph and its minimum cut
4.3 Shortcoming: these approaches inevitably lose information from the raw data and reduce clustering accuracy
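A sketch of solution 1): reduce dimensionality first (PCA here, one common choice not specified by the notes), then cluster in the low-dimensional space. The toy data hides a two-blob structure in the first two of 50 dimensions; the 4.3 caveat applies, since the projection discards information.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
mean2 = np.zeros(50)
mean2[:2] = 5                         # clusters differ only in 2 of 50 dims
X = np.vstack([rng.normal(np.zeros(50), 1, (100, 50)),
               rng.normal(mean2, 1, (100, 50))])

X_low = PCA(n_components=2).fit_transform(X)   # dimensionality reduction
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
print(np.bincount(labels))
```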
5. Clustering Algorithms in Machine Learning
5.1 Two methods
1) Artificial neural network methods
Self-organizing maps (SOM): vector quantization with incremental processing; map the data onto a two-dimensional plane for visualization (a minimal SOM sketch follows after this list)
Artificial neural network clustering based on projective adaptive resonance theory (PART)
2) Methods based on evolutionary theory
Defects: depend on the selection of some empirical parameters and have high computational complexity
Simulated annealing: perturbation factors; genetic algorithms: selection, crossover, and mutation
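A very small SOM sketch (online training only): each sample pulls its best-matching unit and that unit's grid neighbors toward it, producing the 2-D map used for visualization. Grid size, learning rate, and neighborhood schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # toy 3-D inputs

grid = 8
W = rng.normal(size=(grid, grid, 3))     # one weight vector per map node
gy, gx = np.mgrid[0:grid, 0:grid]        # node coordinates on the 2-D grid

for t, x in enumerate(X):
    lr = 0.5 * np.exp(-t / len(X))       # decaying learning rate
    sigma = 3.0 * np.exp(-t / len(X))    # decaying neighborhood width
    d = ((W - x) ** 2).sum(axis=2)       # distance of x to every node
    by, bx = np.unravel_index(d.argmin(), d.shape)   # best-matching unit
    h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
    W += lr * h[:, :, None] * (x - W)    # pull BMU and neighbors toward x

print(W.shape)   # trained 8x8 map of 3-D prototypes
```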
5.2 Advantages and disadvantages
Advantages: the corresponding heuristic algorithms can obtain high-quality clustering results
Disadvantages: high computational complexity; results depend on the selection of some empirical parameters
Clustering--Summary