A survey of grid clustering algorithms
(1)STING
STING (STatistical INformation Grid) is a grid-based, multi-resolution clustering technique that divides the spatial area into rectangular cells. There are usually several levels of rectangular cells, corresponding to different levels of resolution, and these cells form a hierarchy: each cell at a higher level is partitioned into several cells at the next lower level. Statistics on the attributes in each grid cell, such as the mean, maximum, and minimum, are pre-computed and stored. These statistics are then used for query processing.
STING has several advantages: (1) The grid-based computation is query-independent, because the statistics stored in each cell summarize the data in that cell and do not depend on the query. (2) The grid structure lends itself to parallel processing and incremental updating. (3) The method is efficient: STING scans the database once to compute the statistics of the cells, so the time complexity of generating clusters is O(n), where n is the number of objects. After the hierarchy is built, the query processing time is O(g), where g is the number of grid cells at the lowest level, which is usually much smaller than n.
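The hierarchy of pre-computed statistics can be sketched as follows. This is a minimal illustration, not the STING implementation: it assumes two-dimensional points in the unit square, a hierarchy in which every parent cell covers 2 x 2 child cells, and hypothetical function names (`build_bottom_level`, `roll_up`). The lowest level is filled in a single scan of the data, and every higher level is derived from child statistics alone.

```python
def build_bottom_level(points, cells_per_side):
    """Single scan over the data: count, sum, min and max per lowest-level cell.

    points: iterable of (x, y, value) with x and y in [0, 1).
    """
    stats = {}
    for x, y, value in points:
        key = (int(x * cells_per_side), int(y * cells_per_side))
        if key not in stats:
            stats[key] = {"n": 1, "sum": value, "min": value, "max": value}
        else:
            s = stats[key]
            s["n"] += 1
            s["sum"] += value
            s["min"] = min(s["min"], value)
            s["max"] = max(s["max"], value)
    return stats


def roll_up(child_stats):
    """Derive the next higher level from child statistics only (no data access)."""
    parent = {}
    for (i, j), c in child_stats.items():
        key = (i // 2, j // 2)  # assumption: each parent covers 2 x 2 children
        if key not in parent:
            parent[key] = dict(c)
        else:
            p = parent[key]
            p["n"] += c["n"]
            p["sum"] += c["sum"]
            p["min"] = min(p["min"], c["min"])
            p["max"] = max(p["max"], c["max"])
    return parent


points = [(0.1, 0.1, 2.0), (0.2, 0.1, 4.0), (0.3, 0.3, 6.0), (0.9, 0.9, 10.0)]
bottom = build_bottom_level(points, cells_per_side=4)
top = roll_up(bottom)
mean_top_left = top[(0, 0)]["sum"] / top[(0, 0)]["n"]  # mean is derived from sum and count
```

Because a query only reads these per-cell summaries, its cost depends on the number of cells rather than on the number of objects.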
(2)WaveCluster
WaveCluster is a multi-resolution clustering algorithm. It first summarizes the data by imposing a multidimensional grid structure on the data space, then transforms the original feature space with a wavelet transform and finds dense regions in the transformed space. In this method, each grid cell summarizes the information of the group of points that map into that cell. This summary information typically fits in memory and is used by the multi-resolution wavelet transform and the subsequent cluster analysis.
A wavelet transform is a signal processing technique that decomposes a signal into sub-bands of different frequencies. By applying a one-dimensional wavelet transform n times, the wavelet model can be applied to n-dimensional signals. The transform preserves the relative distances between objects at different levels of resolution, which makes the natural clusters in the data easier to distinguish. Clusters can then be identified by looking for dense regions in the new space.
Wavelet transforms have the following advantages for clustering:
It provides unsupervised clustering. It uses hat-shaped filters that emphasize regions where points are dense while suppressing weaker information outside those regions. Dense regions in the original feature space therefore act as attractors for nearby points and as inhibitors for points farther away. As a result, the clusters in the data stand out automatically and the regions around them are cleared. Another advantage of the wavelet transform is therefore the automatic removal of outliers. The multi-resolution property of the wavelet transform also helps to detect clusters at different levels of accuracy.
Wavelet-based clustering is very fast, with a computational complexity of O(n), where n is the number of objects in the database. The algorithm can also be parallelized.
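The effect of smoothing a grid of cell counts can be illustrated with a small sketch. It makes several assumptions that are not in the text: a two-dimensional grid of counts, a single level of Haar averaging standing in for the hat-shaped filter, and a hypothetical density threshold. The one-dimensional filter is applied once per dimension, as described above.

```python
def haar_lowpass_1d(values):
    """One level of Haar smoothing: average each adjacent pair (length halves)."""
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values) - 1, 2)]


def haar_lowpass_2d(grid):
    """Apply the 1-D filter along every row, then along every column."""
    rows = [haar_lowpass_1d(r) for r in grid]
    cols = [haar_lowpass_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dense_cells(grid, threshold):
    """Cells of the transformed grid whose smoothed count reaches the threshold."""
    return {(i, j) for i, row in enumerate(grid) for j, v in enumerate(row) if v >= threshold}


# Cell counts on a 4 x 4 grid: one dense block, one dense strip, one stray point.
counts = [
    [8, 8, 0, 0],
    [8, 8, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 8, 8],
]
coarse = haar_lowpass_2d(counts)          # 2 x 2 grid at the coarser resolution
dense = dense_cells(coarse, threshold=2)  # threshold is an illustrative choice
```

In this example the stray point is averaged down to 0.25 and falls below the threshold, while the two dense regions survive. A full implementation would then group connected dense cells into clusters.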
(3)CLIQUE
The CLIQUE clustering algorithm combines grid-based and density-based clustering methods. It is very effective for clustering high-dimensional data in large databases. The central idea of CLIQUE is as follows:
Given a large collection of multidimensional data points, the points are usually not evenly distributed in the data space. CLIQUE distinguishes the sparse regions (or cells) of the space from the crowded ones in order to discover the global distribution pattern of the data set.
If the number of data points in a cell exceeds an input model parameter, the cell is dense. In CLIQUE, a cluster is defined as a maximal set of connected dense cells.
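The two definitions above (dense cell, cluster as a maximal set of connected dense cells) can be sketched for a single fixed subspace. This is only an illustration: the cell width, the threshold `min_points`, and the rule that two cells are connected when they share a face are assumptions, and the subspace search of CLIQUE is not shown.

```python
from collections import Counter


def find_dense_cells(points, cell_width, min_points):
    """A cell is dense if it holds more than min_points points."""
    counts = Counter(tuple(int(c // cell_width) for c in p) for p in points)
    return {cell for cell, n in counts.items() if n > min_points}


def connected_clusters(dense):
    """Group dense cells into maximal sets of face-adjacent cells."""
    remaining = set(dense)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, stack = {seed}, [seed]
        while stack:
            cell = stack.pop()
            for d in range(len(cell)):
                for step in (-1, 1):
                    neighbour = cell[:d] + (cell[d] + step,) + cell[d + 1:]
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        component.add(neighbour)
                        stack.append(neighbour)
        clusters.append(component)
    return clusters


points = [
    (0.1, 0.1), (0.2, 0.3), (0.5, 0.9),  # cell (0, 0)
    (1.1, 0.2), (1.5, 0.5), (1.9, 0.7),  # cell (1, 0), adjacent to (0, 0)
    (5.1, 5.2), (5.3, 5.9), (5.5, 5.5),  # cell (5, 5)
    (3.5, 3.5),                          # lone point, its cell is not dense
]
dense = find_dense_cells(points, cell_width=1.0, min_points=2)
clusters = connected_clusters(dense)
```

Here the cells (0, 0) and (1, 0) form one cluster, the cell (5, 5) forms another, and the lone point is left out.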
(4)SCI
The SCI clustering algorithm combines density-based and grid-based clustering methods. Its grid partitioning is similar to that of CLIQUE: for a d-dimensional data set D, each attribute is divided into equal-width intervals. The range of attribute j is first determined as [l_j, u_j], j = 1, 2, ..., d, and each range is then split into k equal parts, so the data space is divided into cells of equal volume. The grid is therefore uniform.
In a clustering subspace, SCI obtains the rough outline of the clusters by connecting dense cells. The number of data points that fall into a cell is taken as the cell's density. Cells are divided into three types: dense cells, sparse cells, and isolated cells. First, entropy theory is used to discard attributes that carry little clustering information. Then dense cells that are connected to one another, and separated from the rest by sparse cells, form the outline of a cluster. Isolated cells are also separated by sparse cells and are treated as outliers. Points in sparse cells may be either boundary points of a cluster or noise, so they need further processing: for each data point in a sparse cell, if the cell closest to it is a dense cell, the point is assigned to that cell's cluster; otherwise it is treated as noise. This yields the final clusters.
(5)MAFIA
The MAFIA clustering algorithm combines density-based and grid-based clustering methods. The size of the grid cells is determined from the data distribution, so the grid is non-uniform.
MAFIA uses a bottom-up subspace clustering technique. The basic idea can be summarized as follows: the grid is divided into cells according to the data distribution; candidate k-dimensional dense units are obtained by merging any two (k-1)-dimensional dense units that share a common (k-2)-dimensional sub-unit; clusters are then formed from the dense units.
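The candidate-generation step can be sketched as below. The representation is an assumption made for illustration: a unit is a frozenset of (dimension, bin) pairs, and `candidate_units` is a hypothetical helper name. The sketch only builds candidates; each candidate would still have to be checked against the data to see whether it is actually dense.

```python
from itertools import combinations


def candidate_units(dense_units):
    """Merge any two (k-1)-dimensional dense units sharing a (k-2)-dimensional sub-unit.

    dense_units: set of frozensets of (dimension, bin) pairs, all of size k-1.
    Returns the set of k-dimensional candidate units.
    """
    candidates = set()
    for a, b in combinations(dense_units, 2):
        shared = a & b
        merged = a | b
        dims = {d for d, _ in merged}
        # Share all but one pair, and the merged unit must span k distinct dimensions.
        if len(shared) == len(a) - 1 and len(dims) == len(a) + 1:
            candidates.add(merged)
    return candidates


# Four 2-dimensional dense units, written as {(dimension, bin), ...}.
dense_2d = {
    frozenset({(0, 1), (1, 3)}),
    frozenset({(0, 1), (2, 5)}),
    frozenset({(1, 3), (2, 5)}),
    frozenset({(0, 2), (1, 3)}),
}
candidates_3d = candidate_units(dense_2d)
```

The pair {(0, 1), (1, 3)} and {(0, 2), (1, 3)} is rejected because both units use two different bins of dimension 0, so their union does not describe a single 3-dimensional cell.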
The algorithm is suitable for high-dimensional, large-scale data sets, although its running time grows exponentially with the number of dimensions. Its advantage is that the user does not have to supply the usual grid parameters; its disadvantage is that it is very sensitive to the parameters it does take. Compared with CLIQUE, MAFIA is reported to have better performance and better clustering quality, and it can be seen as an improvement on CLIQUE.
(6)ENCLUS
The ENCLUS clustering algorithm is a grid-based clustering method. Each dimension of the data space is divided into equal intervals, so the grid is uniform.
ENCLUS finds clustering subspaces as follows: given a specified entropy threshold, it searches for effective subspaces bottom-up, starting from one dimension. The basic idea can be summarized as follows: building on the subspace search technique proposed by the CLIQUE algorithm, ENCLUS proposes an entropy-based search. The entropy of each subspace is computed, and if it is lower than the specified threshold, the subspace is considered effective. Within the effective subspaces found, clustering can be done with any existing clustering algorithm.
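A minimal sketch of the entropy criterion and the bottom-up search follows. It assumes coordinates in [0, 1), entropy measured in bits over the occupancy of grid cells, and hypothetical names (`subspace_entropy`, `effective_subspaces`, `omega` for the threshold); the candidate generation shown is a simple stand-in rather than the exact procedure of the algorithm.

```python
import math
from collections import Counter
from itertools import combinations


def subspace_entropy(points, dims, bins):
    """Entropy (in bits) of the cell occupancy when projecting onto `dims`."""
    counts = Counter(tuple(int(p[d] * bins) for d in dims) for p in points)
    n = len(points)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def effective_subspaces(points, n_dims, bins, omega):
    """Bottom-up search: keep subspaces whose entropy is below omega."""
    level = [(d,) for d in range(n_dims) if subspace_entropy(points, (d,), bins) < omega]
    found = list(level)
    while level:
        # Join two kept subspaces that differ in exactly one dimension.
        candidates = {
            tuple(sorted(set(a) | set(b)))
            for a, b in combinations(level, 2)
            if len(set(a) | set(b)) == len(a) + 1
        }
        level = [s for s in sorted(candidates) if subspace_entropy(points, s, bins) < omega]
        found.extend(level)
    return found


# Dimensions 0 and 1 are concentrated in one bin; dimension 2 is spread evenly.
points = [(0.1, 0.1, 0.1), (0.15, 0.2, 0.3), (0.2, 0.05, 0.6), (0.05, 0.15, 0.9)]
spread = subspace_entropy(points, (2,), bins=4)
subspaces = effective_subspaces(points, n_dims=3, bins=4, omega=1.0)
```

A concentrated subspace has low entropy and is kept, while the evenly spread dimension 2 has the maximum entropy for four bins (2 bits) and is dropped together with every subspace that would contain it.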
The time and space complexity of the algorithm are linear, as for the CLIQUE algorithm. The advantage of ENCLUS is that it proposes an efficient entropy-based criterion for searching subspaces; the disadvantage is that it is very sensitive to its parameters.
(7)DCLUST
The DCLUST clustering algorithm combines density-based and grid-based clustering methods. Each dimension of the data space is divided into equal intervals, so the grid is uniform.
The basic idea of the DCLUST algorithm can be summarized as follows: first partition the data space into a grid and use a density threshold to find the high-density cells; take the center of each high-density cell as its representative point; build a minimum spanning tree over these representative points (the R-MST) together with an outline structure; then use the R-MST for multi-resolution clustering and incremental clustering.
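The role of the spanning tree can be illustrated with a short sketch. It assumes the representative points are already known, builds the tree with Prim's algorithm, and obtains clusters by removing tree edges longer than a chosen length. That cutting rule is an assumption used here to show how one tree can serve several resolutions; the text does not spell out the exact rule DCLUST applies.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j) edges."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        best = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(len(points))
            if j not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges


def cut_long_edges(n, edges, max_length):
    """Clusters are the connected components left after dropping long edges."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for length, i, j in edges:
        if length <= max_length:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Centers of five high-density cells (hypothetical values).
centers = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (7.5, 7.5), (8.5, 7.5)]
tree = minimum_spanning_tree(centers)
fine = cut_long_edges(len(centers), tree, max_length=2.0)
coarse = cut_long_edges(len(centers), tree, max_length=10.0)
```

The tree is built once; changing `max_length` gives a finer or coarser clustering without touching the original data.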
The time and space complexity of the algorithm are O(n). Its advantages are that it can handle clusters of arbitrary shape in the presence of noise, is insensitive to the order of the data, and supports incremental clustering. DCLUST mainly addresses the problem that traditional spatial clustering algorithms cannot handle incremental clustering effectively.
(8)MMNG
The MMNG clustering algorithm is a grid-based clustering method. The grid is built with a P-tree data structure and is uniform.
The basic idea of the MMNG algorithm can be summarized as follows: it uses a P-tree data structure to partition the data set and computes the center point of each partition unit; clustering is then carried out on these centers, which improves on the MM algorithm.
The advantage of the algorithm is that, as the number of data dimensions grows, the number of cluster centers that MMNG has to evaluate is exponentially smaller than for the MM algorithm.
(9)GDILC
The GDILC clustering algorithm is a grid-based clustering method. Each dimension of the data space is divided into equal intervals, so the grid is uniform.
The basic idea of the GDILC algorithm can be summarized as follows: it performs grid-based clustering with density contour lines (isolines). Points on the same contour line belong to the same class, and if the distance between adjacent contour lines is below a threshold, the corresponding classes are merged. The time complexity of GDILC is linear. Its advantages are that it is fast and unsupervised and that it recognizes isolated points and clusters of various shapes well; its disadvantage is that it does not always separate classes well.
(10)Mean approximation method for grid clustering
This algorithm is a grid-based clustering method. Its basic idea can be summarized as follows: it applies mean approximation to density-based clustering over a gridded data space. Memory demand is reduced by replacing all the points stored in a grid cell with a single center-of-gravity point, and the cost of density calculation is reduced by using an approximate density computation. The advantage of the algorithm is that it lowers the memory requirement and greatly reduces computational complexity through mean calculation. It is an improvement on existing grid- and density-based clustering methods.
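The replacement of points by centers of gravity can be sketched as follows. The sketch assumes two-dimensional points, a uniform cell width, and a simple approximate density that counts the points of every cell whose center of gravity lies within a given radius; the function names and the density rule are illustrative assumptions, not the method's actual formulas.

```python
import math
from collections import defaultdict


def summarize(points, cell_width):
    """Replace the points of every cell by (center of gravity, point count)."""
    acc = defaultdict(lambda: [0, 0.0, 0.0])  # count, sum of x, sum of y
    for x, y in points:
        a = acc[(int(x // cell_width), int(y // cell_width))]
        a[0] += 1
        a[1] += x
        a[2] += y
    return {cell: ((sx / n, sy / n), n) for cell, (n, sx, sy) in acc.items()}


def approx_density(summary, center, radius):
    """Approximate count of points within `radius`, using only the cell summaries."""
    return sum(n for centroid, n in summary.values() if math.dist(centroid, center) <= radius)


points = [(0.25, 0.25), (0.75, 0.75), (0.5, 0.5), (3.5, 3.5)]
summary = summarize(points, cell_width=1.0)
near_origin = approx_density(summary, center=(0.5, 0.5), radius=1.0)
```

Only one centroid and one count are kept per occupied cell, so memory use and the cost of a density query depend on the number of occupied cells rather than on the number of points.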
(11)Mobile grid clustering algorithm
The mobile grid clustering algorithm [11] combines density-based and grid-based clustering methods. Each dimension of the data space is divided into equal intervals, so the grid is uniform. The basic idea of the algorithm can be summarized as follows: on top of traditional grid clustering, a sliding-window technique extends each grid cell by half a cell to improve the accuracy of clustering. The advantage of the algorithm is that it requires no user-supplied parameters and has high accuracy; the disadvantage is its high time complexity.
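Why a half-cell offset helps can be shown with a tiny sketch. It reads the description above as shifting the grid by half a cell width in every dimension, which is an assumption, and `cell_index` is a hypothetical helper. Two nearby points that a fixed grid splits across a cell boundary end up in the same cell of the shifted grid.

```python
def cell_index(point, cell_width, offset=0.0):
    """Index of the cell containing `point` in a grid shifted by `offset`."""
    return tuple(int((c + offset) // cell_width) for c in point)


a, b = (0.9, 0.5), (1.1, 0.5)  # close together, but on either side of x = 1.0

base_a, base_b = cell_index(a, 1.0), cell_index(b, 1.0)
shifted_a = cell_index(a, 1.0, offset=0.5)
shifted_b = cell_index(b, 1.0, offset=0.5)

split_by_base_grid = base_a != base_b           # True: the boundary separates them
joined_by_shifted_grid = shifted_a == shifted_b  # True: the shifted grid reunites them
```

Evaluating both grids costs more time than a single pass, which is consistent with the higher time complexity noted above.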
This article is from the "Dream Factory" blog, please be sure to keep this source http://limlee.blog.51cto.com/6717616/1931556