CLIQUE (CLustering In QUEst) is a simple grid-based clustering method for discovering density-based clusters in subspaces. CLIQUE partitions each dimension into non-overlapping intervals, thereby partitioning the entire embedding space of the data objects into cells. It uses a density threshold to distinguish dense cells from sparse ones: a cell is dense if the number of objects mapped to it exceeds the density threshold.
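The partitioning step can be sketched in a few lines of Java. This is only an illustration and is not taken from the implementation linked at the end of this article: the class name GridDensitySketch, the parameter xi (number of equal-width intervals per dimension), and the sample data are all assumptions. The sketch also treats a cell as dense when it holds at least l points, the convention used later in this article; use a strict comparison if you prefer the "exceeds" reading.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: equal-width grid over the full space, then a density filter. */
final class GridDensitySketch {

    /** Maps a value to its interval index, given the dimension's [min, max] range and xi intervals. */
    static int intervalIndex(double value, double min, double max, int xi) {
        if (max <= min) return 0;                        // degenerate range: a single interval
        int idx = (int) ((value - min) / (max - min) * xi);
        return Math.max(0, Math.min(idx, xi - 1));       // clamp so the maximum lands in the last interval
    }

    /** Counts points per cell; a cell is keyed by the list of its interval indices. */
    static Map<List<Integer>, Integer> countCells(double[][] points, double[] min, double[] max, int xi) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (double[] p : points) {
            List<Integer> cell = new ArrayList<>();
            for (int d = 0; d < p.length; d++) {
                cell.add(intervalIndex(p[d], min[d], max[d], xi));
            }
            counts.merge(cell, 1, Integer::sum);
        }
        return counts;
    }

    /** Keeps the cells whose point count is at least l (assumed reading of the threshold). */
    static List<List<Integer>> denseCells(Map<List<Integer>, Integer> counts, int l) {
        List<List<Integer>> dense = new ArrayList<>();
        for (Map.Entry<List<Integer>, Integer> e : counts.entrySet()) {
            if (e.getValue() >= l) dense.add(e.getKey());
        }
        return dense;
    }

    public static void main(String[] args) {
        double[][] points = {{25, 30}, {27, 32}, {26, 31}, {60, 90}};   // hypothetical (age, salary) rows
        Map<List<Integer>, Integer> counts =
                countCells(points, new double[] {20, 0}, new double[] {70, 100}, 5);
        System.out.println(denseCells(counts, 2));
    }
}
```

Counting cells in the full space like this is only practical for a few dimensions, which is why CLIQUE searches subspaces from the bottom up instead.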
The main strategy CLIQUE uses to identify a candidate search space is the monotonicity of dense cells with respect to dimensionality. This relies on the Apriori property used in frequent pattern and association rule mining. In the context of subspace clustering, the monotonicity statement is as follows:
A k-dimensional cell c (k > 1) can have at least l points only if every (k-1)-dimensional projection of c, which is a cell in a (k-1)-dimensional subspace, has at least l points. Consider an embedding data space with three dimensions: age, salary, and vacation. For example, a two-dimensional cell in the subspace formed by age and salary contains l points only if its projection onto each dimension (that is, onto age and onto salary, respectively) contains at least l points.
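The monotonicity property gives a cheap pruning test: before counting points in a k-dimensional candidate cell, check that every (k-1)-dimensional projection is already known to be dense. The following is a minimal sketch of that test, assuming a cell is represented as a map from dimension name to interval index; the class name MonotonicitySketch and this representation are illustrative choices, not part of the linked implementation. It requires Java 9 or later for Map.of and Set.of.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative sketch only: Apriori-style pruning of a candidate cell. */
final class MonotonicitySketch {

    /**
     * Returns true if every (k-1)-dimensional projection of the candidate
     * appears in the set of known dense (k-1)-dimensional cells.
     * Intended for candidates with k > 1 dimensions.
     */
    static boolean allProjectionsDense(Map<String, Integer> candidate,
                                       Set<Map<String, Integer>> denseLower) {
        for (String dim : candidate.keySet()) {
            Map<String, Integer> projection = new TreeMap<>(candidate);
            projection.remove(dim);                   // drop one dimension to get a projection
            if (!denseLower.contains(projection)) {
                return false;                         // the candidate cannot be dense, so prune it
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical dense one-dimensional cells: interval 3 on age, interval 4 on salary.
        Set<Map<String, Integer>> dense1d = Set.of(Map.of("age", 3), Map.of("salary", 4));
        System.out.println(allProjectionsDense(Map.of("age", 3, "salary", 4), dense1d));
        System.out.println(allProjectionsDense(Map.of("age", 3, "salary", 5), dense1d));
    }
}
```

Passing the test does not make a candidate dense; it only means the candidate is still worth counting.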
CLIQUE performs clustering in two stages. In the first stage, CLIQUE partitions the d-dimensional data space into non-overlapping rectangular cells and identifies the dense cells among them. CLIQUE finds dense cells in all of the subspaces. To do so, it partitions each dimension into intervals and identifies the intervals containing at least l points, where l is the density threshold. CLIQUE then iteratively joins dense cells from lower-dimensional subspaces to form candidate cells in subspaces with one more dimension, and checks whether the number of points in each candidate cell meets the density threshold. The iteration terminates when no candidates can be generated or no candidate cell is dense.
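A rough sketch of this bottom-up search is shown below. It assumes the points have already been discretized, so gridPoints[p][d] is the interval index of point p in dimension d, and it represents a cell as a map from dimension index to interval index. For brevity it joins any two dense k-dimensional cells that differ in exactly one dimension and agree on the rest, rather than using a prefix-based Apriori join, and it rescans all points for every candidate. The names DenseUnitSearchSketch, support, and findDenseUnits are hypothetical and are not taken from the linked implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustrative sketch only: level-wise search for dense cells in all subspaces. */
final class DenseUnitSearchSketch {

    /** Counts the points that fall inside the cell in every dimension the cell constrains. */
    static int support(int[][] gridPoints, Map<Integer, Integer> cell) {
        int count = 0;
        for (int[] p : gridPoints) {
            boolean inside = true;
            for (Map.Entry<Integer, Integer> e : cell.entrySet()) {
                if (p[e.getKey()] != e.getValue()) { inside = false; break; }
            }
            if (inside) count++;
        }
        return count;
    }

    /** True if a and b use the same interval in every dimension they share. */
    static boolean agreeOnShared(Map<Integer, Integer> a, Map<Integer, Integer> b) {
        for (Map.Entry<Integer, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) return false;
        }
        return true;
    }

    /** Returns the dense cells level by level: index 0 holds 1-D cells, index 1 holds 2-D cells, ... */
    static List<Set<Map<Integer, Integer>>> findDenseUnits(int[][] gridPoints, int dims, int xi, int l) {
        List<Set<Map<Integer, Integer>>> levels = new ArrayList<>();
        Set<Map<Integer, Integer>> current = new HashSet<>();
        for (int d = 0; d < dims; d++) {
            for (int i = 0; i < xi; i++) {
                Map<Integer, Integer> cell = new TreeMap<>();
                cell.put(d, i);
                if (support(gridPoints, cell) >= l) current.add(cell);
            }
        }
        while (!current.isEmpty()) {
            levels.add(current);
            Set<Map<Integer, Integer>> candidates = new HashSet<>();
            for (Map<Integer, Integer> a : current) {
                for (Map<Integer, Integer> b : current) {
                    Map<Integer, Integer> joined = new TreeMap<>(a);
                    joined.putAll(b);
                    // Valid join: exactly one extra dimension, same intervals on shared dimensions.
                    if (joined.size() == a.size() + 1 && agreeOnShared(a, b)) candidates.add(joined);
                }
            }
            Set<Map<Integer, Integer>> next = new HashSet<>();
            for (Map<Integer, Integer> c : candidates) {
                if (support(gridPoints, c) >= l) next.add(c);   // keep only dense candidates
            }
            current = next;   // stops when no candidate was generated or none was dense
        }
        return levels;
    }

    public static void main(String[] args) {
        int[][] gridPoints = {{0, 0}, {0, 0}, {0, 1}, {2, 2}};   // hypothetical discretized data
        System.out.println(findDenseUnits(gridPoints, 2, 3, 2));
    }
}
```

The loop ends on its own once the candidate set is empty, which matches the termination condition described above.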
In the second stage, CLIQUE uses the dense cells in each subspace to assemble clusters, which may have arbitrary shapes. The idea is to apply the minimum description length (MDL) principle and use maximal regions to cover connected dense cells, where a maximal region is a hyperrectangle in which every cell is dense and which cannot be extended further in any dimension of the subspace. Finding the best description of a cluster is NP-hard in general, so CLIQUE uses a simple greedy approach. It starts with an arbitrary dense cell, finds a maximal region covering that cell, and then works on the remaining dense cells that have not yet been covered. The greedy method terminates when all dense cells are covered.
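The greedy covering step can be illustrated for a two-dimensional subspace. The sketch below is a simplification under stated assumptions: cells are given as non-negative (x, y) interval indices, a region is grown from its seed first along x and then along y, and the result is returned as {x1, x2, y1, y2}. Because growing along y only makes further growth along x harder, the rectangle it returns cannot be extended in either dimension. The class GreedyCoverSketch and its method names are hypothetical, and the sketch does not group regions into connected clusters or remove redundant regions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch only: greedy cover of 2-D dense cells by maximal rectangles. */
final class GreedyCoverSketch {

    /** Packs a 2-D cell into one long so it can be stored in a HashSet. */
    static long key(int x, int y) {
        return ((long) x << 32) | (y & 0xffffffffL);
    }

    /** True if every cell in the rectangle [x1..x2] x [y1..y2] is dense. */
    static boolean allDense(Set<Long> dense, int x1, int x2, int y1, int y2) {
        for (int x = x1; x <= x2; x++) {
            for (int y = y1; y <= y2; y++) {
                if (!dense.contains(key(x, y))) return false;
            }
        }
        return true;
    }

    /** Grows a rectangle from the seed cell, first along x, then along y. Returns {x1, x2, y1, y2}. */
    static int[] grow(Set<Long> dense, int seedX, int seedY) {
        int x1 = seedX, x2 = seedX, y1 = seedY, y2 = seedY;
        while (allDense(dense, x1 - 1, x1 - 1, y1, y2)) x1--;   // extend left
        while (allDense(dense, x2 + 1, x2 + 1, y1, y2)) x2++;   // extend right
        while (allDense(dense, x1, x2, y1 - 1, y1 - 1)) y1--;   // extend down
        while (allDense(dense, x1, x2, y2 + 1, y2 + 1)) y2++;   // extend up
        return new int[] {x1, x2, y1, y2};
    }

    /** Greedy cover: pick an uncovered dense cell, grow a region, repeat until all are covered. */
    static List<int[]> cover(int[][] denseCells) {
        Set<Long> dense = new HashSet<>();
        for (int[] c : denseCells) dense.add(key(c[0], c[1]));
        Set<Long> covered = new HashSet<>();
        List<int[]> regions = new ArrayList<>();
        for (int[] c : denseCells) {
            if (covered.contains(key(c[0], c[1]))) continue;    // already inside an earlier region
            int[] r = grow(dense, c[0], c[1]);
            regions.add(r);
            for (int x = r[0]; x <= r[1]; x++) {
                for (int y = r[2]; y <= r[3]; y++) covered.add(key(x, y));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        int[][] denseCells = {{0, 0}, {1, 0}, {2, 0}, {0, 1}};   // hypothetical L-shaped cluster
        for (int[] r : cover(denseCells)) System.out.println(Arrays.toString(r));
    }
}
```

Regions may overlap, as they do for the L-shaped example, which is how a union of rectangles can describe a cluster of arbitrary shape.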
Finally, a Java implementation (multi-attribute clustering, multithreaded) is available here:
https://github.com/HK-Zhang/wheats/tree/master/src/ClusterClique
Reference: Jiawei Han, "Data Mining: Concepts and Techniques"