CLIQUE clustering algorithm and Java implementation with multithreading

Source: Internet
Author: User

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density-based clusters in subspaces. CLIQUE partitions each dimension into non-overlapping intervals, thereby partitioning the entire embedding space of the data objects into cells. It uses a density threshold to distinguish dense cells from sparse ones. A cell is dense if the number of objects mapped to it exceeds the density threshold.
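
The partitioning step above can be sketched as follows. This is a minimal illustration, not the code from the linked repository: the class name `GridDensitySketch`, the equal-width intervals, and the per-dimension `min`/`max` bounds are all assumptions made for the example. It only counts points per cell; deciding which cells are dense is then a comparison of each count against the chosen threshold.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps points to grid cells and counts points per cell.
class GridDensitySketch {

    // Map a value to one of `intervals` equal-width bins over [min, max] (assumed binning).
    static int intervalIndex(double value, double min, double max, int intervals) {
        if (max <= min) {
            return 0; // degenerate dimension: everything falls into one bin
        }
        int idx = (int) ((value - min) / (max - min) * intervals);
        // Clamp so that value == max lands in the last bin instead of overflowing.
        return Math.min(Math.max(idx, 0), intervals - 1);
    }

    // Count how many points fall into each cell.
    // A cell is identified by the list of its interval indices, one per dimension.
    static Map<List<Integer>, Integer> countCells(double[][] points, double[] min,
                                                  double[] max, int intervals) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (double[] p : points) {
            Integer[] cell = new Integer[p.length];
            for (int d = 0; d < p.length; d++) {
                cell[d] = intervalIndex(p[d], min[d], max[d], intervals);
            }
            counts.merge(Arrays.asList(cell), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[][] points = {{0.0, 0.0}, {0.1, 0.2}, {0.9, 0.9}, {1.0, 1.0}};
        Map<List<Integer>, Integer> counts =
                countCells(points, new double[]{0, 0}, new double[]{1, 1}, 2);
        System.out.println(counts);
    }
}
```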

The main strategy CLIQUE uses to narrow the candidate search space is the monotonicity of dense cells with respect to dimensionality. This is based on the Apriori property used in frequent pattern and association rule mining. In the context of subspace clustering, the monotonicity property states the following:

A k-dimensional cell c (k > 1) can have at least l points only if every (k-1)-dimensional projection of c, which is a cell in a (k-1)-dimensional subspace, has at least l points. Consider an example where the embedding data space contains three dimensions: age, salary, and vacation. A two-dimensional cell in the subspace formed by age and salary can contain l points only if its projection onto each dimension (that is, onto age and onto salary, respectively) contains at least l points.
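
The monotonicity property gives a cheap pruning test: a candidate cell can be discarded without counting its points if any of its (k-1)-dimensional projections is not already known to be dense. The sketch below shows one way to write that test. The representation of a cell as a map from dimension index to interval index, and the names `MonotonicityPruneSketch` and `allProjectionsDense`, are assumptions chosen for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of Apriori-style pruning for subspace cells.
class MonotonicityPruneSketch {

    // A cell is a map: dimension index -> interval index in that dimension.
    // Returns true only if every (k-1)-dimensional projection of `cell`
    // appears in `denseLower`, the set of dense (k-1)-dimensional cells.
    static boolean allProjectionsDense(Map<Integer, Integer> cell,
                                       Set<Map<Integer, Integer>> denseLower) {
        for (Integer dim : cell.keySet()) {
            // Drop one dimension at a time to form a projection.
            Map<Integer, Integer> projection = new TreeMap<>(cell);
            projection.remove(dim);
            if (!denseLower.contains(projection)) {
                return false; // one sparse projection rules the candidate out
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Dimension 0 = age, dimension 1 = salary (labels assumed for the example).
        Set<Map<Integer, Integer>> dense1d = new HashSet<>();
        dense1d.add(Map.of(0, 2)); // age interval 2 is dense
        dense1d.add(Map.of(1, 5)); // salary interval 5 is dense

        System.out.println(allProjectionsDense(Map.of(0, 2, 1, 5), dense1d));
        System.out.println(allProjectionsDense(Map.of(0, 2, 1, 4), dense1d));
    }
}
```

Passing this test does not make a cell dense; it only means the cell is still worth counting.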

CLIQUE performs clustering in two stages. In the first stage, CLIQUE partitions the d-dimensional data space into non-overlapping rectangular cells and identifies the dense cells among them. CLIQUE finds dense cells in all of the subspaces. To do this, it partitions each dimension into intervals and identifies the intervals that contain at least l points, where l is the density threshold. CLIQUE then iterates over subspaces of increasing dimensionality: it joins dense k-dimensional cells that agree on k-1 dimensions and intervals to generate (k+1)-dimensional candidate cells, and checks whether the number of points in each candidate satisfies the density threshold. The iteration terminates when no candidates can be generated or no candidate is dense.
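
One iteration of the first stage can be sketched as a join followed by a count. The code below is a simplified illustration under stated assumptions, not the implementation in the linked repository: cells are `TreeMap`s from dimension index to interval index, all cells passed to `join` have the same dimensionality and are non-empty, and the data has already been binned into an `int[][]` of interval indices. The names `CandidateJoinSketch`, `join`, and `support` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of one candidate-generation step.
class CandidateJoinSketch {

    // Join dense k-dimensional cells that agree on their first k-1
    // (dimension, interval) pairs and differ in their last dimension,
    // producing (k+1)-dimensional candidate cells.
    static List<TreeMap<Integer, Integer>> join(List<TreeMap<Integer, Integer>> dense) {
        List<TreeMap<Integer, Integer>> candidates = new ArrayList<>();
        for (TreeMap<Integer, Integer> a : dense) {
            for (TreeMap<Integer, Integer> b : dense) {
                // Require a's last dimension < b's to avoid duplicate candidates.
                if (a.lastKey() >= b.lastKey()) {
                    continue;
                }
                // headMap(lastKey) is the cell without its last dimension.
                if (!a.headMap(a.lastKey()).equals(b.headMap(b.lastKey()))) {
                    continue;
                }
                TreeMap<Integer, Integer> candidate = new TreeMap<>(a);
                candidate.put(b.lastKey(), b.lastEntry().getValue());
                candidates.add(candidate);
            }
        }
        return candidates;
    }

    // Count rows of pre-binned data (row[d] = interval index in dimension d)
    // that fall inside the given cell.
    static int support(TreeMap<Integer, Integer> cell, int[][] binned) {
        int count = 0;
        for (int[] row : binned) {
            boolean inside = true;
            for (Map.Entry<Integer, Integer> e : cell.entrySet()) {
                if (row[e.getKey()] != e.getValue()) {
                    inside = false;
                    break;
                }
            }
            if (inside) {
                count++;
            }
        }
        return count;
    }
}
```

A candidate returned by `join` is kept only if `support` reaches the density threshold; the surviving cells become the input to the next iteration.
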
In the second stage, CLIQUE uses the dense cells in each subspace to assemble clusters, which may have arbitrary shapes. The idea is to apply the minimum description length (MDL) principle and cover the connected dense cells with maximal regions, where a maximal region is a hyperrectangle in which every cell is dense and which cannot be extended further along any dimension of that subspace. Finding the best description of a cluster is NP-hard in general, so CLIQUE uses a simple greedy method. It starts with an arbitrary dense cell, finds a maximal region that covers it, and repeats the process on the remaining dense cells that have not yet been covered. The greedy method terminates when all dense cells are covered.
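
The greedy covering step can be sketched for a two-dimensional subspace as follows. This is an illustrative simplification, not the repository's code: the dense cells are given as a non-empty `boolean[][]` grid, a region is an inclusive `{rowLo, rowHi, colLo, colHi}` rectangle, and the order in which directions are tried is an arbitrary choice. The names `GreedyCoverSketch`, `cover`, and `allDense` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of greedy covering with maximal regions in 2-D.
class GreedyCoverSketch {

    // True if every cell in the inclusive rectangle is dense.
    static boolean allDense(boolean[][] dense, int r0, int r1, int c0, int c1) {
        for (int r = r0; r <= r1; r++) {
            for (int c = c0; c <= c1; c++) {
                if (!dense[r][c]) {
                    return false;
                }
            }
        }
        return true;
    }

    // Returns regions as {rowLo, rowHi, colLo, colHi}, all bounds inclusive.
    static List<int[]> cover(boolean[][] dense) {
        int rows = dense.length;
        int cols = dense[0].length;
        boolean[][] covered = new boolean[rows][cols];
        List<int[]> regions = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!dense[r][c] || covered[r][c]) {
                    continue; // only start from dense cells not yet covered
                }
                int[] g = {r, r, c, c};
                boolean grew = true;
                // Extend one row or column at a time until no direction works,
                // at which point the rectangle is maximal.
                while (grew) {
                    grew = false;
                    if (g[0] > 0 && allDense(dense, g[0] - 1, g[0] - 1, g[2], g[3])) { g[0]--; grew = true; }
                    if (g[1] < rows - 1 && allDense(dense, g[1] + 1, g[1] + 1, g[2], g[3])) { g[1]++; grew = true; }
                    if (g[2] > 0 && allDense(dense, g[0], g[1], g[2] - 1, g[2] - 1)) { g[2]--; grew = true; }
                    if (g[3] < cols - 1 && allDense(dense, g[0], g[1], g[3] + 1, g[3] + 1)) { g[3]++; grew = true; }
                }
                for (int i = g[0]; i <= g[1]; i++) {
                    for (int j = g[2]; j <= g[3]; j++) {
                        covered[i][j] = true;
                    }
                }
                regions.add(g);
            }
        }
        return regions;
    }
}
```

Regions produced this way may overlap, and a different extension order can yield a different but equally valid set of maximal regions.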

Finally, a Java implementation (multi-attribute clustering, multithreaded) is available at:

https://github.com/HK-Zhang/wheats/tree/master/src/ClusterClique

Reference: "Data Mining: Concepts and Techniques", Jiawei Han
