A super-awesome clustering algorithm published in Science


The authors (Alex Rodriguez and Alessandro Laio) propose a very concise and elegant clustering algorithm that can identify clusters of arbitrary shape and whose hyper-parameters are easy to determine.

Algorithm idea

The algorithm assumes that a cluster center is surrounded by neighbors with lower local density and lies at a relatively large distance from any point with higher local density. Two quantities are therefore defined for each point i: its local density ρi and its distance δi to the nearest point of higher density. The local density is

ρi = Σj χ(dij − dc)

where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise.

Here dc is a cutoff distance and a hyper-parameter of the algorithm, so ρi is simply the number of points whose distance from point i is less than dc. Because the algorithm is sensitive only to the relative values of ρi, the choice of dc is robust; a recommended rule of thumb is to pick dc so that the average number of neighbors per point is about 1–2% of the total number of points.
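
To make the definition concrete, here is a minimal NumPy sketch of computing ρi and of choosing dc by the 1–2% rule of thumb. It is illustrative only: the function names, the data matrix X of shape (n_points, n_features), and the neighbor_frac parameter are assumptions for this sketch, not notation from the paper.

```python
import numpy as np

def local_density(X, dc):
    """rho_i: the number of points within the cutoff distance dc of point i."""
    # Pairwise Euclidean distance matrix, shape (n, n).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # chi(d_ij - dc) = 1 when d_ij < dc; subtract 1 to exclude the point itself.
    rho = (d < dc).sum(axis=1) - 1
    return rho, d

def choose_dc(d, neighbor_frac=0.02):
    """Pick dc so each point has on average ~neighbor_frac * n neighbors."""
    n = d.shape[0]
    # The neighbor_frac quantile of all pairwise distances has that property.
    pairwise = d[np.triu_indices(n, k=1)]
    return float(np.quantile(pairwise, neighbor_frac))
```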

The distance δi is defined as the minimum distance from point i to any point with higher density, δi = min{j : ρj > ρi} dij; for the point with the highest density, one sets δi = maxj dij instead. Note that δi is much larger than the typical nearest-neighbor spacing only for points that are local or global maxima of the density.
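
Continuing the sketch, δi can be computed as below, reusing rho and the distance matrix d from the previous snippet. Recording each point's nearest higher-density neighbor (called parent here, an illustrative name) is not part of the definition, but it makes the later assignment step a one-liner:

```python
def delta_and_parent(rho, d):
    """delta_i: distance from point i to its nearest point of higher density.
    Also returns that neighbor's index ('parent') for the assignment step."""
    n = len(rho)
    delta = np.zeros(n)
    parent = np.full(n, -1)
    order = np.argsort(-rho)             # indices sorted by decreasing density
    delta[order[0]] = d[order[0]].max()  # densest point: delta_i = max_j d_ij
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]            # every point denser than point i
        j = higher[np.argmin(d[i, higher])]
        delta[i] = d[i, j]
        parent[i] = j
    return delta, parent
```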

Clustering process

Points with both a relatively large local density ρi and a large δi are taken to be cluster centers, while points with a small ρi but a large δi are outliers. Once the centers have been identified, each remaining point is assigned to the same cluster as its nearest neighbor of higher density (a code sketch of this step follows the figure description below). The paper illustrates the idea with the following figure:

The left panel shows the distribution of the points in two-dimensional space; the right panel, with ρ on the horizontal axis and δ on the vertical axis, is called the decision graph. As the decision graph shows, points 1 and 10 have both large ρi and large δi, so they serve as cluster centers. Points 26, 27, and 28 have a relatively large δi but a small ρi, so they are anomalies.
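
In code, center selection and assignment might look like the following sketch. The thresholds rho_min and delta_min stand in for the cut that the paper makes by eye on the decision graph; they, like the function names, are assumptions for illustration:

```python
def assign_clusters(rho, delta, parent, rho_min, delta_min):
    """Pick centers (large rho AND large delta), then propagate labels downhill."""
    labels = np.full(len(rho), -1)
    centers = np.flatnonzero((rho > rho_min) & (delta > delta_min))
    labels[centers] = np.arange(len(centers))
    # Visiting points in decreasing-density order guarantees that a point's
    # higher-density parent is labeled before the point itself. (The globally
    # densest point has no parent, so the thresholds must select it as a center.)
    for i in np.argsort(-rho):
        if labels[i] == -1:
            labels[i] = labels[parent[i]]  # same cluster as nearest denser neighbor
    return labels, centers
```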

Cluster analysis

In cluster analysis, it is often necessary to quantify how reliable each point's cluster assignment is. In this algorithm, one first defines a border region for each cluster: the set of points assigned to that cluster that lie within a distance dc of a point belonging to another cluster. Then, for each cluster, one finds the point with the highest local density in its border region and denotes that density ρh. All points in the cluster whose local density exceeds ρh are considered part of the cluster core (their assignment to the cluster is very reliable), while the remaining points form the cluster halo, which can be regarded as noise. A sketch of this computation follows the figure description below. The paper illustrates it with the following figure:

Panel A shows the probability distribution from which the data are generated, and panels B and C show 4000 and 1000 points, respectively, sampled from that distribution. Panels D and E are the decision graphs for the two data sets; in both, only five points have a relatively large ρi together with a large δi. These points are the cluster centers. Once the centers are determined, every remaining point is assigned either to a cluster core (colored points) or to a cluster halo (black points). Panel F shows that the clustering error rate decreases steadily as the number of sampled points grows, which indicates that the algorithm is robust.
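
A minimal sketch of the border-region and halo computation, reusing labels, rho, the distance matrix d, and dc from the earlier snippets:

```python
def split_core_halo(labels, rho, d, dc):
    """Return a boolean mask that is True for halo (possible noise) points."""
    halo = np.zeros(len(rho), dtype=bool)
    for k in np.unique(labels):
        in_k = labels == k
        # Border region: points of cluster k within dc of a point outside k.
        near_other = (d[np.ix_(in_k, ~in_k)] < dc).any(axis=1)
        if near_other.any():
            rho_h = rho[in_k][near_other].max()  # highest border density
            # Points below rho_h form the halo; the rest are the cluster core.
            halo[in_k] = rho[in_k] < rho_h
    return halo
```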

Finally, the paper demonstrates that the algorithm clusters a wide variety of data distributions very well.

References:

[1] Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 2014.

