A super-awesome clustering algorithm published in Science


The authors (Alex Rodriguez and Alessandro Laio) propose a very concise and elegant clustering algorithm that can identify clusters of arbitrary shape and whose hyper-parameters are easy to determine.

Algorithm idea

The algorithm assumes that a cluster center is surrounded by neighbors with lower local density and lies at a relatively large distance from any point with higher local density. Two quantities are therefore defined for each point i: its local density ρi and its distance δi to the nearest point of higher density. The local density is

ρi = Σj χ(dij − dc)

where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise.

Here dc is a cutoff distance and a hyper-parameter of the algorithm, so ρi is simply the number of points whose distance from point i is less than dc. Because the algorithm is sensitive only to the relative values of ρi, the choice of dc is robust; a recommended rule of thumb is to pick dc so that the average number of neighbors per point is about 1–2% of the total number of points.
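
To make the definition concrete, here is a minimal NumPy sketch of computing ρi and of choosing dc by the 1–2% rule of thumb. It is illustrative only: the function names, the data matrix X of shape (n_points, n_features), and the neighbor_frac parameter are assumptions for this sketch, not notation from the paper.

```python
import numpy as np

def local_density(X, dc):
    """rho_i: the number of points within the cutoff distance dc of point i."""
    # Pairwise Euclidean distance matrix, shape (n, n).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # chi(d_ij - dc) = 1 when d_ij < dc; subtract 1 to exclude the point itself.
    rho = (d < dc).sum(axis=1) - 1
    return rho, d

def choose_dc(d, neighbor_frac=0.02):
    """Pick dc so each point has on average ~neighbor_frac * n neighbors."""
    n = d.shape[0]
    # The neighbor_frac quantile of all pairwise distances has that property.
    pairwise = d[np.triu_indices(n, k=1)]
    return float(np.quantile(pairwise, neighbor_frac))
```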

The distance δi is defined as the minimum distance from point i to any point with higher density, δi = min{j : ρj > ρi} dij; for the point with the highest density, one sets δi = maxj dij instead. Note that δi is much larger than the typical nearest-neighbor spacing only for points that are local or global maxima of the density.
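
Continuing the sketch, δi can be computed as below, reusing rho and the distance matrix d from the previous snippet. Recording each point's nearest higher-density neighbor (called parent here, an illustrative name) is not part of the definition, but it makes the later assignment step a one-liner:

```python
def delta_and_parent(rho, d):
    """delta_i: distance from point i to its nearest point of higher density.
    Also returns that neighbor's index ('parent') for the assignment step."""
    n = len(rho)
    delta = np.zeros(n)
    parent = np.full(n, -1)
    order = np.argsort(-rho)             # indices sorted by decreasing density
    delta[order[0]] = d[order[0]].max()  # densest point: delta_i = max_j d_ij
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]            # every point denser than point i
        j = higher[np.argmin(d[i, higher])]
        delta[i] = d[i, j]
        parent[i] = j
    return delta, parent
```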

Clustering process

Points with both a relatively large local density ρi and a large δi are taken to be cluster centers, while points with a small ρi but a large δi are outliers. Once the centers have been identified, each remaining point is assigned to the same cluster as its nearest neighbor of higher density (a code sketch of this step follows the figure description below). The paper illustrates the idea with the following figure:

The left panel shows the distribution of the points in two-dimensional space; the right panel, with ρ on the horizontal axis and δ on the vertical axis, is called the decision graph. As the decision graph shows, points 1 and 10 have both large ρi and large δi, so they serve as cluster centers. Points 26, 27, and 28 have a relatively large δi but a small ρi, so they are anomalies.
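
In code, center selection and assignment might look like the following sketch. The thresholds rho_min and delta_min stand in for the cut that the paper makes by eye on the decision graph; they, like the function names, are assumptions for illustration:

```python
def assign_clusters(rho, delta, parent, rho_min, delta_min):
    """Pick centers (large rho AND large delta), then propagate labels downhill."""
    labels = np.full(len(rho), -1)
    centers = np.flatnonzero((rho > rho_min) & (delta > delta_min))
    labels[centers] = np.arange(len(centers))
    # Visiting points in decreasing-density order guarantees that a point's
    # higher-density parent is labeled before the point itself. (The globally
    # densest point has no parent, so the thresholds must select it as a center.)
    for i in np.argsort(-rho):
        if labels[i] == -1:
            labels[i] = labels[parent[i]]  # same cluster as nearest denser neighbor
    return labels, centers
```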

Cluster analysis

In cluster analysis, it is often necessary to quantify how reliable each point's cluster assignment is. In this algorithm, one first defines a border region for each cluster: the set of points assigned to that cluster that lie within a distance dc of a point belonging to another cluster. Then, for each cluster, one finds the point with the highest local density in its border region and denotes that density ρh. All points in the cluster whose local density exceeds ρh are considered part of the cluster core (their assignment to the cluster is very reliable), while the remaining points form the cluster halo, which can be regarded as noise. A sketch of this computation follows the figure description below. The paper illustrates it with the following figure:

Panel A shows the probability distribution from which the data are generated, and panels B and C show 4000 and 1000 points, respectively, sampled from that distribution. Panels D and E are the decision graphs for the two data sets; in both, only five points have a relatively large ρi together with a large δi. These points are the cluster centers. Once the centers are determined, every remaining point is assigned either to a cluster core (colored points) or to a cluster halo (black points). Panel F shows that the clustering error rate decreases steadily as the number of sampled points grows, which indicates that the algorithm is robust.
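
A minimal sketch of the border-region and halo computation, reusing labels, rho, the distance matrix d, and dc from the earlier snippets:

```python
def split_core_halo(labels, rho, d, dc):
    """Return a boolean mask that is True for halo (possible noise) points."""
    halo = np.zeros(len(rho), dtype=bool)
    for k in np.unique(labels):
        in_k = labels == k
        # Border region: points of cluster k within dc of a point outside k.
        near_other = (d[np.ix_(in_k, ~in_k)] < dc).any(axis=1)
        if near_other.any():
            rho_h = rho[in_k][near_other].max()  # highest border density
            # Points below rho_h form the halo; the rest are the cluster core.
            halo[in_k] = rho[in_k] < rho_h
    return halo
```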

Finally, the paper demonstrates that the algorithm clusters a wide variety of data distributions very well.

References:

[1] Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 2014.

