Clustering algorithms: partitioning high-density regions with DBSCAN


In the first two articles on clustering algorithms, we introduced the common prototype-based clustering algorithm K-means and hierarchical (agglomerative) clustering. This article introduces a density-based clustering algorithm, DBSCAN. K-means requires the number of clusters to be specified in advance, while agglomerative clustering does not; however, both algorithms assign every sample to some cluster and cannot distinguish noise, and K-means assumes spherical clusters, so neither can separate arbitrarily shaped high-density regions well. DBSCAN, by contrast, can find clusters of arbitrary shape. Through this article you will learn:

1, what is density clustering

2, the DBSCAN algorithm

3, using DBSCAN to partition a high-density region

First, density clustering

Density clustering is also called "density-based clustering". The algorithm assumes that the clustering structure can be determined by how closely the samples are distributed. Density clustering considers the connectivity between samples from the perspective of sample density, and obtains the final clustering result by continuously expanding clusters from densely connected samples.

Second, the DBSCAN algorithm

DBSCAN is a commonly used density-based clustering algorithm, in which density is defined as the number of sample points within a specified radius ε. DBSCAN assigns each sample point one of three labels:

Core point: if the number of sample points within the specified radius ε around a point is not less than a specified threshold (MinPts), the point is called a core point.

Border point: a point is called a border point if it has fewer than MinPts neighbors within radius ε but lies within the ε-neighborhood of a core point.

Noise point: points that are neither core points nor border points are called noise points.
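The three definitions can be applied directly to a small dataset. Here is a minimal sketch (the toy coordinates and the eps/MinPts values are made up for illustration; it counts the point itself in its own neighborhood, which is the usual convention):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

eps, min_pts = 1.0, 4
# Toy data: a tight group of four points, one outlying point near the
# group, and one far-away point
points = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
                   [0.5, 0.5], [1.4, 0.0], [5.0, 5.0]])

# radius_neighbors returns, for each point, the indices of all points
# within eps (including the point itself)
nn = NearestNeighbors(radius=eps).fit(points)
neighborhoods = nn.radius_neighbors(points, return_distance=False)

# Core: at least min_pts points in the eps-neighborhood
core = np.array([len(nb) >= min_pts for nb in neighborhoods])
# Border: not core, but a core point lies in the neighborhood
border = np.array([not core[i] and core[nb].any()
                   for i, nb in enumerate(neighborhoods)])
# Noise: everything else
noise = ~core & ~border

for i in range(len(points)):
    kind = "core" if core[i] else ("border" if border[i] else "noise")
    print(i, kind)
```

Here the four clustered points come out as core points, the nearby outlier as a border point, and the far-away point as noise.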

The DBSCAN algorithm mainly consists of two steps:

1, form a separate cluster from each core point or each group of connected core points (two core points are considered connected if they lie within ε of each other).

2, assign each border point to the cluster of its corresponding core point.
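The two steps can be sketched as a minimal from-scratch implementation (illustrative only and not optimized; `dbscan_sketch` is a made-up helper name, not a library function):

```python
import numpy as np

def dbscan_sketch(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from core points, then
    attach border points; unassigned points keep the noise label -1."""
    n = len(points)
    # Pairwise distances and eps-neighborhoods (each includes the point itself)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighborhoods = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = np.full(n, -1)  # -1 marks noise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Step 1: start a new cluster and expand it through connected core points
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighborhoods[p]:
                if labels[q] == -1:
                    labels[q] = cluster      # Step 2: border points join the cluster
                    if core[q]:
                        frontier.append(q)   # only core points keep expanding
        cluster += 1
    return labels
```

Note that a border point reachable from two clusters is simply assigned to whichever cluster claims it first, which is also how typical library implementations behave.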

Compared with the K-means algorithm, DBSCAN's clusters are not necessarily spherical, which is one of its advantages. DBSCAN can also identify and remove noise points, so it does not necessarily assign every sample point to a cluster. Below, we use a half-moon-shaped dataset to compare the clustering results of K-means, agglomerative clustering, and DBSCAN.

Third, using DBSCAN to partition high-density data

1, acquiring the dataset

Use sklearn's dataset utilities to generate a half-moon dataset containing 200 sample points in 2 classes, with some noise added.

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

if __name__ == "__main__":
    # Generate the half-moon dataset:
    # 200 points, with some Gaussian noise added
    x, y = make_moons(n_samples=200, noise=0.09, random_state=0)
    plt.scatter(x[:, 0], x[:, 1], c="blue", marker="o")
    plt.show()

2, clustering with the K-means algorithm

    # Cluster with the K-means algorithm
    from sklearn.cluster import KMeans
    # Initialize a KMeans object
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300)
    # Fit and predict
    y_km = km.fit_predict(x)
    # Plot the points belonging to cluster 1
    plt.scatter(x[y_km == 0, 0], x[y_km == 0, 1], c="green", marker="o", label="cluster 1")
    # Plot the points belonging to cluster 2
    plt.scatter(x[y_km == 1, 0], x[y_km == 1, 1], c="red", marker="s", label="cluster 2")
    # Set the title
    plt.title("K-means clustering")
    plt.legend()
    plt.show()

The figure above shows that the K-means algorithm does not separate the half-moon data well.

3, using agglomerative clustering

    # Use agglomerative clustering
    from sklearn.cluster import AgglomerativeClustering
    # Initialize an agglomerative clustering object with complete linkage
    ac = AgglomerativeClustering(n_clusters=2, affinity="euclidean", linkage="complete")
    # Fit and predict
    ac_y = ac.fit_predict(x)
    # Plot the points belonging to cluster 1
    plt.scatter(x[ac_y == 0, 0], x[ac_y == 0, 1], c="green", marker="o", label="cluster 1")
    # Plot the points belonging to cluster 2
    plt.scatter(x[ac_y == 1, 0], x[ac_y == 1, 1], c="red", marker="s", label="cluster 2")
    # Set the title
    plt.title("Agglomerative clustering")
    plt.legend()
    plt.show()

The results show that agglomerative clustering also fails to divide the half-moon data into two clusters well.

4, using the DBSCAN algorithm for clustering

When using the DBSCAN algorithm, you need to set two parameters: the radius ε (eps) and MinPts (min_samples).

    # Use DBSCAN clustering
    from sklearn.cluster import DBSCAN
    # Initialize a DBSCAN object
    # eps: the radius
    # min_samples: the MinPts threshold
    dbscan = DBSCAN(eps=0.2, min_samples=5, metric="euclidean")
    # Fit and predict
    dbscan_y = dbscan.fit_predict(x)
    # Plot the points belonging to cluster 1
    plt.scatter(x[dbscan_y == 0, 0], x[dbscan_y == 0, 1], c="green", marker="o", label="cluster 1")
    # Plot the points belonging to cluster 2
    plt.scatter(x[dbscan_y == 1, 0], x[dbscan_y == 1, 1], c="red", marker="s", label="cluster 2")
    # Plot the noise points
    plt.scatter(x[dbscan_y == -1, 0], x[dbscan_y == -1, 1], c="blue", marker="^", label="noise")
    # Set the title
    plt.title("DBSCAN clustering")
    plt.legend()
    plt.show()

The DBSCAN results show that this high-density dataset is divided very well. Compared with K-means and agglomerative clustering, DBSCAN can also mark the noise points, which are assigned the cluster label -1.
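As a supplementary check not in the original walkthrough, the label assignment can be confirmed without a plot by recreating the data and clustering from the snippets above and counting the labels:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Recreate the dataset and clustering used above, then count the labels
# that DBSCAN assigned; noise points get the label -1
x, y = make_moons(n_samples=200, noise=0.09, random_state=0)
dbscan_y = DBSCAN(eps=0.2, min_samples=5, metric="euclidean").fit_predict(x)

labels, counts = np.unique(dbscan_y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```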

Conclusion: When using clustering algorithms to analyze data, the curse of dimensionality worsens as the number of sample features grows, especially when Euclidean distance is used as the metric. When using the DBSCAN algorithm, the two parameters eps and MinPts need to be tuned; if the density varies widely across the dataset, it is difficult to find a single eps and MinPts that fit everywhere. Besides cluster analysis, DBSCAN can also be used to remove noise. In practice, for a given dataset it is hard to know in advance which algorithm to choose, especially when the data is high-dimensional or hard to visualize. A good clustering result depends not only on the algorithm and its parameters but also on choosing an appropriate distance metric.
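One widely used heuristic for tuning the radius, not covered in this article, is the k-distance curve: sort every point's distance to its k-th nearest neighbor (with k equal to MinPts) and look for the elbow where the curve bends sharply; an eps near the elbow separates dense regions from sparse ones. A sketch on the same half-moon data:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

x, _ = make_moons(n_samples=200, noise=0.09, random_state=0)
k = 5  # same as min_samples=5 used above

# The query set equals the fit set, so each point's nearest neighbor is
# itself at distance 0; since sklearn's min_samples also counts the point
# itself, n_neighbors=k matches that convention
nn = NearestNeighbors(n_neighbors=k).fit(x)
distances, _ = nn.kneighbors(x)
k_dist = np.sort(distances[:, -1])  # sorted k-th nearest-neighbor distances

# Inspect a few quantiles as candidate eps values; the elbow usually sits
# in the upper quantiles, just before the distances jump for noise points
for q in (0.50, 0.90, 0.95, 0.99):
    print(f"{q:.0%} of points have k-distance <= {np.quantile(k_dist, q):.3f}")
```

Plotting `k_dist` against the point index makes the elbow easy to spot visually; the printed quantiles are a rough text-only substitute.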


