Clustering algorithms: partitioning high-density regions with DBSCAN


In the first two articles on clustering algorithms, we introduced the common prototype-based clustering algorithm K-means and hierarchical (agglomerative) clustering. This article introduces a density-based clustering algorithm, DBSCAN. K-means requires the number of clusters to be specified in advance, while agglomerative clustering does not; however, both algorithms assign every sample to some cluster and cannot distinguish noise, and K-means assumes spherical clusters, so neither can separate arbitrarily shaped high-density regions well. DBSCAN, by contrast, can find clusters of arbitrary shape. Through this article you will learn:

1, what is density clustering

2, the DBSCAN algorithm

3, using DBSCAN to partition a high-density region

First, density clustering

Density clustering is also called "density-based clustering". The algorithm assumes that the clustering structure can be determined by how closely the samples are distributed. Density clustering considers the connectivity between samples from the perspective of sample density, and obtains the final clustering result by continuously expanding clusters from densely connected samples.

Second, the DBSCAN algorithm

DBSCAN is a commonly used density-based clustering algorithm, in which density is defined as the number of sample points within a specified radius ε. DBSCAN assigns each sample point one of three labels:

Core point: if the number of sample points within the specified radius ε around a point is not less than a specified threshold (MinPts), the point is called a core point.

Border point: a point is called a border point if it has fewer than MinPts neighbors within radius ε but lies within the ε-neighborhood of a core point.

Noise point: points that are neither core points nor border points are called noise points.
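The three definitions can be applied directly to a small dataset. Here is a minimal sketch (the toy coordinates and the eps/MinPts values are made up for illustration; it counts the point itself in its own neighborhood, which is the usual convention):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

eps, min_pts = 1.0, 4
# Toy data: a tight group of four points, one outlying point near the
# group, and one far-away point
points = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
                   [0.5, 0.5], [1.4, 0.0], [5.0, 5.0]])

# radius_neighbors returns, for each point, the indices of all points
# within eps (including the point itself)
nn = NearestNeighbors(radius=eps).fit(points)
neighborhoods = nn.radius_neighbors(points, return_distance=False)

# Core: at least min_pts points in the eps-neighborhood
core = np.array([len(nb) >= min_pts for nb in neighborhoods])
# Border: not core, but a core point lies in the neighborhood
border = np.array([not core[i] and core[nb].any()
                   for i, nb in enumerate(neighborhoods)])
# Noise: everything else
noise = ~core & ~border

for i in range(len(points)):
    kind = "core" if core[i] else ("border" if border[i] else "noise")
    print(i, kind)
```

Here the four clustered points come out as core points, the nearby outlier as a border point, and the far-away point as noise.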

The DBSCAN algorithm mainly consists of two steps:

1, form a separate cluster from each core point or each group of connected core points (two core points are considered connected if they lie within ε of each other).

2, assign each border point to the cluster of its corresponding core point.
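The two steps can be sketched as a minimal from-scratch implementation (illustrative only and not optimized; `dbscan_sketch` is a made-up helper name, not a library function):

```python
import numpy as np

def dbscan_sketch(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from core points, then
    attach border points; unassigned points keep the noise label -1."""
    n = len(points)
    # Pairwise distances and eps-neighborhoods (each includes the point itself)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighborhoods = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighborhoods])

    labels = np.full(n, -1)  # -1 marks noise
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        # Step 1: start a new cluster and expand it through connected core points
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighborhoods[p]:
                if labels[q] == -1:
                    labels[q] = cluster      # Step 2: border points join the cluster
                    if core[q]:
                        frontier.append(q)   # only core points keep expanding
        cluster += 1
    return labels
```

Note that a border point reachable from two clusters is simply assigned to whichever cluster claims it first, which is also how typical library implementations behave.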

Compared with the K-means algorithm, DBSCAN's clusters are not necessarily spherical, which is one of its advantages. DBSCAN can also identify and remove noise points, so it does not necessarily assign every sample point to a cluster. Below, we use a half-moon-shaped dataset to compare the clustering results of K-means, agglomerative clustering, and DBSCAN.

Third, using DBSCAN to partition high-density data

1, acquiring the dataset

Use sklearn's dataset utilities to generate a half-moon dataset containing 200 sample points in 2 classes, with some noise added.

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

if __name__ == "__main__":
    # Generate the half-moon dataset:
    # 200 points, with some Gaussian noise added
    x, y = make_moons(n_samples=200, noise=0.09, random_state=0)
    plt.scatter(x[:, 0], x[:, 1], c="blue", marker="o")
    plt.show()

2, clustering with the K-means algorithm

    # Cluster with the K-means algorithm
    from sklearn.cluster import KMeans
    # Initialize a KMeans object
    km = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300)
    # Fit and predict
    y_km = km.fit_predict(x)
    # Plot the points belonging to cluster 1
    plt.scatter(x[y_km == 0, 0], x[y_km == 0, 1], c="green", marker="o", label="cluster 1")
    # Plot the points belonging to cluster 2
    plt.scatter(x[y_km == 1, 0], x[y_km == 1, 1], c="red", marker="s", label="cluster 2")
    # Set the title
    plt.title("K-means clustering")
    plt.legend()
    plt.show()

The figure above shows that the K-means algorithm does not separate the half-moon data well.

3, using agglomerative clustering

    # Use agglomerative clustering
    from sklearn.cluster import AgglomerativeClustering
    # Initialize an agglomerative clustering object with complete linkage
    ac = AgglomerativeClustering(n_clusters=2, affinity="euclidean", linkage="complete")
    # Fit and predict
    ac_y = ac.fit_predict(x)
    # Plot the points belonging to cluster 1
    plt.scatter(x[ac_y == 0, 0], x[ac_y == 0, 1], c="green", marker="o", label="cluster 1")
    # Plot the points belonging to cluster 2
    plt.scatter(x[ac_y == 1, 0], x[ac_y == 1, 1], c="red", marker="s", label="cluster 2")
    # Set the title
    plt.title("Agglomerative clustering")
    plt.legend()
    plt.show()

The results show that agglomerative clustering also fails to divide the half-moon data into two clusters well.

4, using the DBSCAN algorithm for clustering

When using the DBSCAN algorithm, you need to set two parameters: the radius ε (eps) and MinPts (min_samples).

    # Use DBSCAN clustering
    from sklearn.cluster import DBSCAN
    # Initialize a DBSCAN object
    # eps: the radius
    # min_samples: the MinPts threshold
    dbscan = DBSCAN(eps=0.2, min_samples=5, metric="euclidean")
    # Fit and predict
    dbscan_y = dbscan.fit_predict(x)
    # Plot the points belonging to cluster 1
    plt.scatter(x[dbscan_y == 0, 0], x[dbscan_y == 0, 1], c="green", marker="o", label="cluster 1")
    # Plot the points belonging to cluster 2
    plt.scatter(x[dbscan_y == 1, 0], x[dbscan_y == 1, 1], c="red", marker="s", label="cluster 2")
    # Plot the noise points
    plt.scatter(x[dbscan_y == -1, 0], x[dbscan_y == -1, 1], c="blue", marker="^", label="noise")
    # Set the title
    plt.title("DBSCAN clustering")
    plt.legend()
    plt.show()

The DBSCAN results show that this high-density dataset is divided very well. Compared with K-means and agglomerative clustering, DBSCAN can also mark the noise points, which are assigned the cluster label -1.
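As a supplementary check not in the original walkthrough, the label assignment can be confirmed without a plot by recreating the data and clustering from the snippets above and counting the labels:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Recreate the dataset and clustering used above, then count the labels
# that DBSCAN assigned; noise points get the label -1
x, y = make_moons(n_samples=200, noise=0.09, random_state=0)
dbscan_y = DBSCAN(eps=0.2, min_samples=5, metric="euclidean").fit_predict(x)

labels, counts = np.unique(dbscan_y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```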

Conclusion: When using clustering algorithms to analyze data, the curse of dimensionality worsens as the number of sample features grows, especially when Euclidean distance is used as the metric. When using the DBSCAN algorithm, the two parameters eps and MinPts need to be tuned; if the density varies widely across the dataset, it is difficult to find a single eps and MinPts that fit everywhere. Besides cluster analysis, DBSCAN can also be used to remove noise. In practice, for a given dataset it is hard to know in advance which algorithm to choose, especially when the data is high-dimensional or hard to visualize. A good clustering result depends not only on the algorithm and its parameters but also on choosing an appropriate distance metric.
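One widely used heuristic for tuning the radius, not covered in this article, is the k-distance curve: sort every point's distance to its k-th nearest neighbor (with k equal to MinPts) and look for the elbow where the curve bends sharply; an eps near the elbow separates dense regions from sparse ones. A sketch on the same half-moon data:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

x, _ = make_moons(n_samples=200, noise=0.09, random_state=0)
k = 5  # same as min_samples=5 used above

# The query set equals the fit set, so each point's nearest neighbor is
# itself at distance 0; since sklearn's min_samples also counts the point
# itself, n_neighbors=k matches that convention
nn = NearestNeighbors(n_neighbors=k).fit(x)
distances, _ = nn.kneighbors(x)
k_dist = np.sort(distances[:, -1])  # sorted k-th nearest-neighbor distances

# Inspect a few quantiles as candidate eps values; the elbow usually sits
# in the upper quantiles, just before the distances jump for noise points
for q in (0.50, 0.90, 0.95, 0.99):
    print(f"{q:.0%} of points have k-distance <= {np.quantile(k_dist, q):.3f}")
```

Plotting `k_dist` against the point index makes the elbow easy to spot visually; the printed quantiles are a rough text-only substitute.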


