DBSCAN Density Clustering


1. Density Clustering Concept

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the classic density-based clustering algorithm. Compared with algorithms such as K-means and BIRCH, which are generally only applicable to convex sample sets, DBSCAN can be applied to convex sample sets and non-convex sample sets alike.
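
For a concrete picture of this difference, here is a minimal sketch using scikit-learn (an assumption; the author's own implementation appears in section 3) on the classic two-moons data, where K-means cuts the non-convex crescents with a straight boundary while DBSCAN recovers them:

import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons: a non-convex sample set.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # -1 marks noise

print("k-means labels found:", np.unique(km_labels))
print("DBSCAN labels found :", np.unique(db_labels))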

2. Density Clustering Steps

DBSCAN algorithm description.

Input: a database containing n objects, a radius e, and a minimum number of points minpts.
Output: all clusters that meet the density requirement.

(1) REPEAT
(2) Extract an unprocessed point from the database.
(3) IF the extracted point is a core point, THEN find all objects density-reachable from it; together they form a cluster.
(4) ELSE the point is an edge point (a non-core object); leave this iteration and look for the next point.
(5) UNTIL all points have been processed.

DBSCAN is sensitive to the user-defined parameters: subtle differences can lead to very different results, and there is no general rule for choosing the parameters; they can only be determined by experience. The numbered procedure is sketched in code below.
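
A minimal sketch of the textbook expansion loop described above (not the author's matrix-based code, which appears in section 3), assuming the data is a NumPy array and distances are Euclidean:

import numpy as np

def dbscan_sketch(data, e, minpts):
    # labels: 0 = unprocessed, -1 = noise, 1..k = cluster ids
    n = len(data)
    labels = np.zeros(n, dtype=int)
    cluster = 0
    for i in range(n):                        # (1)-(2) repeat over unprocessed points
        if labels[i] != 0:
            continue
        neighbors = np.where(np.linalg.norm(data - data[i], axis=1) <= e)[0]
        if len(neighbors) < minpts:           # (4) edge point: move on to the next point
            labels[i] = -1
            continue
        cluster += 1                          # (3) core point: grow a cluster from it
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                          # collect every point density-reachable from i
            j = seeds.pop()
            if labels[j] == -1:               # former noise turns out to be a border point
                labels[j] = cluster
            if labels[j] != 0:
                continue
            labels[j] = cluster
            j_neighbors = np.where(np.linalg.norm(data - data[j], axis=1) <= e)[0]
            if len(j_neighbors) >= minpts:    # j is itself a core point, so keep expanding
                seeds.extend(j_neighbors)
    return labels                             # (5) until all points are processed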

The key to this algorithm is to understand several concepts (illustrated in the sketch after the list):

    • Directly density-reachable
    • Density-reachable
    • Core point
    • Border point
    • Noise point
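
One way to pin down core, border, and noise points is a few lines of NumPy on a toy distance matrix; the data and parameters below are made up purely for illustration:

import numpy as np

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [0.1, 0.1], [0.3, 0.0], [2.0, 2.0]])  # toy data
e, minpts = 0.2, 4

dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
counts = (dist <= e).sum(axis=1)                  # each point counts itself

core = counts >= minpts                           # core points
# A border point is not core but is directly density-reachable
# from some core point (i.e. it lies within e of one).
border = ~core & ((dist <= e) & core[None, :]).any(axis=1)
noise = ~core & ~border                           # neither core nor border
print(core)    # [ True  True  True  True False False]
print(border)  # [False False False False  True False]
print(noise)   # [False False False False False  True]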
A reference for understanding these concepts: the accompanying PPT.

3. Python Implementation

Idea: first find all the core points; the core points are those with >= minpts points inside their radius-e neighborhood. Note: all points inside a core point's neighborhood belong to the same class as the core point! So if two point sets share a common point, they should be merged into one class. Example: class 1 is [1, 2, 4, 6, 8] and class 2 is [3, 6, 7, 9, 10, 99]. These two sets start out as two classes, but because they share the element 6, they should be combined into one class. The algorithm is therefore very simple, and the code follows these steps:

1) Compute the distance matrix of all points, dis = [n, n], where n is the number of data points.
2) For each row of dis, count the entries within distance e (say e = 3); whenever that count is >= minpts, the row's neighborhood forms one candidate class.
3) Merge any of these classes that share points, repeating until no duplicates remain.
4) The remaining non-overlapping classes are the final clusters.

A quick sketch of the merge rule in step 3 is shown next, followed by the full code.
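As an illustration of that merge rule, using the toy lists from the example above:

c1 = [1, 2, 4, 6, 8]
c2 = [3, 6, 7, 9, 10, 99]
if set(c1) & set(c2):                  # the shared element 6 links the two classes
    merged = sorted(set(c1) | set(c2))
    print(merged)                      # [1, 2, 3, 4, 6, 7, 8, 9, 10, 99]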
Code:

# coding: utf-8
"""@author = LPS"""
import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('moon.txt')
n, m = data.shape
all_index = np.arange(n)
dis = np.zeros([n, n])
data = np.delete(data, m - 1, axis=1)        # drop the trailing label column


def dis_vec(a, b):
    """Compute the Euclidean distance between two vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    return np.sqrt(np.sum(np.square(a - b)))


for i in range(n):                           # compute the distance matrix
    for j in range(i):
        dis[i, j] = dis_vec(data[i], data[j])
        dis[j, i] = dis[i, j]


def dbscan(s, minpts):                       # density clustering
    center_points = []                       # holds the cluster result
    k = 0                                    # flags whether a merge was performed
    for i in range(n):
        if sum(dis[i] <= s) >= minpts:       # row i of the distance matrix meets the core condition
            if len(center_points) == 0:      # first cluster: append directly
                center_points.append(list(all_index[dis[i] <= s]))
            else:
                for j in range(len(center_points)):     # look for shared elements
                    if set(all_index[dis[i] <= s]) & set(center_points[j]):
                        center_points[j].extend(list(all_index[dis[i] <= s]))
                        k = 1                # a merge was performed
                if k == 0:                   # no merge: this neighborhood is a new class
                    center_points.append(list(all_index[dis[i] <= s]))
                k = 0
    lenc = len(center_points)
    # The code below is a further de-duplication pass: the lists in center_points
    # are not yet completely independent; there is still a lot of repetition.
    # The two passes could be unified, but the time and space complexity would be
    # too high; after the first pass the number of elements in center_points is
    # greatly reduced, so this second pass is much faster.
    k = 0
    for i in range(lenc - 1):
        for j in range(i + 1, lenc):
            if set(center_points[i]) & set(center_points[j]):
                center_points[j].extend(center_points[i])
                center_points[j] = list(set(center_points[j]))
                k = 1
        if k == 1:
            center_points[i] = []            # empty the list that was merged away
            k = 0
    center_points = [c for c in center_points if c != []]  # drop empty lists; final result
    return center_points


if __name__ == '__main__':
    center_points = dbscan(0.2, 10)          # radius and minimum number of points
    c_n = len(center_points)                 # number of clusters after clustering
    print(c_n)
    color = ['g', 'r', 'b', 'm', 'k']
    noise_point = np.arange(n)               # points in no cluster are noise points
    for i in range(c_n):
        ct_point = list(set(center_points[i]))
        noise_point = set(noise_point) - set(center_points[i])
        print(len(ct_point))                 # number of points in this class
        print(ct_point)                      # the points in this class
        print("**********")
    noise_point = list(noise_point)
    for i in range(c_n):
        ct_point = list(set(center_points[i]))
        plt.scatter(data[ct_point, 0], data[ct_point, 1], color=color[i])  # plot each cluster
    plt.scatter(data[noise_point, 0], data[noise_point, 1],
                color=color[c_n], marker='h', linewidths=0.1)              # plot the noise
    plt.show()
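
The experiment below also runs scikit-learn's DBSCAN for comparison (the "Square4-sklearn" result). A minimal equivalent of the script above using that library, assuming the same whitespace-delimited file with a trailing label column:

import numpy as np
from sklearn.cluster import DBSCAN

data = np.loadtxt('moon.txt')[:, :-1]            # drop the trailing label column
labels = DBSCAN(eps=0.2, min_samples=10).fit_predict(data)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", np.sum(labels == -1), "noise points")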

The main advantages of DBSCAN are:

1) It can cluster dense data sets of arbitrary shape, whereas clustering algorithms such as K-means are generally only applicable to convex data sets.

2) It finds outliers at the same time as it clusters, and it is insensitive to outliers in the data set.

3) The clustering result carries no initialization bias, whereas the initial values of clustering algorithms such as K-means have a great influence on the result.

The main drawbacks of DBSCAN are:

1) If the density of the sample set is not uniform and the gaps between clusters differ greatly, the clustering quality is poor; DBSCAN is generally not suitable in that case.

2) If the sample set is large, clustering takes a long time to converge; this can be improved by limiting the size of the KD-tree or ball tree used for the nearest-neighbor search (see the sketch after this list).

3) Parameter tuning is slightly more complex than for traditional clustering algorithms such as K-means, and different parameter combinations have a great influence on the final clustering result.
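
As an aside on point 2, scikit-learn's DBSCAN exposes the neighbor-search structure directly; a sketch, with parameter values that are illustrative only:

from sklearn.cluster import DBSCAN

# Neighbor queries dominate the cost on large sample sets; a KD-tree
# or ball tree with a bounded leaf size speeds them up.
db = DBSCAN(eps=0.9, min_samples=15, algorithm='kd_tree', leaf_size=30)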

Experiment:

[Figure: Square4, original data and clustering result, e=0.85, minpts=13; Square4 with sklearn, e=0.9, minpts=15]

[Figure: Square1, original data and clustering result, e=1.185, minpts=8; and with e=0.85, minpts=15]

The experimental process: the first few pictures required many rounds of parameter tuning because their distributions are dense; for the later pictures the distributions are more dispersed, so the parameters were basically set successfully on the first try.

The results and materials have been uploaded; welcome to download them!
