Cluster analysis two: DBSCAN algorithm

Source: Internet
Author: User

One. Basic concepts

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm

    • Core object (A): a point whose density reaches the threshold set by the algorithm, that is, a point whose r-neighborhood contains at least minPts points
    • Neighborhood distance threshold: the chosen radius r
    • Directly density-reachable: if point p lies in the r-neighborhood of point q and q is a core point, then p is directly density-reachable from q (q must be a core object, and p must be in its neighborhood)
    • Density-reachable: if there is a sequence of points q0, q1, ..., qk in which every qi is directly density-reachable from qi-1, then qk is density-reachable from q0; this is direct density-reachability chained transitively
    • Density-connected: if point q and point k are both density-reachable from some core point p, then q and k are density-connected
    • Border point (B, C): a non-core point that belongs to a cluster; the cluster cannot be extended any further through it
    • Noise point (N): a point that does not belong to any cluster; it is not density-reachable from any core point
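The definitions above can be illustrated with a small sketch that labels each point as core, border, or noise. The function names `region_query` and `classify_points` are hypothetical, Euclidean distance is assumed, and the neighborhood count is assumed to include the point itself (the text does not say either way).

```python
from math import dist

def region_query(points, i, r):
    """Indices of all points within distance r of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= r]

def classify_points(points, r, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighbors = [region_query(points, i, r) for i in range(len(points))]
    # Assumption: a point counts toward its own neighborhood size.
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not core, but directly density-reachable from a core point.
            labels.append("border")
        else:
            # Not in the neighborhood of any core point.
            labels.append("noise")
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
labels = classify_points(points, r=1.0, min_pts=3)
print(labels)
```

In this toy data set the four corners of the unit square are core points, (2, 1) is a border point attached through (1, 1), and (5, 5) is noise.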

Two. Basic process

Algorithm flow (inputs: data set, radius r, density threshold minPts):

    • Mark all objects as unvisited
    • Randomly select an unvisited object p and mark p as visited
    • If the r-neighborhood of p contains at least minPts objects
    1. Create a new cluster C and add p to C
    2. Let N be the set of objects in the r-neighborhood of p
    3. For each point q in N: if q is unvisited, mark q as visited, and if the r-neighborhood of q contains at least minPts objects, add those objects to N; if q is not yet a member of any cluster, add q to C
    • Otherwise mark p as noise
    • Repeat from the selection step until no object is marked as unvisited
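The flow above can be sketched in plain Python as follows. This is a minimal illustration rather than an optimized implementation: the name `dbscan`, the `seed` parameter, the label `-1` for noise, and the use of Euclidean distance are all assumptions, and each neighborhood is found by a brute-force scan.

```python
import random
from math import dist

NOISE = -1  # assumed label for noise points

def dbscan(points, r, min_pts, seed=0):
    """Return one cluster label per point; NOISE marks noise points."""
    n = len(points)
    labels = [None] * n      # None: not a member of any cluster yet
    visited = [False] * n

    def neighborhood(i):
        # Brute-force r-neighborhood; includes point i itself.
        return [j for j in range(n) if dist(points[i], points[j]) <= r]

    order = list(range(n))
    random.Random(seed).shuffle(order)   # random choice of unvisited objects
    cluster_id = -1
    for p in order:
        if visited[p]:
            continue
        visited[p] = True
        neighbors = neighborhood(p)
        if len(neighbors) < min_pts:
            labels[p] = NOISE            # may later be claimed as a border point
            continue
        cluster_id += 1                  # step 1: new cluster C containing p
        labels[p] = cluster_id
        queue = list(neighbors)          # step 2: the set N
        k = 0
        while k < len(queue):            # step 3: walk N while it grows
            q = queue[k]
            k += 1
            if not visited[q]:
                visited[q] = True
                q_neighbors = neighborhood(q)
                if len(q_neighbors) >= min_pts:
                    queue.extend(q_neighbors)   # q is a core point: grow N
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster_id   # q joins C if it has no cluster yet
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, r=1.0, min_pts=3)
print(labels)
```

The two squares end up in two different clusters and (5, 5) is labeled as noise; which square receives cluster id 0 depends on the random visiting order.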

Parameter selection:

    • Radius: can be set from the k-distance: look for the point where the sorted k-distance changes abruptly

      k-distance: given a data set P = {p(i); i = 0, 1, ..., n}, compute the distances from point p(i) to
      the points in a subset of the set D and sort them from small to large; d(k) is called the k-distance.

    • minPts: the value of k used in the k-distance; generally take a small value and try several
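A rough sketch of the k-distance idea is shown below. The helper names `k_distances` and `suggest_radius` are hypothetical, the k-distance is taken here to be the distance to the k-th nearest other point, and "the abrupt change" is approximated by the largest jump between consecutive sorted values. In practice the sorted curve is usually plotted and inspected by eye.

```python
from math import dist

def k_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result)

def suggest_radius(sorted_kdist):
    """Crude heuristic: the k-distance just before the largest jump in the curve."""
    jumps = [b - a for a, b in zip(sorted_kdist, sorted_kdist[1:])]
    knee = max(range(len(jumps)), key=jumps.__getitem__)
    return sorted_kdist[knee]

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
curve = k_distances(points, k=2)
radius = suggest_radius(curve)
print(curve, radius)
```

For this toy data the eight clustered points all have a 2-distance of 1.0 while the isolated point has a much larger one, so the heuristic suggests a radius of 1.0.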

Three. Pros and cons

Advantages

    • You do not need to specify the number of clusters
    • You can find clusters of any shape
    • Good at finding outliers
    • Only two parameters are needed

Disadvantages

    • High-dimensional data is hard to handle (dimensionality reduction can help)
    • The parameters are hard to choose, yet they have a very large effect on the results
    • The implementation in sklearn is slow (a data reduction strategy can help)
