I. Basic Concepts
DBSCAN algorithm
- Core point (A): a point whose density reaches the threshold set by the algorithm, i.e. the number of points within its r-neighborhood is not less than minpts
- Neighborhood distance threshold: the radius r
- Directly density-reachable: if point p lies in the r-neighborhood of point q and q is a core point, then p is directly density-reachable from q (a core object reaches the points in its own neighborhood)
- Density-reachable: if there is a sequence of points q0, q1, ..., qk such that each qi is directly density-reachable from qi-1, then qk is density-reachable from q0; this is the transitive closure of direct density-reachability
- Density-connected: if points q and k are both density-reachable from some core point p, then q and k are density-connected
- Border point (B, C): a non-core point that belongs to a cluster; it cannot expand the cluster any further
- Noise point (N): a point that belongs to no cluster and is not density-reachable from any core point
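The core / border / noise distinction above can be sketched in plain Python. The function name `classify` and the parameter names `r` and `min_pts` are illustrative, not from the source; the neighborhood of a point includes the point itself, as in the usual DBSCAN convention:

```python
from math import dist  # Euclidean distance, Python 3.8+

def classify(points, r, min_pts):
    """Label each point "core", "border", or "noise" for radius r and
    density threshold min_pts. Illustrative sketch only."""
    # r-neighborhood of each point (the point itself is included)
    neighborhoods = {
        i: [j for j, q in enumerate(points) if dist(p, q) <= r]
        for i, p in enumerate(points)
    }
    # core points: neighborhood contains at least min_pts points
    core = {i for i, nb in neighborhoods.items() if len(nb) >= min_pts}
    labels = {}
    for i in range(len(points)):
        if i in core:
            labels[i] = "core"
        elif any(j in core for j in neighborhoods[i] if j != i):
            labels[i] = "border"  # non-core, but inside a core point's neighborhood
        else:
            labels[i] = "noise"   # density-unreachable from every core point
    return labels
```

For example, with `r=1.5` and `min_pts=3`, a tight square of four points comes out as core, a point hanging off one corner as border, and a far-away point as noise.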
II. Basic Process
Algorithm flow (inputs: data set, radius r, density threshold minpts):
- Mark all objects as unvisited
- Randomly select an unvisited object p and mark p as visited
- If the r-neighborhood of p contains at least minpts objects:
- Create a new cluster C and add p to C
- Let N be the set of objects in the r-neighborhood of p
- For each point p' in N: if p' is unvisited, mark it as visited, and if the r-neighborhood of p' contains at least minpts objects, add those objects to N; if p' is not yet a member of any cluster, add p' to C
- Otherwise, mark p as noise
- Repeat until there are no objects marked as unvisited
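The steps above can be sketched as a minimal pure-Python implementation. Function and variable names (`dbscan`, `region`, `seeds`) are my own choices, and `-1` is used as the noise label; this is a sketch of the flow, not a reference implementation:

```python
from math import dist  # Euclidean distance, Python 3.8+

def dbscan(points, r, min_pts):
    """Return a cluster id for each point, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def region(i):
        # the r-neighborhood of points[i], including the point itself
        return [j for j, q in enumerate(points) if dist(points[i], q) <= r]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue                      # already visited (clustered or noise)
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # may later be re-labelled as a border point
            continue
        labels[i] = cluster_id            # create a new cluster C and add p to it
        seeds = list(neighbors)           # N: the expanding set of candidates
        k = 0
        while k < len(seeds):
            j = seeds[k]
            if labels[j] is UNVISITED or labels[j] == NOISE:
                if labels[j] is UNVISITED:
                    nb = region(j)
                    if len(nb) >= min_pts:
                        seeds.extend(nb)  # j is also a core point: grow N
                labels[j] = cluster_id    # add j to C
            k += 1
        cluster_id += 1
    return labels
```

With the same toy data as before (`r=1.5`, `min_pts=3`), the four tight points plus the border point form cluster 0 and the outlier is labelled -1.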
Parameter selection:
- Radius r: can be set from the k-distance curve by finding its mutation point (the "elbow")
K-distance: for a given dataset P = {p(i), i = 0, 1, ..., n}, compute the distance from each point p(i) to the other points of the set D, and sort these distances in ascending order; the k-th of them, d(k), is called the k-distance.
- minpts: the value of k used for the k-distance; generally take it small and try several values
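A sketch of computing the k-distance curve; sorting the per-point k-distances and looking for the elbow in the resulting curve suggests a candidate radius r. The helper name `k_distances` is illustrative:

```python
from math import dist  # Euclidean distance, Python 3.8+

def k_distances(points, k):
    """For each point, the distance to its k-th nearest other point,
    returned in ascending order (the k-distance curve)."""
    out = []
    for p in points:
        # distances from p to every other point, smallest first
        ds = sorted(dist(p, q) for q in points if q is not p)
        out.append(ds[k - 1])  # k-th nearest neighbour distance
    return sorted(out)
```

Plotting the returned list (index on the x-axis, distance on the y-axis) makes the mutation point visible: values to its left are typical intra-cluster distances, values to its right belong to outliers.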
III. Pros and Cons
Advantages
- You do not need to specify the number of clusters
- You can find clusters of any shape
- Good at finding outliers.
- Only two parameters
Disadvantages
- Struggles with high-dimensional data (dimensionality reduction can help)
- The parameters are hard to choose, yet they have a large effect on the results
- The sklearn implementation is slow (consider a data-reduction strategy)