DBSCAN algorithm
Basic concepts (Density-Based Spatial Clustering of Applications with Noise)
Core point (core object): a point whose density reaches the threshold set by the algorithm, that is, a point whose r-neighborhood contains at least MinPts points.
ε-neighborhood distance threshold: the chosen radius r.
Directly density-reachable: if point p lies in the r-neighborhood of point q, and q is a core point, then p is directly density-reachable from q.
Density-reachable: if there is a sequence of points q0, q1, ..., qk in which each qi is directly density-reachable from qi-1, then qk is density-reachable from q0. This is direct density-reachability propagated along a chain, much like recruiting a downline in multi-level marketing.
Density-connected: if points q and k are both density-reachable from some core point p, then q and k are density-connected.
Border point: a non-core point that belongs to a cluster. It lies in the neighborhood of a core point but cannot extend the cluster any further (it cannot recruit a downline of its own).
Noise point: a point that does not belong to any cluster because it is not density-reachable from any core point; also known as an outlier.
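The three point types can be illustrated with a small sketch. It assumes Euclidean distance and counts a point as a member of its own neighborhood; neither choice is stated in the text above. The helper names region_query and classify_points are hypothetical.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    # A core point has at least min_pts points in its neighborhood.
    is_core = [len(n) >= min_pts for n in neighborhoods]
    labels = []
    for i, neigh in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neigh):
            # Not core, but directly density-reachable from a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Toy data: a unit square, one point hanging off its edge, one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

In this toy data the four corners of the square come out as core points, (2.0, 1.0) as a border point, and (8.0, 8.0) as noise.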
Workflow
Given:
Parameter D: the input dataset
Parameter ε: the neighborhood radius
Parameter MinPts: the density threshold (e.g. 5)
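The notes list only the inputs, so here is a minimal sketch of how the commonly described DBSCAN procedure might use them. It assumes Euclidean distance, a brute-force neighbor search, and -1 as the noise label; the function name dbscan and its internals are illustrative, not a reference implementation.

```python
import math
from collections import deque

NOISE = -1  # assumed label for noise points

def dbscan(D, eps, min_pts):
    """Return one cluster label per point in D; NOISE marks noise points."""
    def neighbors(i):
        # Brute-force ε-neighborhood, including the point itself.
        return [j for j in range(len(D)) if math.dist(D[i], D[j]) <= eps]

    labels = [None] * len(D)  # None means "not visited yet"
    cluster = -1
    for i in range(len(D)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # previously noise, now a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # only core points extend the cluster
    return labels

# Toy data: two unit squares far apart, plus one isolated point.
D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
     (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),
     (10.0, 0.0)]
result = dbscan(D, eps=1.0, min_pts=3)
print(result)
```

The brute-force search costs O(n²) distance computations, which is acceptable for a toy example; practical implementations typically use a spatial index instead.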
Parameter selection:
Radius ε: can be set from the k-distance by looking for the abrupt change (the "knee") in the sorted k-distance values.
k-distance: given the dataset P = {p(i); i = 0, 1, ..., n}, for each point p(i) compute the distances to the other points in the dataset, sort them in ascending order, and call the k-th value, d(k), the k-distance of p(i).
MinPts: the k used for the k-distance; generally start with a small value and try several.
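A rough sketch of the k-distance idea follows, assuming Euclidean distance. The "largest jump" rule used to pick a candidate ε is a simple hypothetical heuristic for locating the abrupt change, not a method given in these notes; in practice the sorted curve is usually inspected by eye.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Toy data: a tight unit square plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
curve = k_distances(points, k=2)

# Hypothetical heuristic: take the value just before the largest jump in the curve.
jumps = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
knee = jumps.index(max(jumps))
eps_candidate = curve[knee]
print(curve, eps_candidate)
```

Here the four square points share a 2-distance of 1.0, while the distant point's 2-distance is far larger, so the jump sits between them and the candidate ε is 1.0.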
Advantages:
- No need to specify the number of clusters in advance
- Can find clusters of arbitrary shape
- Good at finding outliers (anomaly detection tasks)
- Only two parameters are needed
Disadvantages:
- High-dimensional data is difficult to handle (dimensionality reduction can help)
- Parameters are hard to choose, and they strongly affect the result
- The sklearn implementation can be slow (a data reduction strategy can help)