Original link: http://www.cnblogs.com/chaosimple/p/3164775.html#undefined
1. DBSCAN Introduction
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm. It groups regions of sufficiently high density into clusters, can discover clusters of arbitrary shape in a noisy spatial database, and defines a cluster as the maximal set of density-connected points.
The algorithm relies on a density criterion: the number of objects (points or other spatial objects) within a given region of the cluster space must be no less than a given threshold. Its main advantages are that clustering is fast, noise points are handled effectively, and spatial clusters of arbitrary shape can be found. However, because it operates directly on the entire database and clusters with a single global set of density parameters, it also has two obvious weaknesses:
(1) As the amount of data grows, it needs a large amount of memory, and the I/O cost is also high;
(2) When cluster densities are uneven and the distances between clusters vary widely, clustering quality is poor.
2. Comparison of DBSCAN and traditional clustering algorithms
The purpose of the DBSCAN algorithm is to filter out low-density regions and find dense sample points. Unlike traditional hierarchical clustering and partitioning methods, which tend to produce convex clusters, it can find clusters of arbitrary shape. Compared with traditional algorithms it has the following advantages:
(1) Unlike K-means, it does not need the number of clusters as input;
(2) It has no bias toward a particular cluster shape;
(3) Parameters for filtering noise can be supplied when needed.
3. Basic definitions used by the algorithm
(1) ε-neighborhood: The region within radius ε of a given object is called the ε-neighborhood of that object.
(2) Core object: If the number of sample points in the ε-neighborhood of a given object is greater than or equal to MinPts, the object is called a core object.
(3) Directly density-reachable: Given a set of objects D, if p is in the ε-neighborhood of q and q is a core object, then p is directly density-reachable from q.
(4) Density-reachable: For a sample set D, if there is a chain of objects p1, p2, ..., pn with p1 = q and pn = p, such that each p(i+1) is directly density-reachable from p(i) with respect to ε and MinPts, then p is density-reachable from q with respect to ε and MinPts.
(5) Density-connected: If there is an object o in D such that both p and q are density-reachable from o with respect to ε and MinPts, then p and q are density-connected with respect to ε and MinPts.
It follows that density-reachability is the transitive closure of direct density-reachability, and that this relation is asymmetric: only core objects are mutually density-reachable. Density-connectedness, however, is a symmetric relation. The goal of DBSCAN is to find the maximal sets of density-connected objects.
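The first three definitions translate almost directly into code. The following is a minimal sketch, assuming 2-D points stored as tuples and Euclidean distance. The function names, the toy data, and the parameter values are made up for illustration and are not part of the original article.

```python
from math import dist  # Euclidean distance, available since Python 3.8

def neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core object has at least min_pts points in its eps-neighborhood."""
    return len(neighborhood(points, i, eps)) >= min_pts

def directly_reachable(points, p, q, eps, min_pts):
    """True if points[p] is directly density-reachable from points[q]."""
    return is_core(points, q, eps, min_pts) and p in neighborhood(points, q, eps)

# Hypothetical toy data: points 0 and 2 are core objects, points 1 and 3 are not.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.0)]
eps, min_pts = 1.0, 3

print(directly_reachable(points, 3, 2, eps, min_pts))  # True: 2 is core and 3 is in its neighborhood
print(directly_reachable(points, 2, 3, eps, min_pts))  # False: 3 is not a core object
```

The two calls at the end show the asymmetry: point 3 is directly density-reachable from the core point 2, but not the other way around, because point 3 has too few neighbors to be a core object.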
4. Clustering process of the DBSCAN algorithm
The DBSCAN algorithm rests on the following fact: a cluster is uniquely determined by any one of its core objects. Equivalently: for any data object p that satisfies the core-object condition, the set of all data objects in database D that are density-reachable from p forms a complete cluster C, and p belongs to C.
The specific clustering process for the algorithm is as follows:
Scan the data set to find any core point and expand it. Expansion means finding all data points that are density-connected to that core point (note: density-connected). To do this, traverse every core point in the neighborhood of the core point (border points cannot be expanded) and look for the points density-connected to them, until no data point remains to be expanded. The border points of the resulting cluster are all non-core points. Then rescan the data set (skipping every point already assigned to a cluster), look for a core point that has not yet been clustered, and repeat the steps above until the data set contains no new core point. Data points that belong to no cluster are treated as noise.
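The process described above can be sketched in a few lines of Python. This is a simple illustration rather than a reference implementation: it assumes Euclidean distance, uses a brute-force neighbor search (quadratic in the number of points), and the names `dbscan` and `NOISE` as well as the sample data are chosen here for the example.

```python
from collections import deque
from math import dist

NOISE = -1  # label used in this sketch for points that end up in no cluster

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster and expand it
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # only core points are expanded further
                queue.extend(nbrs)
    return labels

# Hypothetical sample: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The queue holds the points still waiting to be examined for the current cluster. A point that was provisionally marked as noise can still be picked up later as a border point of a cluster, which is why noise labels are only final once the outer loop has finished.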
5. Algorithm pseudo-code
Algorithm Description:
Algorithm: DBSCAN
Input: ε -- radius
       MinPts -- the minimum number of neighbors within the ε-neighborhood for a point to be a core object
       D -- the data set
Output: the set of clusters
Method: Repeat
          1) Determine whether the input point is a core object
          2) Find all directly density-reachable points in the ε-neighborhood of the core object
        Until all input points have been examined
        Repeat
          For the directly density-reachable points in the ε-neighborhoods of all core objects, find the maximal set of density-connected objects, merging density-reachable objects along the way
        Until the ε-neighborhoods of all core objects have been traversed
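One way to read the pseudo-code is as two separate passes: first compute every neighborhood and mark the core objects, then merge the core objects that reach each other. The sketch below follows that reading. It assumes Euclidean distance and stores all neighborhoods in memory, and the function name `dbscan_two_phase` and the sample data are invented for this example.

```python
from math import dist

def dbscan_two_phase(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    # Phase 1: the eps-neighborhood of every point, and the list of core objects.
    hood = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [i for i in range(n) if len(hood[i]) >= min_pts]

    # Phase 2: grow one cluster per connected group of core objects.
    labels = [-1] * n  # -1 marks noise until some cluster claims the point
    cluster = 0
    for c in core:
        if labels[c] != -1:
            continue  # this core object was already merged into a cluster
        labels[c] = cluster
        stack = [c]  # the stack only ever holds core objects
        while stack:
            i = stack.pop()
            for j in hood[i]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if len(hood[j]) >= min_pts:  # only core objects keep expanding
                        stack.append(j)
        cluster += 1
    return labels

# Hypothetical sample: a chain of four points plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(dbscan_two_phase(points, eps=1.0, min_pts=3))  # [0, 0, 0, 0, -1]
```

In the sample, only the two middle points of the chain are core objects. The two end points are border points that join the cluster through them, and the far-away point stays noise. Storing every neighborhood up front keeps the second pass simple, but it is also the kind of memory cost the introduction lists as a weakness for large data sets.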
[Repost] Common clustering algorithms (1): the DBSCAN algorithm