First, the idea of the algorithm:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the maximal set of density-connected points, can divide regions of sufficient density into clusters, and can discover clusters of arbitrary shape in a spatial database containing noise.
A few definitions used in DBSCAN:
- ε-neighborhood: the region within radius ε of a given object is called the ε-neighborhood of that object;
- Core object: if the number of sample points in the ε-neighborhood of a given object is greater than or equal to MinPts, the object is called a core object;
- Directly density-reachable: for a sample set D, if a sample point q lies in the ε-neighborhood of p and p is a core object, then q is directly density-reachable from p;
- Density-reachable: for a sample set D, given a chain of sample points p1, p2, ..., pn with p = p1 and q = pn, if each pi is directly density-reachable from pi-1, then q is density-reachable from p;
- Density-connected: if there is a point o in the sample set D such that both p and q are density-reachable from o, then p and q are density-connected.
Note that density-reachability is the transitive closure of direct density-reachability, and this relation is asymmetric; density-connectedness, by contrast, is symmetric. The goal of DBSCAN is to find the maximal sets of density-connected objects.
Example: suppose the radius ε = 3 and MinPts = 3. The ε-neighborhood of point p is {m, p, p1, p2, o}, the ε-neighborhood of point m is {m, q, p, m1, m2}, the ε-neighborhood of point q is {q, m}, the ε-neighborhood of point o is {o, p, s}, and the ε-neighborhood of point s is {o, s, s1}.
Then the core objects are p, m, o, and s (q is not a core object, because its ε-neighborhood contains only 2 points, which is less than MinPts = 3);
Point m is directly density-reachable from point p, since m is in the ε-neighborhood of p and p is a core object;
Point q is density-reachable from point p, since q is directly density-reachable from m and m is directly density-reachable from p;
Point q is density-connected to point s, since both q and s are density-reachable from point p.
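To make these definitions concrete, here is a minimal Python sketch (illustrative only) that encodes the ε-neighborhoods from the example above and checks the core-object and reachability claims:

```python
# ε-neighborhoods from the example (ε = 3, MinPts = 3)
MINPTS = 3
neighborhood = {
    "p": {"m", "p", "p1", "p2", "o"},
    "m": {"m", "q", "p", "m1", "m2"},
    "q": {"q", "m"},
    "o": {"o", "p", "s"},
    "s": {"o", "s", "s1"},
}

def is_core(x):
    # A core object has at least MinPts points in its ε-neighborhood.
    return len(neighborhood.get(x, set())) >= MINPTS

def directly_reachable(p, q):
    # q is directly density-reachable from p iff p is a core object
    # and q lies in p's ε-neighborhood.
    return is_core(p) and q in neighborhood[p]

def density_reachable(p, q):
    # Transitive closure of direct density-reachability (search from p).
    seen, frontier = {p}, [p]
    while frontier:
        x = frontier.pop()
        if x not in neighborhood:
            continue
        for y in neighborhood[x]:
            if y not in seen and directly_reachable(x, y):
                seen.add(y)
                frontier.append(y)
    return q in seen

print([x for x in neighborhood if is_core(x)])  # p, m, o, s; q is not core
print(directly_reachable("p", "m"))             # True
print(density_reachable("p", "q"))              # True: p -> m -> q
# q and s are density-connected: both are density-reachable from p
print(density_reachable("p", "q") and density_reachable("p", "s"))
```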
Second, the advantages of the algorithm:
1. Compared with the k-means method, DBSCAN does not need to know in advance the number of clusters to be formed.
2. Compared with the k-means method, DBSCAN can find clusters of arbitrary shape.
3. At the same time, DBSCAN can identify noise points; it is robust to outliers and can even be used to detect them.
4. DBSCAN is not sensitive to the order of the samples in the database; that is, the input order of the patterns has little effect on the results. However, a sample on the boundary between two clusters may swing between them, depending on which cluster is detected first.
5. DBSCAN is designed to be used with databases that can accelerate region queries, for example by means of an R*-tree index.
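To illustrate advantages 1-3, a minimal sketch using scikit-learn (assuming scikit-learn is installed; the data set and parameter values are chosen only for the demonstration): DBSCAN receives no cluster count, recovers two non-convex moon-shaped clusters, and labels outliers as noise (-1).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape k-means cannot separate well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# Add a few far-away outliers.
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0]]])

db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # no cluster count needed
labels = db.labels_                          # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)          # typically 2
print("noise points:", np.sum(labels == -1))  # the injected outliers
```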
Third, the disadvantages of the algorithm:
1. DBSCAN does not handle high-dimensional data well.
2. DBSCAN does not handle data sets with varying densities well.
3. Because the DBSCAN algorithm operates directly on the entire data set and must build the corresponding R*-tree and draw the k-dist graph before clustering, it requires considerable memory and I/O. When computational resources are limited and the data volume is large, the efficiency of the DBSCAN algorithm suffers greatly. (The DBSCAN algorithm takes all of the unprocessed points returned by a region query as seed points for the next expansion step. For the larger classes in a large data set, this strategy inflates the number of seed points, and the memory required by the algorithm grows rapidly.)
4. Because the DBSCAN algorithm uses a single global density parameter, the clustering quality is poor when the densities of the classes are uneven or the distances between classes differ greatly. (If a smaller Eps value is chosen based on a higher-density class, then points in relatively low-density classes will have fewer than MinPts objects in their Eps-neighborhoods; these points are mistaken for border points and are not used to expand their class further, so a lower-density class is split into several classes with similar properties. Conversely, if a larger Eps value is chosen based on a lower-density class, classes that are close together and dense are merged, and the differences between them are ignored. Under these circumstances it is difficult to select a single global Eps value that yields accurate clustering results.)
5. DBSCAN is not completely deterministic: a border point that is density-reachable from more than one cluster can end up in either of them, depending on the order in which the data is processed.
6. The quality of DBSCAN depends on the distance measure used in the regionQuery(p, Eps) function. The most commonly used metric is the Euclidean distance, which becomes largely useless for high-dimensional data because of the so-called curse of dimensionality, making it difficult to find an appropriate value for Eps. Even for algorithms based on the Euclidean distance, it is hard to find a meaningful distance threshold Eps without a good understanding of the data and its scale.
7. When the density differences are large, the chosen MinPts and Eps combination cannot suit all clusters at the same time, so DBSCAN does not cluster such data well (see disadvantage 4).
8. The algorithm is sensitive to its input parameters: the values of Eps and MinPts are difficult to determine, and an improper choice causes the clustering quality to decline.
9. In the classical DBSCAN algorithm, the parameters Eps and MinPts remain fixed throughout the clustering process, which makes it difficult for the algorithm to adapt to data sets with uneven density.
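Disadvantages 2, 4, 8, and 9 all stem from the single global Eps value. As a quick illustration of this sensitivity (the data set and both Eps values below are invented for the demonstration), the same data clustered with two different Eps values gives very different results:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# One dense blob and one sparse blob: no single Eps suits both.
dense, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.2,
                      random_state=0)
sparse, _ = make_blobs(n_samples=200, centers=[[5, 5]], cluster_std=1.5,
                       random_state=0)
X = np.vstack([dense, sparse])

for eps in (0.2, 1.0):  # small Eps fits the dense blob, large the sparse one
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, "
          f"{np.sum(labels == -1)} noise points")
```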
Fourth, improvements to the algorithm:
1. Improvement to disadvantage 3: by selecting only some of the points in the neighborhood of a core point as seed points for expanding the class, the number of region queries can be greatly reduced, lowering the I/O overhead and enabling fast clustering.
2. Improvement to disadvantage 4: to solve the above problem, Zhou et al. proposed the PDBSCAN (Partitioning-based DBSCAN) algorithm. Based on data partitioning, it extends the DBSCAN algorithm by dividing the entire data space into smaller partitions according to the distribution characteristics of the data, clustering each local partition, and finally merging the local clustering results. The idea of PDBSCAN is as follows: first, according to the distribution of the data set in one or more dimensions, divide the entire data space into several local regions so that the data within each region is distributed evenly; then plot the k-dist graph for each local region and obtain an Eps value for each region in turn; then carry out local clustering with the DBSCAN algorithm; finally, combine the local clustering results to complete the cluster analysis of the whole data set. Because each local region is clustered with its own local Eps value, the deterioration in clustering quality caused by a single global Eps value is effectively mitigated.
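A greatly simplified sketch of this partition-then-cluster idea (not the published PDBSCAN implementation): the partitioning dimension, the local-Eps heuristic, and the function names are all assumptions, and the final merge step across partition edges is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def local_eps(X, k=4, quantile=0.95):
    # Heuristic local Eps: a high quantile of the k-distance distribution
    # (the distance from each point to its k-th nearest neighbor).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return np.quantile(dist[:, -1], quantile)

def partitioned_dbscan(X, n_partitions=2, min_samples=5):
    # Split along the first coordinate so each slice is roughly uniform
    # in density, then run DBSCAN on each slice with its own local Eps.
    edges = np.quantile(X[:, 0], np.linspace(0, 1, n_partitions + 1))
    labels = np.full(len(X), -1)
    next_label = 0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (X[:, 0] >= lo) & (X[:, 0] <= hi)
        part = X[mask]
        if len(part) <= min_samples:
            continue
        part_labels = DBSCAN(eps=local_eps(part),
                             min_samples=min_samples).fit_predict(part)
        part_labels[part_labels >= 0] += next_label  # keep labels disjoint
        labels[mask] = part_labels
        next_label = max(next_label, labels.max() + 1)
    return labels  # NOTE: clusters spanning a partition edge are not merged
```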
3. Improvements to disadvantage 8:
Handling of the input parameters of the DBSCAN algorithm, addressing its sensitivity to the input parameters (the cluster radius Eps and the minimum number of points MinPts): since the parameter settings usually depend on experience, it is difficult to select an appropriate Eps value and obtain accurate results when the data density varies greatly and the distances between classes are uneven. It is therefore not advisable to fix the parameter values beforehand; instead, the parameters should be adjusted according to the clustering results during the clustering process. For example, choose an appropriate evaluation function as a yardstick for the clustering results, then repeatedly adjust the parameters and re-cluster until the results meet the requirements. Although the DBSCAN algorithm provides a visual method, drawing the descending k-distance graph, to select Eps (a code sketch of this heuristic follows the list below), and the selected Eps value is usually already close to the "ideal" value, small gaps often remain, which can lead to large differences in the clustering results. The following methods may be considered for improvement:
(1) Sort all clustered objects so that the sequence runs from one cluster to the next, in the order cluster edge → cluster core → cluster edge. The object sequence then reflects the density-based cluster structure of the data space. Based on this information, it is easy to determine appropriate Eps values and then discover each cluster.
(2) Do not cluster the original data set directly. Instead, extract the high-density points from it to generate a new data set and adjust the density parameters; repeat this process until the resulting data set can be clustered easily, and then assign the remaining points to the classes layer by layer based on this result. This avoids the effect of the input parameters on the clustering results of the DBSCAN algorithm.
(3) Use the idea of kernel clustering: apply a non-linear transformation to the sample set so that its distribution becomes as even as possible, letting the kernel mapping bring out features that are not apparent in the original space, and then cluster with a global Eps parameter to obtain better results.
(4) When most of the clustering results are unsatisfactory because the chosen Eps value is too small, objects that should form one cluster are split into multiple sub-clusters. The separated sub-clusters share some objects and can be regarded as connected through these shared objects, but the DBSCAN algorithm simply discards this connection information. Therefore, by recording all cluster connection information, the user can merge the separated sub-clusters according to the actual clustering results and the recorded connections. This improves the clustering result, and the effect of changes in the input parameter Eps on the results is masked by the final merge step. The following two further improvements can be considered:
1) Parallelization.
As can be seen from the DBSCAN algorithm, the global Eps value affects the clustering quality, especially when the data distribution is not uniform. Therefore, consider partitioning the data so that the distribution within each partition is relatively uniform, and select the Eps value according to the density of the data in each partition. On the one hand, this reduces the influence of the global Eps value; on the other hand, since there are multiple partitions, parallel processing can be used to improve clustering efficiency and reduce the high memory requirements of the DBSCAN algorithm.
2) Incremental processing.
When data is added or deleted, only the classes affected by the added or deleted data are considered. This approach is very effective for large data sets: there is no need to re-cluster the whole database; the classes are updated incrementally, and the discovered classes are repaired and strengthened. In addition, because of the complexity of high-dimensional data, the efficiency and practicality of cluster analysis on such data are poor, so the dimensionality of the clustering space can be reduced by identifying the data dimensions strongly correlated with the clustering topic. Dimensionality reduction lowers the structural complexity of the data. At present, a variety of dimensionality-reduction techniques can be used to reduce the feature space; the method should be chosen so that the information loss after reduction stays within an acceptable range.
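Returning to the descending k-distance graph mentioned at the start of this improvement, here is a sketch of that heuristic (the function name and the default k = 4 are illustrative): sort every point's distance to its k-th nearest neighbor in descending order and read a candidate Eps off near the "elbow" where the curve bends sharply.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    # Distance from each point to its k-th nearest neighbor (column 0
    # returned by kneighbors() is the point itself, at distance 0).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    kth = np.sort(dist[:, -1])[::-1]  # descending k-distance curve
    plt.plot(kth)
    plt.xlabel("points sorted by descending k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
    return kth
```

With MinPts set to k + 1, the curve value at the elbow is a candidate Eps; as noted above, this value is usually close to, but not exactly, the ideal one.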
4. Improvements to disadvantage 9:
2.1 Adaptive selection of the Eps parameter
For unevenly distributed data, each point is similar to the data surrounding it. Therefore, for each point, use the average distance to its k nearest points as a measure of the density at that point: for any point p, select the k points nearest to p according to the distance matrix and compute their average distance. In this way, every point obtains an average k-nearest-neighbor distance.
Then apply DBSCAN clustering to the one-dimensional set of these average distances. For each class i in the clustering result, find the point with the maximum average distance; the distance from that point to its k-th nearest point is taken as Eps_i, the neighborhood threshold of that class, and saved for clustering. This way of finding Eps takes into account that data sets of different densities should be clustered with thresholds appropriate to each density. Because each parameter used in clustering only has to capture the density variation within a single class of the clustering result, errors in parameter selection no longer have a great effect on the final result.
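A sketch of this adaptive scheme as described (the function name, the choice of k, and the parameters for the one-dimensional clustering step are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def adaptive_eps_list(X, k=4, eps_1d=0.05, min_samples=5):
    # Step 1: each point's average distance to its k nearest neighbors.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    avg_kdist = dist[:, 1:].mean(axis=1)  # column 0 is the point itself

    # Step 2: DBSCAN-cluster the one-dimensional average distances; each
    # resulting class corresponds to one density level in the data.
    labels = DBSCAN(eps=eps_1d, min_samples=min_samples).fit_predict(
        avg_kdist.reshape(-1, 1))

    # Step 3: in each class i, take the point with the largest average
    # distance; its k-th nearest-neighbor distance becomes Eps_i.
    eps_list = []
    for i in range(labels.max() + 1):
        members = np.flatnonzero(labels == i)
        worst = members[np.argmax(avg_kdist[members])]
        eps_list.append(dist[worst, -1])
    return sorted(eps_list)  # small to large, ready for the loop in 2.2
```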
2.2 DBSCAN clustering with variable parameters
1) Sort the neighborhood thresholds Eps_i derived in 2.1 from small to large, in preparation for clustering;
2) Select the minimum neighborhood threshold (MinPts can remain unchanged) and apply DBSCAN clustering to the data;
3) Then, using the next neighborhood threshold and MinPts as parameters, apply DBSCAN clustering again, but only to the data labelled as noise;
4) Loop in this way until all neighborhood thresholds have been used, at which point the clustering ends.
Across the multiple clustering passes, the neighborhood thresholds are applied from small to large. When clustering with a smaller threshold, the data belonging to sparser classes is left unprocessed because the threshold is not met, so a smaller distance threshold only handles the higher-density points and does not affect the low-density data. A sketch of this loop follows.
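A sketch of the loop in steps 1)-4) above (the function name is illustrative; adaptive_eps_list is the helper sketched in 2.1): each pass reclusters only the points still labelled as noise, using the next larger threshold.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def variable_eps_dbscan(X, eps_list, min_samples=5):
    # eps_list: neighborhood thresholds Eps_i sorted from small to large,
    # e.g. the output of adaptive_eps_list() sketched in 2.1.
    labels = np.full(len(X), -1)  # -1 = noise / not yet clustered
    next_label = 0
    for eps in sorted(eps_list):
        noise = np.flatnonzero(labels == -1)
        if len(noise) < min_samples:
            break
        # Recluster only the points still labelled as noise; denser data
        # has already been absorbed by the smaller thresholds.
        sub = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[noise])
        found = sub >= 0
        labels[noise[found]] = sub[found] + next_label
        if found.any():
            next_label = labels.max() + 1
    return labels
```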
"Machine learning" DBSCAN algorithms density-based clustering algorithm