1. An overview of density-based clustering algorithms

Recently, a density-based clustering algorithm published in Science, "Clustering by fast search and find of density peaks", attracted a lot of attention (the density-peaks clustering algorithm is also described in my blog post "A simple and easy-to-learn machine learning algorithm", in Chinese). This made me want to understand density-based clustering algorithms better, and in particular how they differ from distance-based clustering algorithms such as K-means. The main goal of a density-based clustering algorithm is to find high-density regions separated by low-density regions. Unlike distance-based algorithms, whose clusters are spherical, density-based algorithms can discover clusters of arbitrary shape, which matters when the data contain noise points.

2. The principle of the DBSCAN algorithm

2.1 Basic concepts

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a typical density-based clustering algorithm. In DBSCAN, the data points are divided into three categories:
- Core point: a point whose radius-Eps neighborhood contains more than MinPts points.
- Boundary point: a point whose radius-Eps neighborhood contains fewer than MinPts points, but which falls within the neighborhood of some core point.
- Noise point: a point that is neither a core point nor a boundary point.

So there are two parameters here: the radius Eps and the specified number of points MinPts. A few other concepts are also needed:

- Eps-neighborhood: simply put, the set of points whose distance to a given point p is no more than Eps; it can be written as N_Eps(p) = { q | dist(p, q) <= Eps }.
- Directly density-reachable: if an object q lies within the Eps-neighborhood of a core object p, then q is said to be directly density-reachable from p.
- Density-reachable: for a chain of objects p_1, p_2, ..., p_n with p_1 = q and p_n = p, if each p_{i+1} is directly density-reachable from p_i with respect to Eps and MinPts, then p is density-reachable from q with respect to Eps and MinPts.
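The three point categories above can be computed directly from the Eps-neighborhoods. The following is a minimal illustrative Python sketch (not the article's MATLAB code; names such as classify_points and the sample points are made up here):

```python
# Minimal sketch of DBSCAN's point categories: core / border / noise.
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'.

    A point is 'core' if its Eps-neighborhood (point itself included)
    holds at least min_pts points, 'border' if it is not core but lies
    in some core point's neighborhood, and 'noise' otherwise.
    """
    n = len(points)
    # Eps-neighborhood of each point: indices of points within radius eps.
    neigh = [[j for j in range(n)
              if math.dist(points[i], points[j]) <= eps] for i in range(n)]
    labels = []
    for i in range(n):
        if len(neigh[i]) >= min_pts:
            labels.append("core")
        elif any(len(neigh[j]) >= min_pts for j in neigh[i]):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),  # dense block
       (2.0, 0.5),                                   # on the fringe
       (9, 9)]                                       # isolated
print(classify_points(pts, eps=1.2, min_pts=4))
# → ['core', 'core', 'core', 'core', 'core', 'border', 'noise']
```

Note that the fringe point is labeled "border" only because it falls inside the neighborhood of a core point; the isolated point belongs to no core neighborhood and is therefore noise.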
2.2 Algorithm flow

(The flow chart given as an image in the original post is not reproduced here.)

3. Experimental simulation

Two test data sets are used in the experiment; their original plots are as follows: (DataSet 1) (DataSet 2). DataSet 1 is relatively simple: it clearly contains two classes, while DataSet 2 contains four classes. Below we use the DBSCAN algorithm to cluster the data points.

MATLAB code

Main program:
%% DBSCAN
clear all;
clc
%% import the data set
% data = load('testData.txt');
data = load('testdata_2.txt');
% define the parameters Eps and MinPts
MinPts = 5;
Eps = epsilon(data, MinPts);
[m,n] = size(data);
x = [(1:m)' data];       % prepend each point's index as its first column
[m,n] = size(x);         % recalculate the data set size
types = zeros(1,m);      % distinguishes core points (1), boundary points (0) and noise points (-1)
dealed = zeros(m,1);     % marks whether a point has been processed; 0 means not yet
class = zeros(1,m);      % cluster label of each point (0 = unassigned)
dis = calDistance(x(:,2:n));
number = 1;              % used to label the classes
%% process every point
for i = 1:m
    % find an unprocessed point
    if dealed(i) == 0
        D = dis(i,:);            % distances from point i to all other points
        ind = find(D <= Eps);    % all points within the radius Eps
        %% classify the point
        % boundary point
        if length(ind) > 1 && length(ind) < MinPts + 1
            types(i) = 0;
            class(i) = 0;
        end
        % noise point
        if length(ind) == 1
            types(i) = -1;
            class(i) = -1;
            dealed(i) = 1;
        end
        % core point (this is the key step)
        if length(ind) >= MinPts + 1
            types(i) = 1;
            class(ind) = number;
            % expand the cluster through every density-reachable point
            while ~isempty(ind)
                yTemp = x(ind(1),:);
                dealed(ind(1)) = 1;
                ind(1) = [];
                D = dis(yTemp(1),:);     % distances from the point just dequeued
                ind_1 = find(D <= Eps);
                if length(ind_1) > 1     % handle non-noise points
                    class(ind_1) = number;
                    if length(ind_1) >= MinPts + 1
                        types(yTemp(1)) = 1;
                    else
                        types(yTemp(1)) = 0;
                    end
                    for j = 1:length(ind_1)
                        if dealed(ind_1(j)) == 0
                            dealed(ind_1(j)) = 1;
                            ind = [ind ind_1(j)];
                            class(ind_1(j)) = number;
                        end
                    end
                end
            end
            number = number + 1;
        end
    end
end
% finally, treat every still-unclassified point as noise
ind_2 = find(class == 0);
class(ind_2) = -1;
types(ind_2) = -1;
% draw the final clustering figure
hold on
for i = 1:m
    if class(i) == -1
        plot(data(i,1), data(i,2), '.r');
    elseif class(i) == 1
        if types(i) == 1
            plot(data(i,1), data(i,2), '+b');
        else
            plot(data(i,1), data(i,2), '.b');
        end
    elseif class(i) == 2
        if types(i) == 1
            plot(data(i,1), data(i,2), '+g');
        else
            plot(data(i,1), data(i,2), '.g');
        end
    elseif class(i) == 3
        if types(i) == 1
            plot(data(i,1), data(i,2), '+c');
        else
            plot(data(i,1), data(i,2), '.c');
        end
    else
        if types(i) == 1
            plot(data(i,1), data(i,2), '+k');
        else
            plot(data(i,1), data(i,2), '.k');
        end
    end
end
hold off
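The main program above follows the standard DBSCAN flow: pick an unvisited core point, open a new cluster, and keep expanding the cluster through density-reachable neighbors until the seed queue is empty. As a compact cross-check, the same flow can be sketched in Python (illustrative only; the function and variable names here are my own):

```python
# Compact sketch of the DBSCAN flow: find a core point, start a cluster,
# expand it through density-reachable points via a seed queue.
import math

def dbscan(points, eps, min_pts):
    """Return a cluster id (1, 2, ...) per point; -1 marks noise."""
    n = len(points)
    neigh = [[j for j in range(n)
              if math.dist(points[i], points[j]) <= eps] for i in range(n)]
    labels = [None] * n              # None = unvisited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < min_pts:
            labels[i] = -1           # provisionally noise (may become border)
            continue
        cluster += 1                 # i is a core point: open a new cluster
        labels[i] = cluster
        seeds = list(neigh[i])       # queue of points still to expand,
        while seeds:                 # like the MATLAB while ~isempty(ind)
            q = seeds.pop()
            if labels[q] == -1:      # former noise reclassified as border
                labels[q] = cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neigh[q]) >= min_pts:   # q is itself core: keep growing
                seeds.extend(neigh[q])
    return labels

# two well-separated blobs plus one outlier
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=3))
# → [1, 1, 1, 1, 2, 2, 2, 2, -1]
```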
Distance calculation function
%% compute the distance between every pair of points in the matrix
function [dis] = calDistance(x)
    [m,n] = size(x);
    dis = zeros(m,m);
    for i = 1:m
        for j = i:m
            % Euclidean distance between point i and point j
            tmp = 0;
            for k = 1:n
                tmp = tmp + (x(i,k) - x(j,k)).^2;
            end
            dis(i,j) = sqrt(tmp);
            dis(j,i) = dis(i,j);
        end
    end
end
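The triple loop in calDistance is easy to follow but slow in array languages; the whole distance matrix can instead be computed in one vectorized step via the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. A NumPy sketch of the same computation (the name pairwise_distances is illustrative):

```python
# Vectorized pairwise Euclidean distance matrix, equivalent to the
# triple-loop calDistance but computed in one shot.
import numpy as np

def pairwise_distances(x):
    """Full m-by-m Euclidean distance matrix for the rows of x."""
    sq = np.sum(x**2, axis=1)                       # ||a||^2 per row
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.maximum(d2, 0.0))             # clip tiny negatives from rounding

x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(pairwise_distances(x))
# → [[ 0.  5. 10.]
#    [ 5.  0.  5.]
#    [10.  5.  0.]]
```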
Epsilon function
function [Eps] = epsilon(x,k)
% Function: [Eps] = epsilon(x,k)
%
% Aim:
% Analytical way of estimating the neighborhood radius for DBSCAN
%
% Input:
% x - data matrix (m,n); m objects, n variables
% k - number of objects in a neighborhood of an object
%     (minimal number of objects considered as a cluster)

[m,n] = size(x);
Eps = ((prod(max(x) - min(x)) * k * gamma(0.5*n + 1)) / (m * sqrt(pi^n)))^(1/n);
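The heuristic implemented by the epsilon function above (it comes from reference [2]) estimates Eps as the radius of an n-dimensional ball that would contain about k points if the m points were spread uniformly over the data's bounding box. It can be transcribed to Python as follows (an illustrative transcription, not the authors' code):

```python
# Analytical Eps heuristic: radius of an n-ball expected to hold k points
# under a uniform spread of the m points over the data's bounding box.
import math

def epsilon(x, k):
    """x: list of m points (each a tuple of n coordinates); k: MinPts."""
    m, n = len(x), len(x[0])
    ranges = [max(p[d] for p in x) - min(p[d] for p in x) for d in range(n)]
    volume = math.prod(ranges)                       # bounding-box volume
    return ((volume * k * math.gamma(0.5 * n + 1))
            / (m * math.sqrt(math.pi ** n))) ** (1.0 / n)

# 4 corners of the unit square with k = 4 gives sqrt(1/pi)
print(round(epsilon([(0, 0), (0, 1), (1, 0), (1, 1)], 4), 4))
# → 0.5642
```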
Final results: (clustering result of DataSet 1) (clustering result of DataSet 2). In the results above, red dots represent noise points, plain dots represent boundary points, and crosses represent core points; different colors represent different classes.

References
[1] M. Ester, H.-P. Kriegel, J. Sander, X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. www.dbs.informatik.uni-muenchen.de/cgi-bin/papers?query=--CO
[2] M. Daszykowski, B. Walczak, D. L. Massart. Looking for natural patterns in data. Part 1: Density-based approach.
(From the series "A simple and easy-to-learn machine learning algorithm": the density-based clustering algorithm DBSCAN.)