I. Overview of Algorithms
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the maximal set of density-connected points. It can partition regions of sufficient density into clusters and discover clusters of arbitrary shape in a noisy spatial database (the author believes this is because DBSCAN is not purely distance-based, and distance-based methods tend to discover only spherical clusters).
The algorithm uses a density-based notion of clustering: the number of objects (points or other spatial objects) contained within a given region of the cluster space must be no less than a given threshold. DBSCAN's significant advantages are that it clusters quickly, handles noise points effectively, and discovers spatial clusters of arbitrary shape. However, because it operates directly on the entire database and clusters with a single global set of density parameters, it also has two obvious weaknesses:
(1) When the amount of data increases, it requires large amounts of memory, and its I/O consumption is also very large;
(2) When the density of the clusters is not uniform and the within-cluster distances differ greatly, clustering quality is poor. Some clusters have small internal distances and others large ones, but Eps is fixed: if Eps is too small, points of the sparser clusters may be mistaken for outliers or border points; if Eps is too large, the denser clusters may absorb outliers or border points. (The parameter k in KNN suffers from the same problem.)
Compared with traditional clustering algorithms, DBSCAN also has clear advantages:
(1) Compared with K-means, there is no need to input the number of clusters to be found;
(2) It has no bias toward any particular cluster shape (the author admits to not fully understanding this point);
(3) Parameters for filtering noise can be supplied when needed.
II. Basic Definitions of the Algorithm
III. Algorithm Description
3.1 Algorithm Prerequisites
The DBSCAN algorithm is based on the fact that a cluster is uniquely determined by any one of its core objects. Equivalently: for any data object p that satisfies the core-object condition, the set of all data objects in database D that are density-reachable from p forms a complete cluster C, and p belongs to C.
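The core-object condition above can be sketched in a few lines of Python. This is an illustrative helper, not part of the original MATLAB code; the name `is_core` and the use of 2-D points are assumptions for the example.

```python
import math

def is_core(points, i, eps, min_pts):
    """Core-object test: point i is a core object if its Eps-neighborhood
    (including i itself) contains at least min_pts points."""
    px, py = points[i]
    count = sum(1 for qx, qy in points
                if math.hypot(px - qx, py - qy) <= eps)
    return count >= min_pts

# A dense triple of points plus one isolated point:
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (3.0, 3.0)]
print(is_core(points, 0, 0.5, 3))  # the dense point is a core object
print(is_core(points, 3, 0.5, 3))  # the isolated point is not
```

Note that the neighborhood count includes the point itself, which matches the `MinPts + 1` comparisons in the MATLAB code below (where the distance matrix row also contains the zero distance from a point to itself).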
3.2 Algorithmic Flow
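The flow of the algorithm (pick an unprocessed point, test the core-point condition, then expand the cluster through every density-reachable point) can be sketched in Python. This is a minimal, self-contained sketch for 2-D points, not the original implementation; the names `region_query` and `dbscan` are illustrative, and `-1` marks noise as in the MATLAB code below.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    px, py = points[i]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    NOISE, UNSEEN = -1, 0
    labels = [UNSEEN] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:      # not a core point
            labels[i] = NOISE             # may later become a border point
            continue
        cluster += 1                      # start a new cluster from this core point
        labels[i] = cluster
        seeds = [j for j in neighbors if j != i]
        while seeds:                      # expand through density-reachable points
            j = seeds.pop()
            if labels[j] == NOISE:        # noise adjacent to a core point: border point
                labels[j] = cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels

# Two tight blobs and one outlier:
pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1),
       (5, 5), (5.1, 5), (5, 5.1), (5.1, 5.1),
       (10, 10)]
print(dbscan(pts, 0.5, 3))
```

The quadratic `region_query` scan mirrors the full distance matrix used by the MATLAB script; a spatial index (e.g. a k-d tree) would be the usual remedy for the I/O and memory weakness noted above.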
IV. Algorithm Implementation
%% DBSCAN
clear all; clc;

%% Import the data set
% data = load('testData.txt');
data = randn(50, 2);

%% Define the parameters Eps and MinPts
MinPts = 5;
Eps = epsilon(data, MinPts);

[m, n] = size(data);           % get the size of the data
x = [(1:m)' data];             % prepend an index column
[m, n] = size(x);              % recompute the size of the data set
types = zeros(1, m);           % marks core points (1), border points (0) and noise points (-1)
dealed = zeros(m, 1);          % marks whether a point has been processed; 0 means not yet
class = zeros(1, m);           % cluster label of each point
dis = caldistance(x(:, 2:n));  % pairwise distance matrix
number = 1;                    % label of the current cluster

%% Process each point
for i = 1:m
    % find an unprocessed point
    if dealed(i) == 0
        D = dis(i, :);         % distances from point i to all other points
        ind = find(D <= Eps);  % all points within radius Eps

        % border point
        if length(ind) > 1 && length(ind) < MinPts + 1
            types(i) = 0;
            class(i) = 0;
        end

        % noise point
        if length(ind) == 1
            types(i) = -1;
            class(i) = -1;
            dealed(i) = 1;
        end

        % core point (this is the key step)
        if length(ind) >= MinPts + 1
            types(i) = 1;
            class(ind) = number;

            % expand the cluster through all density-reachable points
            while ~isempty(ind)
                p = ind(1);            % next seed point
                dealed(p) = 1;
                ind(1) = [];
                D = dis(p, :);         % distances from the seed point
                ind_1 = find(D <= Eps);

                if length(ind_1) > 1   % the seed is not a noise point
                    class(ind_1) = number;
                    if length(ind_1) >= MinPts + 1
                        types(p) = 1;
                    else
                        types(p) = 0;
                    end

                    for j = 1:length(ind_1)
                        if dealed(ind_1(j)) == 0
                            dealed(ind_1(j)) = 1;
                            ind = [ind ind_1(j)];
                            class(ind_1(j)) = number;
                        end
                    end
                end
            end
            number = number + 1;
        end
    end
end

% finally treat all unclassified points as noise points
ind_2 = find(class == 0);
class(ind_2) = -1;
types(ind_2) = -1;

%% Draw the final cluster diagram
hold on
for i = 1:m
    if class(i) == -1
        plot(data(i,1), data(i,2), '.r');
    elseif class(i) == 1
        if types(i) == 1
            plot(data(i,1), data(i,2), '+b');
        else
            plot(data(i,1), data(i,2), '.b');
        end
    elseif class(i) == 2
        if types(i) == 1
            plot(data(i,1), data(i,2), '+g');
        else
            plot(data(i,1), data(i,2), '.g');
        end
    elseif class(i) == 3
        if types(i) == 1
            plot(data(i,1), data(i,2), '+c');
        else
            plot(data(i,1), data(i,2), '.c');
        end
    else
        if types(i) == 1
            plot(data(i,1), data(i,2), '+k');
        else
            plot(data(i,1), data(i,2), '.k');
        end
    end
end
hold off
%% Compute the pairwise distance matrix of the points in x
function [dis] = caldistance(x)
[m, n] = size(x);
dis = zeros(m, m);
for i = 1:m
    for j = i:m
        % Euclidean distance between point i and point j
        tmp = 0;
        for k = 1:n
            tmp = tmp + (x(i,k) - x(j,k))^2;
        end
        dis(i,j) = sqrt(tmp);
        dis(j,i) = dis(i,j);
    end
end
end
function [Eps] = epsilon(x, k)
% Function: [Eps] = epsilon(x, k)
% Aim: analytical estimate of the neighborhood radius Eps for DBSCAN
% Input:
%   x - data matrix (m, n); m objects, n variables
%   k - number of objects in a neighborhood of an object
%       (minimal number of objects considered as a cluster)
[m, n] = size(x);
Eps = ((prod(max(x) - min(x)) * k * gamma(0.5*n + 1)) / (m * sqrt(pi^n)))^(1/n);
end
Note: prod(a) is the product of the elements of the array a; a^n is the n-th matrix power a*a*...*a, while a.^n raises each element of a to the n-th power.
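For readers without MATLAB, the same Eps heuristic can be transliterated into Python; this is a sketch assuming the same formula as the `epsilon` function above, with `x` given as a list of equal-length coordinate tuples.

```python
import math

def epsilon(x, k):
    """Estimate the DBSCAN neighborhood radius Eps.
    x is a list of m points with n coordinates each; k plays the role of MinPts."""
    m, n = len(x), len(x[0])
    # product of the per-dimension ranges, i.e. prod(max(x) - min(x)) in MATLAB
    ranges = math.prod(max(p[d] for p in x) - min(p[d] for p in x)
                       for d in range(n))
    return ((ranges * k * math.gamma(0.5 * n + 1))
            / (m * math.sqrt(math.pi ** n))) ** (1.0 / n)

# Four corners of a 2x2 square with k = 2:
print(epsilon([(0, 0), (2, 0), (0, 2), (2, 2)], 2))
```

For this example the formula reduces to sqrt(2/pi), since the range product is 4, n = 2, gamma(2) = 1, and m = 4.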