DBSCAN: A Density-Based Clustering Algorithm

Source: Internet
Author: User

I. Algorithm Overview

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as a maximal set of density-connected points. It can turn any region of sufficiently high density into a cluster, and it can discover clusters of arbitrary shape in a spatial database that contains noise. (In the author's view, this is because clusters are grown through density connectivity rather than by assigning each point to the nearest center; center-based methods tend to find roughly spherical clusters.)

The algorithm relies on a density-based notion of a cluster: the number of objects (points or other spatial objects) contained within a given region of the clustering space must be no less than a given threshold. The main advantages of DBSCAN are that it clusters quickly, handles noise points effectively, and discovers spatial clusters of arbitrary shape. However, because it operates directly on the entire database and uses a single, global set of density parameters, it also has two fairly obvious weaknesses:

(1) As the amount of data grows, it requires a large amount of memory, and the I/O cost is also high;

(2) When the density of the clusters is not uniform and the spacing between points differs greatly from cluster to cluster, the clustering quality is poor. Some clusters have small distances between their points and others have large ones, but Eps is fixed. If Eps is too small, points in the sparse clusters may be mistaken for outliers or border points; if Eps is too large, the tightly packed clusters may absorb outliers or border points. The k in KNN has the same problem.
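The second weakness can be seen on a tiny one-dimensional example. The sketch below is illustrative only: the coordinates, the two Eps values and MinPts = 3 are made-up assumptions, and the neighbor count includes the point itself.

```matlab
% Illustrative 1-D data (hypothetical values): a tight group, one stray
% point near it, and a loose group.
x = [0 0.1 0.2 0.3, 1.0, 10 11 12 13]';
MinPts = 3;                       % assumed threshold; the count includes the point itself

% Pairwise distances. Uses implicit expansion (MATLAB R2016b+ or Octave);
% on older versions use bsxfun(@minus, x, x') instead.
D = abs(x - x');

countSmall = sum(D <= 0.15, 2)';  % neighbors per point for a small Eps
countLarge = sum(D <= 1.2, 2)';   % neighbors per point for a large Eps

coreSmall = countSmall >= MinPts; % core points under the small Eps
coreLarge = countLarge >= MinPts; % core points under the large Eps

disp([countSmall; countLarge]);
```

With Eps = 0.15 none of the points in the loose group (10 to 13) is a core point, so the whole group would be reported as noise. With Eps = 1.2 the loose group gains core points, but the stray point at 1.0 now has five neighbors and is absorbed into the tight group as a core point.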

DBSCAN also has several advantages:

(1) Unlike K-means, it does not require the number of clusters to be specified in advance;

(2) It is not biased toward any particular cluster shape (the author admits to not fully understanding this point);

(3) Parameters for filtering noise can be supplied when needed.

II. Basic Definitions of the Algorithm

III. Algorithm Description

3.1 Algorithm Prerequisites

The DBSCAN algorithm rests on the fact that a cluster is uniquely determined by any one of its core objects. Equivalently: for any data object p that satisfies the core-object condition, the set of all data objects in database D that are density-reachable from p forms a complete cluster C, and p belongs to C.
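This property can be checked on a small example. The following is a minimal sketch, not the author's implementation: the sample points, Eps = 1.1 and MinPts = 3 are assumptions chosen for illustration, and the neighbor count includes the point itself. It collects every point that is density-reachable from a starting core object, once for each of two different core objects of the same cluster.

```matlab
% Hypothetical 2-D sample: five points in a chain plus one far-away point.
X = [0 0; 1 0; 2 0; 3 0; 4 0; 10 10];
Eps = 1.1;  MinPts = 3;           % assumed values; counts include the point itself

m = size(X, 1);
D = zeros(m);
for i = 1:m
    for j = 1:m
        D(i, j) = norm(X(i, :) - X(j, :));   % Euclidean distance
    end
end
isCore = sum(D <= Eps, 2)' >= MinPts;        % core-object condition

starts = [2 4];                   % two different core objects of the same cluster
reached = cell(1, numel(starts));
for s = 1:numel(starts)
    inCluster = false(1, m);
    inCluster(starts(s)) = true;
    queue = starts(s);
    while ~isempty(queue)
        p = queue(1);
        queue(1) = [];
        if isCore(p)              % only core objects extend the cluster
            nb = find(D(p, :) <= Eps & ~inCluster);
            inCluster(nb) = true;
            queue = [queue nb];
        end
    end
    reached{s} = find(inCluster); % indices density-reachable from starts(s)
end
```

Both starting points produce the same set, points 1 to 5. The end points 1 and 5 join as border points, and the isolated point 6 is never reached.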

3.2 Algorithmic Flow

IV. Algorithm Implementation
%% DBSCAN
clear all;
clc;

%% Import the data set
% data = load('testData.txt');
data = randn(50,2);

% Define the parameters Eps and MinPts
MinPts = 5;
Eps = epsilon(data, MinPts);

[m,n] = size(data);           % size of the data

x = [(1:m)' data];            % prepend each point's index as the first column
[m,n] = size(x);              % recompute the size of the data set
types = zeros(1,m);           % point type: core point 1, border point 0, noise point -1
class = zeros(1,m);           % cluster label per point, 0 = unassigned (note: shadows the built-in class)
dealed = zeros(m,1);          % whether a point has been processed; 0 means not processed
dis = calDistance(x(:,2:n));
number = 1;                   % label of the cluster currently being built

%% Process every point
for i = 1:m
    % Look for a point that has not been processed yet
    if dealed(i) == 0
        xTemp = x(i,:);
        D = dis(i,:);             % distances from point i to all other points
        ind = find(D<=Eps);       % all points within radius Eps (including i itself)

        %% Determine the type of the point

        % Border point (tentative; it stays unassigned unless a core point reaches it)
        if length(ind) > 1 && length(ind) < MinPts+1
            types(i) = 0;
            class(i) = 0;
        end
        % Noise point
        if length(ind) == 1
            types(i) = -1;
            class(i) = -1;
            dealed(i) = 1;
        end
        % Core point (this is the key step)
        if length(ind) >= MinPts+1
            types(xTemp(1,1)) = 1;
            class(ind) = number;

            % Expand the cluster through density-reachable points
            while ~isempty(ind)
                yTemp = x(ind(1),:);
                dealed(ind(1)) = 1;
                ind(1) = [];
                D = dis(yTemp(1,1),:);    % distances from the point just taken from ind
                ind_1 = find(D<=Eps);

                if length(ind_1) >= MinPts+1
                    % Core point: its whole neighborhood joins the cluster
                    types(yTemp(1,1)) = 1;
                    class(ind_1) = number;
                    for j = 1:length(ind_1)
                        if dealed(ind_1(j)) == 0
                            dealed(ind_1(j)) = 1;
                            ind = [ind ind_1(j)];
                        end
                    end
                else
                    % Border point: it belongs to the cluster but must not expand it
                    types(yTemp(1,1)) = 0;
                end
            end
            number = number + 1;
        end
    end
end

% Finally, treat every point that is still unassigned as noise
ind_2 = find(class==0);
class(ind_2) = -1;
types(ind_2) = -1;

%% Draw the final clustering ('+' marks core points, '.' marks other points)
hold on
for i = 1:m
    if class(i) == -1
        plot(data(i,1),data(i,2),'.r');
    elseif class(i) == 1
        if types(i) == 1
            plot(data(i,1),data(i,2),'+b');
        else
            plot(data(i,1),data(i,2),'.b');
        end
    elseif class(i) == 2
        if types(i) == 1
            plot(data(i,1),data(i,2),'+g');
        else
            plot(data(i,1),data(i,2),'.g');
        end
    elseif class(i) == 3
        if types(i) == 1
            plot(data(i,1),data(i,2),'+c');
        else
            plot(data(i,1),data(i,2),'.c');
        end
    else
        if types(i) == 1
            plot(data(i,1),data(i,2),'+k');
        else
            plot(data(i,1),data(i,2),'.k');
        end
    end
end
hold off


%% Compute the distance between every pair of points in the matrix
function [dis] = calDistance(x)
    [m,n] = size(x);
    dis = zeros(m,m);

    for i = 1:m
        for j = i:m
            % Euclidean distance between point i and point j
            tmp = 0;
            for k = 1:n
                tmp = tmp + (x(i,k)-x(j,k)).^2;
            end
            dis(i,j) = sqrt(tmp);
            dis(j,i) = dis(i,j);
        end
    end
end
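For larger data sets the nested loops in the distance function can be slow. As an alternative sketch (not part of the original code), the same matrix can be computed without explicit loops from the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. The function name calDistanceVec is hypothetical.

```matlab
function dis = calDistanceVec(x)
% Vectorized pairwise Euclidean distances between the rows of x.
% Hypothetical drop-in alternative to the loop-based distance function.
    m  = size(x, 1);
    sq = sum(x.^2, 2);                        % squared norm of every row (m-by-1)
    d2 = bsxfun(@plus, sq, sq') - 2*(x*x');   % squared distances via the identity
    d2 = max(d2, 0);                          % clip tiny negative values caused by rounding
    dis = sqrt(d2);
    dis(1:m+1:end) = 0;                       % force an exactly zero diagonal
end
```

For example, calDistanceVec([0 0; 3 4; 6 8]) gives a distance of 5 between consecutive points and 10 between the first and the last. The trade-off is slightly lower numerical accuracy than the direct loop when points lie very close together.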


function [Eps]=epsilon(x,k)

% Function: [Eps]=epsilon(x,k)
%
% Aim:
% Analytical way of estimating the neighborhood radius for DBSCAN
%
% Input:
% x - data matrix (m,n); m-objects, n-variables
% k - number of objects in a neighborhood of an object
% (minimal number of objects considered as a cluster)

[m,n]=size(x);

Eps=((prod(max(x)-min(x))*k*gamma(.5*n+1))/(m*sqrt(pi.^n))).^(1/n);

Note: prod returns the product of the elements of an array. A^n is the matrix power A*A*...*A, whereas A.^n raises each element of A to the n-th power.
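The expression for Eps can be read as follows. This is an interpretation of the code, under the assumption that the m points are spread uniformly over their axis-aligned bounding box: Eps is chosen so that an n-dimensional ball of radius Eps contains k points on average.

```latex
% Volume of the bounding box of the data (x_{ij}: variable j of object i)
V_{\text{box}} = \prod_{j=1}^{n} \left( \max_i x_{ij} - \min_i x_{ij} \right)

% Volume of an n-dimensional ball of radius Eps
V_{\text{ball}}(\mathrm{Eps}) = \frac{\pi^{n/2}}{\Gamma\!\left(\tfrac{n}{2}+1\right)} \, \mathrm{Eps}^{\,n}

% Expected number of points inside the ball under uniform density, set equal to k
m \, \frac{V_{\text{ball}}(\mathrm{Eps})}{V_{\text{box}}} = k
\quad\Longrightarrow\quad
\mathrm{Eps} = \left( \frac{V_{\text{box}} \; k \; \Gamma\!\left(\tfrac{n}{2}+1\right)}{m \, \sqrt{\pi^{\,n}}} \right)^{1/n}
```

Here prod(max(x)-min(x)) in the code corresponds to V_box, gamma(.5*n+1) to Γ(n/2+1), and sqrt(pi.^n) to π^(n/2).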
