1. An overview of density-based clustering algorithms

Recently, a density-based clustering algorithm published in Science, "Clustering by fast search and find of density peaks", attracted a lot of attention (the density-peaks clustering algorithm is also described in my blog post "A simple and easy-to-learn machine learning algorithm", in Chinese). This made me want to understand density-based clustering algorithms better, and in particular how they differ from distance-based clustering algorithms such as K-means. The main goal of a density-based clustering algorithm is to find high-density regions separated by low-density regions. Unlike distance-based algorithms, whose clusters are spherical, density-based algorithms can discover clusters of arbitrary shape, which matters when the data contain noise points.

2. The principle of the DBSCAN algorithm

2.1 Basic concepts

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a typical density-based clustering algorithm. In DBSCAN, the data points are divided into three categories:
- Core point: a point whose radius-Eps neighborhood contains more than MinPts points.
- Boundary point: a point whose radius-Eps neighborhood contains fewer than MinPts points, but which falls within the neighborhood of some core point.
- Noise point: a point that is neither a core point nor a boundary point.

So there are two parameters here: the radius Eps and the specified number of points MinPts. A few other concepts are also needed:

- Eps-neighborhood: simply put, the set of points whose distance to a given point p is no more than Eps; it can be written as N_Eps(p) = { q | dist(p, q) <= Eps }.
- Directly density-reachable: if an object q lies within the Eps-neighborhood of a core object p, then q is said to be directly density-reachable from p.
- Density-reachable: for a chain of objects p_1, p_2, ..., p_n with p_1 = q and p_n = p, if each p_{i+1} is directly density-reachable from p_i with respect to Eps and MinPts, then p is density-reachable from q with respect to Eps and MinPts.
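The three point categories above can be computed directly from the Eps-neighborhoods. The following is a minimal illustrative Python sketch (not the article's MATLAB code; names such as classify_points and the sample points are made up here):

```python
# Minimal sketch of DBSCAN's point categories: core / border / noise.
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'.

    A point is 'core' if its Eps-neighborhood (point itself included)
    holds at least min_pts points, 'border' if it is not core but lies
    in some core point's neighborhood, and 'noise' otherwise.
    """
    n = len(points)
    # Eps-neighborhood of each point: indices of points within radius eps.
    neigh = [[j for j in range(n)
              if math.dist(points[i], points[j]) <= eps] for i in range(n)]
    labels = []
    for i in range(n):
        if len(neigh[i]) >= min_pts:
            labels.append("core")
        elif any(len(neigh[j]) >= min_pts for j in neigh[i]):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),  # dense block
       (2.0, 0.5),                                   # on the fringe
       (9, 9)]                                       # isolated
print(classify_points(pts, eps=1.2, min_pts=4))
# → ['core', 'core', 'core', 'core', 'core', 'border', 'noise']
```

Note that the fringe point is labeled "border" only because it falls inside the neighborhood of a core point; the isolated point belongs to no core neighborhood and is therefore noise.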
2.2 Algorithm flow

(The flow chart given as an image in the original post is not reproduced here.)

3. Experimental simulation

Two test data sets are used in the experiment; their original plots are as follows: (DataSet 1) (DataSet 2). DataSet 1 is relatively simple: it clearly contains two classes, while DataSet 2 contains four classes. Below we use the DBSCAN algorithm to cluster the data points.

MATLAB code

Main program:
%% DBSCAN
clear all;
clc
%% import the data set
% data = load('testData.txt');
data = load('testdata_2.txt');
% define the parameters Eps and MinPts
MinPts = 5;
Eps = epsilon(data, MinPts);
[m,n] = size(data);
x = [(1:m)' data];       % prepend each point's index as its first column
[m,n] = size(x);         % recalculate the data set size
types = zeros(1,m);      % distinguishes core points (1), boundary points (0) and noise points (-1)
dealed = zeros(m,1);     % marks whether a point has been processed; 0 means not yet
class = zeros(1,m);      % cluster label of each point (0 = unassigned)
dis = calDistance(x(:,2:n));
number = 1;              % used to label the classes
%% process every point
for i = 1:m
    % find an unprocessed point
    if dealed(i) == 0
        D = dis(i,:);            % distances from point i to all other points
        ind = find(D <= Eps);    % all points within the radius Eps
        %% classify the point
        % boundary point
        if length(ind) > 1 && length(ind) < MinPts + 1
            types(i) = 0;
            class(i) = 0;
        end
        % noise point
        if length(ind) == 1
            types(i) = -1;
            class(i) = -1;
            dealed(i) = 1;
        end
        % core point (this is the key step)
        if length(ind) >= MinPts + 1
            types(i) = 1;
            class(ind) = number;
            % expand the cluster through every density-reachable point
            while ~isempty(ind)
                yTemp = x(ind(1),:);
                dealed(ind(1)) = 1;
                ind(1) = [];
                D = dis(yTemp(1),:);     % distances from the point just dequeued
                ind_1 = find(D <= Eps);
                if length(ind_1) > 1     % handle non-noise points
                    class(ind_1) = number;
                    if length(ind_1) >= MinPts + 1
                        types(yTemp(1)) = 1;
                    else
                        types(yTemp(1)) = 0;
                    end
                    for j = 1:length(ind_1)
                        if dealed(ind_1(j)) == 0
                            dealed(ind_1(j)) = 1;
                            ind = [ind ind_1(j)];
                            class(ind_1(j)) = number;
                        end
                    end
                end
            end
            number = number + 1;
        end
    end
end
% finally, treat every still-unclassified point as noise
ind_2 = find(class == 0);
class(ind_2) = -1;
types(ind_2) = -1;
% draw the final clustering figure
hold on
for i = 1:m
    if class(i) == -1
        plot(data(i,1), data(i,2), '.r');
    elseif class(i) == 1
        if types(i) == 1
            plot(data(i,1), data(i,2), '+b');
        else
            plot(data(i,1), data(i,2), '.b');
        end
    elseif class(i) == 2
        if types(i) == 1
            plot(data(i,1), data(i,2), '+g');
        else
            plot(data(i,1), data(i,2), '.g');
        end
    elseif class(i) == 3
        if types(i) == 1
            plot(data(i,1), data(i,2), '+c');
        else
            plot(data(i,1), data(i,2), '.c');
        end
    else
        if types(i) == 1
            plot(data(i,1), data(i,2), '+k');
        else
            plot(data(i,1), data(i,2), '.k');
        end
    end
end
hold off
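The main program above follows the standard DBSCAN flow: pick an unvisited core point, open a new cluster, and keep expanding the cluster through density-reachable neighbors until the seed queue is empty. As a compact cross-check, the same flow can be sketched in Python (illustrative only; the function and variable names here are my own):

```python
# Compact sketch of the DBSCAN flow: find a core point, start a cluster,
# expand it through density-reachable points via a seed queue.
import math

def dbscan(points, eps, min_pts):
    """Return a cluster id (1, 2, ...) per point; -1 marks noise."""
    n = len(points)
    neigh = [[j for j in range(n)
              if math.dist(points[i], points[j]) <= eps] for i in range(n)]
    labels = [None] * n              # None = unvisited
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neigh[i]) < min_pts:
            labels[i] = -1           # provisionally noise (may become border)
            continue
        cluster += 1                 # i is a core point: open a new cluster
        labels[i] = cluster
        seeds = list(neigh[i])       # queue of points still to expand,
        while seeds:                 # like the MATLAB while ~isempty(ind)
            q = seeds.pop()
            if labels[q] == -1:      # former noise reclassified as border
                labels[q] = cluster
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neigh[q]) >= min_pts:   # q is itself core: keep growing
                seeds.extend(neigh[q])
    return labels

# two well-separated blobs plus one outlier
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=3))
# → [1, 1, 1, 1, 2, 2, 2, 2, -1]
```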
Distance calculation function
%% compute the distance between every pair of points in the matrix
function [dis] = calDistance(x)
    [m,n] = size(x);
    dis = zeros(m,m);
    for i = 1:m
        for j = i:m
            % Euclidean distance between point i and point j
            tmp = 0;
            for k = 1:n
                tmp = tmp + (x(i,k) - x(j,k)).^2;
            end
            dis(i,j) = sqrt(tmp);
            dis(j,i) = dis(i,j);
        end
    end
end
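The triple loop in calDistance is easy to follow but slow in array languages; the whole distance matrix can instead be computed in one vectorized step via the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. A NumPy sketch of the same computation (the name pairwise_distances is illustrative):

```python
# Vectorized pairwise Euclidean distance matrix, equivalent to the
# triple-loop calDistance but computed in one shot.
import numpy as np

def pairwise_distances(x):
    """Full m-by-m Euclidean distance matrix for the rows of x."""
    sq = np.sum(x**2, axis=1)                       # ||a||^2 per row
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.maximum(d2, 0.0))             # clip tiny negatives from rounding

x = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(pairwise_distances(x))
# → [[ 0.  5. 10.]
#    [ 5.  0.  5.]
#    [10.  5.  0.]]
```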
Epsilon function
function [Eps] = epsilon(x,k)
% Function: [Eps] = epsilon(x,k)
%
% Aim:
% Analytical way of estimating the neighborhood radius for DBSCAN
%
% Input:
% x - data matrix (m,n); m objects, n variables
% k - number of objects in a neighborhood of an object
%     (minimal number of objects considered as a cluster)

[m,n] = size(x);
Eps = ((prod(max(x) - min(x)) * k * gamma(0.5*n + 1)) / (m * sqrt(pi^n)))^(1/n);
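The heuristic implemented by the epsilon function above (it comes from reference [2]) estimates Eps as the radius of an n-dimensional ball that would contain about k points if the m points were spread uniformly over the data's bounding box. It can be transcribed to Python as follows (an illustrative transcription, not the authors' code):

```python
# Analytical Eps heuristic: radius of an n-ball expected to hold k points
# under a uniform spread of the m points over the data's bounding box.
import math

def epsilon(x, k):
    """x: list of m points (each a tuple of n coordinates); k: MinPts."""
    m, n = len(x), len(x[0])
    ranges = [max(p[d] for p in x) - min(p[d] for p in x) for d in range(n)]
    volume = math.prod(ranges)                       # bounding-box volume
    return ((volume * k * math.gamma(0.5 * n + 1))
            / (m * math.sqrt(math.pi ** n))) ** (1.0 / n)

# 4 corners of the unit square with k = 4 gives sqrt(1/pi)
print(round(epsilon([(0, 0), (0, 1), (1, 0), (1, 1)], 4), 4))
# → 0.5642
```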
Final results: (clustering result of DataSet 1) (clustering result of DataSet 2). In the results above, red dots represent noise points, plain dots represent boundary points, and crosses represent core points; different colors represent different classes.

References
[1] M. Ester, H.-P. Kriegel, J. Sander, X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. www.dbs.informatik.uni-muenchen.de/cgi-bin/papers?query=--CO
[2] M. Daszykowski, B. Walczak, D. L. Massart. Looking for natural patterns in data. Part 1: Density-based approach.
(From the series "A simple and easy-to-learn machine learning algorithm": the density-based clustering algorithm DBSCAN.)