A simple and easy-to-learn machine learning algorithm: the density-based clustering algorithm DBSCAN

Source: Internet
Author: User

I. Overview of density-based clustering algorithms

Recently, a density-based clustering algorithm published in Science, "Clustering by fast search and find of density peaks", attracted attention (I also describe it, in Chinese, in my blog post on the density-peak clustering algorithm). So I wanted to understand density-based clustering algorithms and how they differ from distance-based clustering algorithms such as K-means.

The main goal of a density-based clustering algorithm is to find high-density regions that are separated by low-density regions. Distance-based clustering algorithms tend to produce spherical clusters, whereas density-based clustering algorithms can discover clusters of arbitrary shape, which is especially valuable for data that contains noise points.

II. Principle of the DBSCAN algorithm

1. Basic concepts

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a typical density-based clustering algorithm. In the DBSCAN algorithm, the data points are divided into three categories:
    • Core point. A point whose neighborhood of radius Eps contains more than MinPts points.
    • Border point. A point whose neighborhood of radius Eps contains no more than MinPts points, but which falls within the neighborhood of a core point.
    • Noise point. A point that is neither a core point nor a border point.
There are two parameters here: the radius Eps and the specified number of points MinPts. Some other concepts:
    1. Eps-neighborhood. In simple terms, the set of points whose distance from a point $p$ is less than or equal to Eps. It can be expressed as $N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \}$.
    2. Directly density-reachable. If an object $p$ lies within the Eps-neighborhood of a core object $q$, then $p$ is said to be directly density-reachable from $q$.
    3. Density-reachable. For a chain of objects $p_1, p_2, \ldots, p_n$ with $p_1 = q$ and $p_n = p$, if each $p_{i+1}$ is directly density-reachable from $p_i$ with respect to Eps and MinPts, then the object $p$ is density-reachable from the object $q$ with respect to Eps and MinPts.
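
The three point categories and the Eps-neighborhood can be illustrated with a small sketch. The toy data, the values chosen for Eps and MinPts, and the variable names are assumptions made for illustration only. Each point is counted as a member of its own neighborhood, so a core point needs more than MinPts points within Eps.

```matlab
% Toy data (hypothetical): a tight group of five points, one point close to
% the group, and one point far away from everything else
X = [0 0; 0.1 0; 0 0.1; 0.1 0.1; 0.05 0.05; 0.22 0.1; 5 5];
Eps = 0.15;      % assumed neighborhood radius
MinPts = 3;      % assumed density threshold

m = size(X, 1);
D = zeros(m, m);                       % pairwise Euclidean distances
for i = 1:m
    for j = 1:m
        D(i, j) = sqrt(sum((X(i, :) - X(j, :)).^2));
    end
end

inEps = D <= Eps;                      % inEps(i,j): point j lies in the Eps-neighborhood of point i
counts = sum(inEps, 2)';               % neighborhood sizes (each point counts itself)
isCore = counts > MinPts;              % core point: more than MinPts points within Eps

types = -ones(1, m);                   % start with every point marked as noise (-1)
for i = 1:m
    if isCore(i)
        types(i) = 1;                  % core point
    elseif any(inEps(i, :) & isCore)
        types(i) = 0;                  % border point: not core, but within Eps of a core point
    end
end
disp(types)
```

With this toy data the first five points come out as core points, the sixth as a border point (it is within Eps of the fourth point only) and the last as a noise point.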
2. Algorithm flow

(Figure: algorithm flow chart)

III. Experimental simulation

The experiment uses two test data sets, whose original plots are as follows:

(Figure: data set 1) (Figure: data set 2)

Data set 1 is relatively simple. Clearly, data set 1 contains two classes and data set 2 contains four. Below we use the DBSCAN algorithm to cluster the data points.

MATLAB code

Main program
    %% DBSCAN
    clear all;
    clc;

    %% Import the data set
    % data = load('testData.txt');
    data = load('testData_2.txt');

    % Define the parameters Eps and MinPts
    MinPts = 5;
    Eps = epsilon(data, MinPts);

    [m, n] = size(data);
    x = [(1:m)' data];              % prepend each point's index as the first column
    [m, n] = size(x);               % recalculate the data set size
    types = zeros(1, m);            % point type: core point 1, border point 0, noise point -1
    label = zeros(1, m);            % cluster label of each point; 0 means not yet assigned
                                    % (named "label" to avoid shadowing the built-in class)
    dealed = zeros(m, 1);           % 1 if the point has been processed, 0 if not
    dis = caldistance(x(:, 2:n));
    number = 1;                     % label of the cluster currently being built

    %% Process each point
    for i = 1:m
        % Find an unprocessed point
        if dealed(i) == 0
            xTemp = x(i, :);
            D = dis(i, :);          % distances from point i to all points
            ind = find(D <= Eps);   % all points within radius Eps (including point i itself)

            %% Distinguish the type of the point

            % Border point candidate
            if length(ind) > 1 && length(ind) < MinPts+1
                types(i) = 0;
                label(i) = 0;
            end
            % Noise point
            if length(ind) == 1
                types(i) = -1;
                label(i) = -1;
                dealed(i) = 1;
            end
            % Core point (this is the key step)
            if length(ind) >= MinPts+1
                types(xTemp(1)) = 1;
                ind = ind(dealed(ind) == 0);   % keep points already claimed by an earlier cluster
                label(ind) = number;

                % Collect every point that is density-reachable from the core point
                while ~isempty(ind)
                    yTemp = x(ind(1), :);
                    dealed(ind(1)) = 1;
                    ind(1) = [];
                    D = dis(yTemp(1), :);      % distances from the point just taken from ind
                    ind_1 = find(D <= Eps);

                    if length(ind_1) >= MinPts+1
                        % Core point: its unprocessed neighbors join the cluster
                        types(yTemp(1)) = 1;
                        for j = 1:length(ind_1)
                            if dealed(ind_1(j)) == 0
                                dealed(ind_1(j)) = 1;
                                ind = [ind ind_1(j)];
                                label(ind_1(j)) = number;
                            end
                        end
                    else
                        % Border point: it belongs to the cluster but does not expand it
                        types(yTemp(1)) = 0;
                    end
                end
                number = number + 1;
            end
        end
    end

    % Finally, treat all points that are still unclassified as noise points
    ind_2 = find(label == 0);
    label(ind_2) = -1;
    types(ind_2) = -1;

    %% Draw the final clustering plot
    hold on
    for i = 1:m
        if label(i) == -1
            plot(data(i,1), data(i,2), '.r');
        elseif label(i) == 1
            if types(i) == 1
                plot(data(i,1), data(i,2), '+b');
            else
                plot(data(i,1), data(i,2), '.b');
            end
        elseif label(i) == 2
            if types(i) == 1
                plot(data(i,1), data(i,2), '+g');
            else
                plot(data(i,1), data(i,2), '.g');
            end
        elseif label(i) == 3
            if types(i) == 1
                plot(data(i,1), data(i,2), '+c');
            else
                plot(data(i,1), data(i,2), '.c');
            end
        else
            if types(i) == 1
                plot(data(i,1), data(i,2), '+k');
            else
                plot(data(i,1), data(i,2), '.k');
            end
        end
    end
    hold off

Distance calculation function

    %% Compute the distance between every pair of points (rows) of the matrix
    function [dis] = caldistance(x)
        [m, n] = size(x);
        dis = zeros(m, m);
        for i = 1:m
            for j = i:m
                % Euclidean distance between point i and point j
                tmp = 0;
                for k = 1:n
                    tmp = tmp + (x(i,k) - x(j,k)).^2;
                end
                dis(i,j) = sqrt(tmp);
                dis(j,i) = dis(i,j);
            end
        end
    end
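
The triple loop of the distance function can also be written in vectorized form. The following is a minimal sketch rather than part of the original program: the function name caldistance_vec is hypothetical, and it relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a*b'.

```matlab
function dis = caldistance_vec(x)
% Hypothetical vectorized variant of caldistance (illustrative only).
% x is an m-by-n matrix with one point per row; dis is the m-by-m matrix
% of pairwise Euclidean distances.
    m = size(x, 1);
    sq = sum(x.^2, 2);                            % m-by-1 squared norms of the points
    d2 = bsxfun(@plus, sq, sq') - 2 * (x * x');   % m-by-m squared distances
    d2 = max(d2, 0);                              % clamp tiny negative values caused by round-off
    dis = sqrt(d2);
    dis(1:m+1:end) = 0;                           % the distance of a point to itself is exactly 0
end
```

Like the loop version, this needs memory for an m-by-m matrix. It is usually much faster for large m, at the cost of slightly larger round-off error for points that are almost identical.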

Epsilon function

    function [Eps] = epsilon(x, k)
    % Function: [Eps] = epsilon(x, k)
    %
    % Aim:
    % Analytical way of estimating the neighborhood radius for DBSCAN
    %
    % Input:
    % x - data matrix (m,n); m-objects, n-variables
    % k - number of objects in a neighborhood of an object
    % (minimal number of objects considered as a cluster)
    [m, n] = size(x);
    Eps = ((prod(max(x) - min(x)) * k * gamma(.5*n + 1)) / (m * sqrt(pi.^n))).^(1/n);
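
One way to read the expression in epsilon is the following. This is an interpretation of the formula in the code, not a derivation given in the source. If the m points were spread uniformly over the bounding box of the data, Eps is the radius of the n-dimensional ball that would be expected to contain k of them. The symbols $V_{box}$ and $V_{ball}$ are introduced here for illustration.

```latex
V_{box} = \prod_{j=1}^{n} \left( \max_i x_{ij} - \min_i x_{ij} \right),
\qquad
V_{ball}(Eps) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} \, Eps^{n}

\frac{V_{ball}(Eps)}{V_{box}} = \frac{k}{m}
\quad \Longrightarrow \quad
Eps = \left( \frac{V_{box} \; k \; \Gamma\left(\frac{n}{2} + 1\right)}{m \, \sqrt{\pi^{n}}} \right)^{1/n}
```

The right-hand side matches the code term by term: prod(max(x) - min(x)) is $V_{box}$, gamma(.5*n + 1) is $\Gamma(n/2 + 1)$, and sqrt(pi.^n) is $\pi^{n/2}$.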


Final results

(Figure: clustering result of data set 1) (Figure: clustering result of data set 2)

In the results above, the red dots represent noise points, the other dots represent border points, and the crosses represent core points. Different colors represent different classes.

References

[1] M. Ester, H. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, www.dbs.informatik.uni-muenchen.de/cgi-bin/papers?query=--CO
[2] M. Daszykowski, B. Walczak, D. L. Massart, Looking for natural patterns in data. Part 1: Density-based approach
