Given the shape of the clusters in the second dataset above, the clustering result should consist of connected clusters, but K-means does not handle this well, so a new clustering method is introduced here. It differs from the partition-based approach used before: partition-based methods mainly find circular or spherical clusters. To find clusters of arbitrary shape, a density-based clustering method is used instead, which treats clusters as dense regions of objects in the data space separated by regions of low density. This idea matches the characteristics of the dataset well.
DBSCAN is a density-based clustering method built on high-density connected regions. The algorithm groups regions of sufficiently high density into clusters and can find clusters of arbitrary shape in a spatial database with noise. It defines a cluster as the largest set of density-connected points. To understand the idea of density-based clustering, the following definitions must be mastered first:
The set of points within a given radius r of an object is called the r-neighborhood of the object;
If the r-neighborhood of an object contains at least a minimum number of objects, MinPts, the object is called a core object;
Given an object set D, if p is in the r-neighborhood of q and q is a core object, then object p is said to be directly density-reachable from object q;
If there exists a chain of objects p1, p2, ..., pn with p1 = q and pn = p, where each pi belongs to D and each p(i+1) is directly density-reachable from pi with respect to r and MinPts, then object p is density-reachable from object q with respect to r and MinPts;
If there exists an object o in D such that both p and q are density-reachable from o with respect to r and MinPts, then objects p and q are density-connected with respect to r and MinPts;
Density-reachability is the transitive closure of direct density-reachability and is an asymmetric relation; only core objects can be mutually density-reachable. Density-connectedness, in contrast, is a symmetric relation. A small sketch of these checks follows below;
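To make these definitions concrete, here is a minimal standalone sketch of the core-object and directly-density-reachable checks. The Point struct and function names are illustrative assumptions, not the project's own classes (those appear further below):

#include <cmath>
#include <vector>
using namespace std;

// Illustrative 2-D point; the project's own AbstractDataPT/DataPT2D classes
// are shown later in this post.
struct Point { double x, y; };

double dist(const Point& a, const Point& b)
{
    return sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// The r-neighborhood of point q: indices of all points within radius eps.
vector<int> epsNeighborhood(const vector<Point>& pts, int q, double eps)
{
    vector<int> result;
    for (int i = 0; i < (int)pts.size(); i++)
        if (dist(pts[q], pts[i]) <= eps)
            result.push_back(i);
    return result;
}

// q is a core object if its r-neighborhood contains at least minPts points.
bool isCoreObject(const vector<Point>& pts, int q, double eps, int minPts)
{
    return (int)epsNeighborhood(pts, q, eps).size() >= minPts;
}

// p is directly density-reachable from q if q is a core object and
// p lies inside q's r-neighborhood.
bool directlyDensityReachable(const vector<Point>& pts, int p, int q,
                              double eps, int minPts)
{
    return isCoreObject(pts, q, eps, minPts) && dist(pts[p], pts[q]) <= eps;
}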
With the above concepts in place, the algorithm can be described as follows: DBSCAN searches for clusters by checking the r-neighborhood of every point in the database. If the r-neighborhood of a point P contains at least MinPts points, a new cluster is created with P as a core object. DBSCAN then iteratively collects the objects that are directly density-reachable from these core objects, which may involve merging density-reachable clusters. The process ends when no new point can be added to any cluster;
The specific pseudocode is as follows (from Wikipedia):
DBSCAN(D, eps, MinPts)
    C = 0                                        // cluster id
    for each unvisited point P in dataset D      // traverse all points
        mark P as visited
        NeighborPts = regionQuery(P, eps)        // compute the neighborhood of this point
        if sizeof(NeighborPts) < MinPts          // P cannot be a core point
            mark P as NOISE
        else                                     // P is a core point: create a new cluster
            C = next cluster
            expandCluster(P, NeighborPts, C, eps, MinPts)   // expand the cluster from this core point

expandCluster(P, NeighborPts, C, eps, MinPts)
    add P to cluster C                           // the core point joins the cluster first
    for each point P' in NeighborPts             // then handle the points in the core point's neighborhood
        if P' is not visited
            mark P' as visited
            NeighborPts' = regionQuery(P', eps)
            if sizeof(NeighborPts') >= MinPts    // P' is also a core point: merge its neighborhood
                NeighborPts = NeighborPts joined with NeighborPts'
        if P' is not yet member of any cluster   // border or noise point without a category: add it to this cluster
            add P' to cluster C

regionQuery(P, eps)
    return all points within P's eps-neighborhood
The time complexity of this algorithm is dominated by the neighborhood computation. If a spatial index is available, the complexity can be reduced; without an index, each neighborhood query has to scan all points, which amounts to a doubly nested loop over the n points, so the time complexity is O(n²);
However, algorithms with the same asymptotic complexity can still differ greatly in running time, and constant-factor optimizations matter. Because I implemented the pseudocode directly without optimizing the data structures (the point class is just the one used for K-means, extended with density-related attributes such as whether a point has been visited, whether it is noise, and whether it has already been added to a neighborhood), the program produces results quickly at the scale of dataset 1, but with the much larger volume of points in dataset 2 it becomes painfully slow. The Release build is somewhat better, but under the debugger it is basically impossible to wait for the result, so any such program really should be optimized as far as you can manage;
The following describes the specific implementation of this algorithm:
// Color table for drawing clusters; categroy % 18 indexes into it.
// A few entries are placeholders here because the original values were unreadable.
COLORREF colorfulTmp[] = {
    RGB(192,192,192), RGB(255,255,0),   RGB(227,207,87),  RGB(255,153,18),
    RGB(235,142,85),  RGB(255,227,132), RGB(176,224,230), RGB(65,105,230),
    RGB(0,255,255),   RGB(255,0,255),   RGB(255,0,0),     RGB(0,255,0),
    RGB(34,139,42),   RGB(0,0,255),     RGB(128,0,128),   RGB(128,128,0),
    RGB(176,23,31),   RGB(0,0,0)
};

void Cluster::DBSCAN(vector<AbstractDataPT*> pts, double eps, int minPts)
{
    int categroy = 0;
    int i, length = pts.size();
    vector<AbstractDataPT*> neightPts, tmpNeightPts;
    // traverse all points
    for (i = 0; i < length; i++)
    {
        // only handle points that have not been visited yet
        if (!pts[i]->m_bVisited)
        {
            pts[i]->m_bVisited = true;
            // compute the eps-neighborhood of this point
            neightPts = pts[i]->regionQuery(pts, pts[i], eps);
            if (neightPts.size() < minPts)
            {
                // not a core point: mark it as noise
                pts[i]->categroy = 0;
                pts[i]->m_isNoise = true;
                pts[i]->m_colorPT = colorfulTmp[pts[i]->categroy % 18];
            }
            else
            {
                // core point: start a new cluster
                categroy++;
                pts[i]->categroy = categroy;
                pts[i]->m_colorPT = colorfulTmp[pts[i]->categroy % 18];
                // mark the points in the neighborhood as already included
                for (int k = 0; k < neightPts.size(); k++)
                {
                    neightPts[k]->m_isInclude = true;
                }
                // keep expanding the cluster; note that neightPts grows while we iterate,
                // which is why a while loop is used instead of a for loop
                int j = 1;
                while (j <= neightPts.size())
                {
                    // if the point has no category yet, add it to this cluster
                    if (neightPts[j-1]->categroy == 0)
                    {
                        neightPts[j-1]->categroy = categroy;
                        neightPts[j-1]->m_colorPT = colorfulTmp[neightPts[j-1]->categroy % 18];
                    }
                    // visit the point if it has not been visited
                    if (!neightPts[j-1]->m_bVisited)
                    {
                        neightPts[j-1]->m_bVisited = true;
                        // compute its neighborhood
                        tmpNeightPts = neightPts[j-1]->regionQuery(pts, neightPts[j-1], eps);
                        // if it is also a core point, merge its neighborhood
                        if (tmpNeightPts.size() >= minPts)
                        {
                            // only add points that have not been added before
                            // (this duplicate check is the time bottleneck)
                            for (int m = 0; m < tmpNeightPts.size(); m++)
                            {
                                if (!tmpNeightPts[m]->m_isInclude)
                                {
                                    neightPts.push_back(tmpNeightPts[m]);
                                    tmpNeightPts[m]->m_isInclude = true;
                                }
                            }
                        }
                    }
                    j++;
                }
            }
        }
    }
}
The code for computing the neighborhood is as follows (it simply loops over all points):
vector<AbstractDataPT*> DataPT2D::regionQuery(vector<AbstractDataPT*> pts, AbstractDataPT* pt, double eps)
{
    vector<AbstractDataPT*> neighbor;
    int i, length = pts.size();
    for (i = 0; i < length; i++)
    {
        // keep every point whose distance to pt is within eps
        if (pt->calculateD(pts[i]) <= eps)
        {
            neighbor.push_back(pts[i]);
        }
    }
    return neighbor;
}
Time performance is very slow, and improvements are still needed;
The data types are as follows:
# pragma once
#include <vector>
using namespace std;

class AbstractDataPT
{
public:
    AbstractDataPT(void);
    virtual ~AbstractDataPT(void);
    virtual double calculateD(AbstractDataPT* otherPT) = 0;
    virtual void drawSelf(CDC* pDC) = 0;
    virtual void drawNormalSelf(CDC* pDC) = 0;
    virtual vector<AbstractDataPT*> calculateMeans(vector<AbstractDataPT*> pts, int k) = 0;
    virtual double calculateE(vector<AbstractDataPT*> pts, vector<AbstractDataPT*> kMeans) = 0;
    virtual vector<AbstractDataPT*> regionQuery(vector<AbstractDataPT*> pts, AbstractDataPT* pt, double eps) = 0;
public:
    int categroy;        // cluster id (0 = noise / unassigned)
    COLORREF m_colorPT;  // drawing color
    bool m_bVisited;     // visited flag used by DBSCAN
    bool m_isInclude;    // already added to a neighborhood list
    bool m_isNoise;      // marked as noise
};

# pragma once
#include "AbstractDataPT.h"

class DataPT2D : public AbstractDataPT
{
public:
    DataPT2D(void);
    DataPT2D(double cx, double cy);
    DataPT2D(double cx, double cy, COLORREF color);
    ~DataPT2D(void);
    double calculateD(AbstractDataPT* otherPT);
    void drawSelf(CDC* pDC);
    void drawNormalSelf(CDC* pDC);
    vector<AbstractDataPT*> calculateMeans(vector<AbstractDataPT*> pts, int k);
    double calculateE(vector<AbstractDataPT*> pts, vector<AbstractDataPT*> kMeans);
    vector<AbstractDataPT*> regionQuery(vector<AbstractDataPT*> pts, AbstractDataPT* pt, double eps);

    double m_fCX;
    double m_fCY;
};
After an extremely long wait in the Release build, the result is finally displayed, as shown below:
The result looks a bit odd because only 18 colors are defined, so colors are reused across clusters.
Correction: during the preprocessing for the second problem, some data issues were ignored, which is why the clustering of the second dataset did not turn out well. The fix is not just to remove pure white points; the filter condition is strengthened to also drop slightly off-white points, with the threshold set as follows:
if (nCR > 235 && nCG > 235 && nCB > 235) { continue; }
Another benefit of this processing is that more points are filtered out, so the total number of points is reduced; a sketch of how this filter might sit in the preprocessing loop follows below;
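For context, here is a minimal sketch of how such a filter might sit inside the preprocessing loop. The use of an ATL CImage and the loadPoints helper are assumptions for illustration; only the filter condition itself comes from the actual code, and the DataPT2D class from the header above is assumed to be included:

#include <atlimage.h>   // CImage (ATL)
#include <vector>
using namespace std;

// Hedged sketch: an assumed preprocessing step, not the project's exact code.
// Reads every pixel of a loaded image and keeps only points that are not
// near-white; pixels with all channels > 235 are treated as background.
vector<AbstractDataPT*> loadPoints(CImage& img)
{
    vector<AbstractDataPT*> pts;
    for (int y = 0; y < img.GetHeight(); y++)
    {
        for (int x = 0; x < img.GetWidth(); x++)
        {
            COLORREF c = img.GetPixel(x, y);
            int nCR = GetRValue(c);
            int nCG = GetGValue(c);
            int nCB = GetBValue(c);
            // stronger filter: drop white and slightly off-white pixels
            if (nCR > 235 && nCG > 235 && nCB > 235)
            {
                continue;
            }
            pts.push_back(new DataPT2D(x, y, c));
        }
    }
    return pts;
}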
This yields a much cleaner result, shown below. The image is enlarged here, so part of it falls outside the view:
At this point a way to improve the time efficiency is needed. The DBSCAN described in the book improves time efficiency with a spatial index. An R-tree would work here, but implementing one is expensive and I do not know how to do it yet; I downloaded the source code of an R-tree implementation and will study this data structure when I have time;
There is another method that also improves efficiency and, for this data, removes much of the time cost; with a better implementation it should reach roughly linear time. With it, the result can be computed in about five minutes. That is still quite slow, because my program is redundant and messy in many places and the code quality makes it hard to modify; since five minutes meets my goal, I am leaving it as is. With the same idea implemented properly, it could probably finish within a minute;
The principle is as follows: before computing neighborhoods, the plane is divided into a grid of square cells whose side length equals the radius. Each point then only needs to examine the points in its own cell and the eight surrounding cells, so for this data each neighborhood query takes roughly constant time, and the preprocessing is linear;
The main code is as follows:
vector<AbstractDataPT*> DataPT2D::regionQuery(vector<AbstractDataPT*> pts, AbstractDataPT* pt, double eps)
{
    // grid cell index of this point
    int index = regionNum(pt, eps);
    // the nine candidate cells around it:
    //   index-1-num,  index-num,  index+1-num
    //   index-1,      index,      index+1
    //   index-1+num,  index+num,  index+1+num
    // where num = X_2D_MAX / eps is the number of cells per row
    int num = X_2D_MAX / eps;
    vector<AbstractDataPT*> tmp;
    int i, j, k;
    // collect the candidate points from the 3x3 block of cells
    for (i = -1; i < 2; i++)
    {
        for (j = -1; j < 2; j++)
        {
            int tmpIndex = index + i + num * j;
            // skip cell indices that fall outside the grid
            if (tmpIndex < 0 || tmpIndex >= (int)region.size())
            {
                continue;
            }
            int size = region[tmpIndex]->regionBy.size();
            for (k = 0; k < size; k++)
            {
                tmp.push_back(region[tmpIndex]->regionBy[k]);
            }
        }
    }
    // keep only the candidates that are really within eps of pt
    vector<AbstractDataPT*> neighbor;
    int length = tmp.size();
    for (i = 0; i < length; i++)
    {
        if (pt->calculateD(tmp[i]) <= eps)
        {
            neighbor.push_back(tmp[i]);
        }
    }
    return neighbor;
}

int DataPT2D::regionNum(AbstractDataPT* pt, double eps)
{
    // map the point's coordinates onto the eps-sized grid
    DataPT2D* tmpDataPT = (DataPT2D*)pt;
    int xIndex = tmpDataPT->m_fCX / eps;
    int yIndex = tmpDataPT->m_fCY / eps;
    int num = X_2D_MAX / eps;
    return xIndex + yIndex * num;
}
void Cluster::DBSCAN(vector<AbstractDataPT*> pts, double eps, int minPts)
{
    // clear the region (grid) data structure
    region.clear();
    int categroy = 0;
    int i, length = pts.size();
    // initialize the grid; BIGMAX cells is more than enough for this data
    for (i = 0; i < BIGMAX; i++)
    {
        region.push_back(new RegionVector());
    }
    // assign every point to its grid cell according to the radius eps
    for (i = 0; i < length; i++)
    {
        region[pts[i]->regionNum(pts[i], eps)]->regionBy.push_back(pts[i]);
    }

    vector<AbstractDataPT*> neightPts, tmpNeightPts;
    // traverse all points; from here on the function is identical to the
    // DBSCAN implementation shown earlier (visit each point, compute its
    // neighborhood with the grid-based regionQuery, mark noise, and expand
    // clusters from core points), so the body is not repeated here
    for (i = 0; i < length; i++)
    {
        // ... same as above ...
    }
}
Compared with the earlier version, only the code that initializes the grid based on the radius is added; the rest of the function is not modified;
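For completeness, here is a minimal usage sketch. The eps and minPts values, and the loadPoints helper from the preprocessing sketch above, are illustrative assumptions rather than the actual parameters used for these datasets:

// Hedged usage sketch: illustrative parameter values, not the ones actually
// used for the two datasets above.
vector<AbstractDataPT*> pts = loadPoints(img);   // e.g. the preprocessing sketch above
Cluster cluster;
double eps = 5.0;     // assumed neighborhood radius
int minPts = 4;       // assumed core-point threshold
cluster.DBSCAN(pts, eps, minPts);
// after the call, each point's categroy holds its cluster id (0 = noise)
// and m_colorPT holds the color used for drawing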