If only one density-based clustering algorithm existed in the world, it would have to be DBSCAN (density-based spatial clustering of applications with noise). As a typical density-based clustering algorithm, its biggest advantages over KMeans are that it determines the number of clusters by itself and filters out some noise along the way. On the other hand, the results are fairly sensitive to the parameters, and tuning them depends on experience.
First, the algorithm
The algorithm part only gives an intuitive analysis; for the theoretical proofs and a more precise formal description, refer to the Wiki: http://en.wikipedia.org/wiki/dbscan
DBSCAN is relatively simple: once a few concepts are understood, the algorithm itself follows naturally.
(1) Several variables
Neighborhood radius e, minimum number of points minopt
(2) Several terms
Core object: if the number of objects within the neighborhood radius e of an object is greater than or equal to minopt, the object is called a core object.
Directly density-reachable: if p is a core object and q is any point within its neighborhood radius, then q is directly density-reachable from p.
(3) Algorithm flow
Main flow: input e, minopt and the object set N
I. Find an unmarked core object k and mark it; if no core object can be found, exit
II. Expand this core object: expand(k)
III. If all objects are marked, exit; otherwise go to I
Expand process: input core object k
I. Initialize a set S and put k in it
II. Iterate over the elements of S; for each core object in S, find all of its unmarked directly density-reachable objects, put them in S, and mark them
III. If step II added no new objects, exit; otherwise go to II
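To make the flow above concrete, here is a minimal Java sketch. It is not Weka's code: the class name DbscanSketch, the neighbours helper, the label constants and the plain double[][] input are all assumptions made for illustration. Like the Weka code analyzed in this post, it counts a point as its own neighbour and uses a strict distance < e test.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Minimal DBSCAN sketch; every name here is illustrative, not Weka's. */
final class DbscanSketch {
    static final int UNCLASSIFIED = -2;
    static final int NOISE = -1;

    /** Indices of all points closer than e to points[i] (i itself included). */
    static List<Integer> neighbours(double[][] points, int i, double e) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            double sum = 0.0;
            for (int d = 0; d < points[i].length; d++) {
                double diff = points[i][d] - points[j][d];
                sum += diff * diff;
            }
            if (Math.sqrt(sum) < e) {
                result.add(j);
            }
        }
        return result;
    }

    /** Labels every point with a cluster id (0, 1, ...) or NOISE. */
    static int[] cluster(double[][] points, double e, int minopt) {
        int[] label = new int[points.length];
        Arrays.fill(label, UNCLASSIFIED);
        int clusterId = 0;
        for (int k = 0; k < points.length; k++) {
            if (label[k] != UNCLASSIFIED) {
                continue; // already marked
            }
            if (neighbours(points, k, e).size() < minopt) {
                label[k] = NOISE; // not core; may still become a border point later
                continue;
            }
            // k is a core object: expand(k)
            Deque<Integer> s = new ArrayDeque<>();
            label[k] = clusterId;
            s.add(k);
            while (!s.isEmpty()) {
                int p = s.poll();
                List<Integer> reachable = neighbours(points, p, e);
                if (reachable.size() < minopt) {
                    continue; // p is a border point, so it is not expanded
                }
                for (int q : reachable) {
                    if (label[q] == UNCLASSIFIED) {
                        label[q] = clusterId;
                        s.add(q); // unmarked, so it enters S exactly once
                    } else if (label[q] == NOISE) {
                        label[q] = clusterId; // known non-core point, label only
                    }
                }
            }
            clusterId++;
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0.0, 0.0}, {0.0, 1.0}, {1.0, 0.0},
            {10.0, 10.0}, {10.0, 11.0}, {11.0, 10.0},
            {50.0, 50.0}
        };
        System.out.println(Arrays.toString(cluster(points, 1.5, 3)));
    }
}
```

With e = 1.5 and minopt = 3, the two groups of three points each form a cluster and the isolated point is labeled as noise.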
When analyzing the implementation of Weka, in addition to the code itself, focus on the following points:
(1) Whether special data structures are used to improve efficiency
(2) Handling of missing values
(3) Handling of noise
(4) Other implementation techniques
(5) Differences from the original DBSCAN
Second, the SequentialDatabase class
Before analyzing the buildClusterer method itself, first look at the SequentialDatabase class. It is an auxiliary class used by the DBSCAN method that wraps the instances and exposes some custom query operations.
(1) epsilonRangeQuery, which finds all objects within distance epsilon of a given object queryDataObject
    public List epsilonRangeQuery(double epsilon, DataObject queryDataObject) {
        ArrayList epsilonRange_List = new ArrayList();
        Iterator iterator = dataObjectIterator();
        while (iterator.hasNext()) {
            DataObject dataObject = (DataObject) iterator.next();
            // By default the distance calculator uses the Euclidean distance
            double distance = queryDataObject.distance(dataObject);
            if (distance < epsilon) {
                epsilonRange_List.add(dataObject);
            }
        }
        return epsilonRange_List;
    }
You can see that the function iterates through all the objects, so its time complexity is O(n).
(2) k_nextNeighbourQuery, which returns a list whose index 0 holds the k nearest objects and whose index 1 holds the objects within distance epsilon
    public List k_nextNeighbourQuery(int k, double epsilon, DataObject dataObject) {
        Iterator iterator = dataObjectIterator();
        List return_List = new ArrayList();
        List nextNeighbours_List = new ArrayList();
        List epsilonRange_List = new ArrayList();
        PriorityQueue priorityQueue = new PriorityQueue();
        while (iterator.hasNext()) {
            DataObject next_dataObject = (DataObject) iterator.next();
            double dist = dataObject.distance(next_dataObject);
            if (dist <= epsilon)
                epsilonRange_List.add(new EpsilonRange_ListElement(dist, next_dataObject));
            if (priorityQueue.size() < k) {
                priorityQueue.add(dist, next_dataObject);
            } else {
                if (dist < priorityQueue.getPriority(0)) {
                    // Remove the element with the maximum distance, which makes this a fixed-length queue
                    priorityQueue.next();
                    priorityQueue.add(dist, next_dataObject);
                }
            }
        }
        while (priorityQueue.hasNext()) {
            // Write the priority queue into the list, always inserting at index 0,
            // so the resulting list is in ascending order
            nextNeighbours_List.add(0, priorityQueue.next());
        }
        return_List.add(nextNeighbours_List);
        return_List.add(epsilonRange_List);
        return return_List;
    }
The design of this function has two problems. First, it relies on programming by convention: the caller has to know that the returned list holds two lists at index 0 and index 1, what kind of objects each list stores, and that the elements taken from the priority queue are in ascending order, which makes the function hard to reuse. Second, it partly duplicates epsilonRangeQuery (although it cannot simply call epsilonRangeQuery, because that would traverse all objects twice).
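One possible alternative to the convention-based return value is a small typed result object. The sketch below is only an illustration under assumptions: the names TypedNeighbourQuery, Neighbour and NeighbourResult are hypothetical, the data is one-dimensional to keep it short, and java.util.PriorityQueue stands in for Weka's own priority queue class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical typed variant of a combined k-nearest / epsilon-range query. */
final class TypedNeighbourQuery {
    /** A point index paired with its distance to the query value. */
    static final class Neighbour {
        final int index;
        final double distance;

        Neighbour(int index, double distance) {
            this.index = index;
            this.distance = distance;
        }
    }

    /** Named fields replace the "index 0 / index 1" convention. */
    static final class NeighbourResult {
        final List<Neighbour> kNearest;      // ascending by distance
        final List<Neighbour> withinEpsilon; // in traversal order

        NeighbourResult(List<Neighbour> kNearest, List<Neighbour> withinEpsilon) {
            this.kNearest = kNearest;
            this.withinEpsilon = withinEpsilon;
        }
    }

    static NeighbourResult query(double[] data, double queryValue, int k, double epsilon) {
        Comparator<Neighbour> byDistance = Comparator.comparingDouble((Neighbour n) -> n.distance);
        // Max-heap on distance: the head is the farthest of the current candidates
        PriorityQueue<Neighbour> heap = new PriorityQueue<>(byDistance.reversed());
        List<Neighbour> within = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            Neighbour n = new Neighbour(i, Math.abs(data[i] - queryValue));
            if (n.distance <= epsilon) {
                within.add(n);
            }
            if (heap.size() < k) {
                heap.add(n);
            } else if (!heap.isEmpty() && n.distance < heap.peek().distance) {
                heap.poll(); // drop the farthest candidate to keep at most k elements
                heap.add(n);
            }
        }
        List<Neighbour> nearest = new ArrayList<>(heap);
        nearest.sort(byDistance);
        return new NeighbourResult(nearest, within);
    }

    public static void main(String[] args) {
        NeighbourResult r = query(new double[] {5.0, 1.0, 3.0, 9.0, 2.0}, 0.0, 2, 3.0);
        for (Neighbour n : r.kNearest) {
            System.out.println(n.index + " at distance " + n.distance);
        }
        System.out.println(r.withinEpsilon.size() + " objects within epsilon");
    }
}
```

The single traversal is kept, so the duplication concern is unchanged; only the contract of the return value becomes explicit.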
(3) coreDistance, which returns the list produced by the function above with a third element appended: the distance to the farthest of the minPoints nearest neighbours, provided that distance does not exceed epsilon, and UNDEFINED otherwise.
    public List coreDistance(int minPoints, double epsilon, DataObject dataObject) {
        List list = k_nextNeighbourQuery(minPoints, epsilon, dataObject);
        if (((List) list.get(1)).size() < minPoints) {
            // Fewer than minPoints objects within epsilon: not a core object
            list.add(new Double(DataObject.UNDEFINED));
            return list;
        } else {
            List nextNeighbours_List = (List) list.get(0);
            // The last element of the ascending list is the farthest of the nearest neighbours
            PriorityQueueElement priorityQueueElement =
                (PriorityQueueElement) nextNeighbours_List.get(nextNeighbours_List.size() - 1);
            if (priorityQueueElement.getPriority() <= epsilon) {
                list.add(new Double(priorityQueueElement.getPriority()));
                return list;
            } else {
                list.add(new Double(DataObject.UNDEFINED));
                return list;
            }
        }
    }
Third, buildClusterer
Next comes buildClusterer, the entry point of every clusterer, which trains a clusterer from known samples.
The function itself is relatively simple.
    public void buildClusterer(Instances instances) throws Exception {
        // First check whether these instances can be clustered with DBSCAN; DBSCAN can handle
        // almost all attribute types (nominal, date, numeric, missing values)
        getCapabilities().testWithFail(instances);
        long time_1 = System.currentTimeMillis();
        processed_InstanceID = 0;
        numberOfGeneratedClusters = 0;
        clusterID = 0;
        replaceMissingValues_Filter = new ReplaceMissingValues();
        replaceMissingValues_Filter.setInputFormat(instances);
        Instances filteredInstances = Filter.useFilter(instances, replaceMissingValues_Filter);
        database = databaseForName(getDatabase_Type(), filteredInstances);
        for (int i = 0; i < database.getInstances().numInstances(); i++) {
            DataObject dataObject = dataObjectForName(getDatabase_distanceType(),
                database.getInstances().instance(i), Integer.toString(i), database);
            database.insert(dataObject); // insert into the database
        }
        database.setMinMaxValues();
        Iterator iterator = database.dataObjectIterator();
        // Iterating over all nodes is not the most efficient approach: if a variable recorded the
        // current number of UNCLASSIFIED objects, the loop could exit as soon as it reached 0,
        // although the time complexity would not change.
        while (iterator.hasNext()) {
            DataObject dataObject = (DataObject) iterator.next();
            if (dataObject.getClusterLabel() == DataObject.UNCLASSIFIED) {
                // If a point is not yet marked, try to expand it
                if (expandCluster(dataObject)) {
                    clusterID++;
                    numberOfGeneratedClusters++;
                }
            }
        }
        long time_2 = System.currentTimeMillis();
        // Strangely, the programming style differs from the rest of Weka: the clusterers and
        // classifiers analyzed before do not measure the elapsed time inside the training function.
        elapsedTime = (double) (time_2 - time_1) / 1000.0;
    }
Fourth, expandCluster
expandCluster, which expands a core object, is the main function of the clustering. It returns true if the expansion succeeds and false otherwise:
    private boolean expandCluster(DataObject dataObject) {
        // This query finds the objects within distance epsilon of the given object
        List seedList = database.epsilonRangeQuery(getEpsilon(), dataObject);
        if (seedList.size() < getMinPoints()) {
            // A non-core object is temporarily set to NOISE; it stays NOISE only if
            // no core object later pulls it into a cluster.
            dataObject.setClusterLabel(DataObject.NOISE);
            return false;
        }
        // Reaching this point means dataObject is a core object
        for (int i = 0; i < seedList.size(); i++) {
            DataObject seedListDataObject = (DataObject) seedList.get(i);
            // All objects in seedList belong to clusterID, which is an auto-incremented id
            seedListDataObject.setClusterLabel(clusterID);
            if (seedListDataObject.equals(dataObject)) {
                // Note that epsilonRangeQuery also returns the query object itself, so remove it here
                seedList.remove(i);
                i--;
            }
        }
        for (int j = 0; j < seedList.size(); j++) {
            DataObject seedListDataObject = (DataObject) seedList.get(j);
            // For each element of seedList, find the elements within its neighborhood
            List seedListDataObject_Neighbourhood =
                database.epsilonRangeQuery(getEpsilon(), seedListDataObject);
            if (seedListDataObject_Neighbourhood.size() >= getMinPoints()) {
                // Entering this loop means seedListDataObject is a core object
                for (int i = 0; i < seedListDataObject_Neighbourhood.size(); i++) {
                    DataObject p = (DataObject) seedListDataObject_Neighbourhood.get(i);
                    if (p.getClusterLabel() == DataObject.UNCLASSIFIED
                            || p.getClusterLabel() == DataObject.NOISE) {
                        if (p.getClusterLabel() == DataObject.UNCLASSIFIED) {
                            // Only UNCLASSIFIED objects are added to seedList, which guarantees that
                            // nothing is added twice. NOISE objects are not added because a NOISE
                            // object is definitely not a core object (guaranteed by the logic at the
                            // start of this function). This is a trick that uses a list plus labels
                            // to get the effect of a set; if I were implementing it, I would
                            // probably just use a Set.
                            seedList.add(p);
                        }
                        p.setClusterLabel(clusterID); // assign to the current cluster
                    }
                }
            }
            // It is not clear to me why the element is removed here; once traversed it is never
            // accessed again, so deleting it is unnecessary. Perhaps it saves memory, or perhaps
            // it is just the author's obsessive tidiness. (The author of this code does not seem
            // to like iterators and repeatedly uses index-based deletion, which is not very
            // elegant in Java, although I often do it myself.)
            seedList.remove(j);
            j--;
        }
        return true;
    }
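For comparison, the index-based removal of the query object (remove(i) followed by i--) could be written with an Iterator instead. This is only a minimal sketch: the helper name withoutTarget is hypothetical, and a generic list stands in for seedList.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper showing iterator-based removal instead of remove(i); i--. */
final class IteratorRemovalSketch {
    /** Returns a copy of source without the elements equal to target. */
    static <T> List<T> withoutTarget(List<T> source, T target) {
        List<T> copy = new ArrayList<>(source);
        Iterator<T> it = copy.iterator();
        while (it.hasNext()) {
            if (it.next().equals(target)) {
                it.remove(); // safe removal during iteration, no index bookkeeping
            }
        }
        return copy;
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList("a", "b", "a", "c");
        System.out.println(withoutTarget(seeds, "a"));
    }
}
```

On Java 8 and later, copy.removeIf(x -> x.equals(target)) expresses the same thing in one line.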
Fifth, time complexity analysis
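A naive count gives O(n^3): the main loop of buildClusterer runs n times, and expandCluster calls epsilonRangeQuery (itself O(n)) for each element of a seed list that can hold up to n objects. That bound is loose, though. An object is appended to seedList only while it is still UNCLASSIFIED and is labeled right afterwards, so over the whole run the number of epsilonRangeQuery calls grows roughly linearly with n. The overall cost is therefore closer to O(n^2), which is still not very efficient for large data sets.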
Optimization points:
buildClusterer cannot be made better than O(n), but a counter of the objects still unmarked could save some work. expandCluster has no obvious optimization point. epsilonRangeQuery, however, can be optimized in at least two places. The first is to use a KDTree (as the XMeans algorithm does, see the previous blog post) to find the points within a given distance of a given point more efficiently. The second is to cache the distances of point pairs, since the distance between the same two points is actually computed several times during a run.
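The caching idea can be sketched roughly as follows. This is an assumption-laden illustration rather than anything present in Weka: the class DistanceCacheSketch, its fields and the index-pair key are all made up here, and points are addressed by their index in a double[][] array.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache for pairwise Euclidean distances, keyed by an unordered index pair. */
final class DistanceCacheSketch {
    private final double[][] points;
    private final Map<Long, Double> cache = new HashMap<>();
    private int computations = 0; // how many distances were actually computed

    DistanceCacheSketch(double[][] points) {
        this.points = points;
    }

    double distance(int i, int j) {
        if (i == j) {
            return 0.0;
        }
        // distance(i, j) equals distance(j, i), so order the pair before building the key
        long lo = Math.min(i, j);
        long hi = Math.max(i, j);
        long key = lo * points.length + hi;
        Double cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        double sum = 0.0;
        for (int d = 0; d < points[i].length; d++) {
            double diff = points[i][d] - points[j][d];
            sum += diff * diff;
        }
        double result = Math.sqrt(sum);
        computations++;
        cache.put(key, result);
        return result;
    }

    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        DistanceCacheSketch c = new DistanceCacheSketch(new double[][] {{0.0, 0.0}, {3.0, 4.0}});
        System.out.println(c.distance(0, 1));
        System.out.println(c.distance(1, 0)); // served from the cache
        System.out.println(c.computations());
    }
}
```

The trade-off is memory: in the worst case the cache holds one entry per point pair, so for large n a bounded cache or a spatial index is probably the better choice.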
Sixth, clusterInstance
This function takes an Instance as its parameter and should return the cluster that the instance belongs to, but DBSCAN does not seem to do so.
    public int clusterInstance(Instance instance) throws Exception {
        if (processed_InstanceID >= database.size()) processed_InstanceID = 0;
        int cnum = (database.getDataObject(Integer.toString(processed_InstanceID++))).getClusterLabel();
        if (cnum == DataObject.NOISE)
            throw new Exception();
        else
            return cnum;
    }
It simply returns, in turn, the cluster labels of the stored objects with IDs 0, 1, 2 and so on, without looking at its argument; I do not know what this is intended to do.
And if the object is noise, it throws an exception directly, without any explanation of why.
The purpose of the whole function is unclear.
Seventh, summary
If I have to write a summary, I am personally rather disappointed with this code. Whether in the abstraction of some functions, the data structure design, or the Java code style, it has a strong "amateur" flavor, quite unlike the neat code of the classifiers analyzed earlier (well, it was not written by the same person).
In addition, the behavior of clusterInstance at the end is completely inconsistent with its comments; I do not know whether this is a bug, a feature, or something else.
Weka algorithm Clusterers-DBSCAN source code analysis