If the world could have only one density-based clustering algorithm, it would have to be DBSCAN (density-based spatial clustering of applications with noise). As the typical density-based clustering algorithm, its greatest advantage over k-means is that it determines the number of clusters by itself. It can also filter out some noise points. On the other hand, it is quite sensitive to its input parameters, and tuning them depends on experience.
First, the algorithm
This part gives only an intuitive description of the algorithm. For the theoretical proofs and a more precise formal description, see Wikipedia: http://en.wikipedia.org/wiki/dbscan
DBSCAN is relatively simple: once a few concepts are understood, the algorithm itself follows naturally.
(1) Two parameters
Neighborhood radius e and minimum number of points MinPts
(2) Two terms
Core object: an object is called a core object if the number of objects within its neighborhood radius e is greater than or equal to MinPts.
Directly density-reachable: if p is a core object and q is any object within its neighborhood radius e, then q is directly density-reachable from p.
(3) Algorithm flow
Main flow: input e, MinPts and the object set N
I. Find an unmarked core object k and mark it. If no such core object can be found, exit.
II. Expand this core object: expand(k)
III. If all objects are marked, exit; otherwise go to I.
Expand process: input core object k
I. Initialize a set S and put k into it.
II. Iterate over the elements of S. For each core object in S, find all of its unmarked directly density-reachable objects, put them into S, and mark them.
III. If step II added no new object, exit; otherwise go to II.
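To make the flow above concrete, here is a minimal sketch in Java. It is not the Weka code: the class name DbscanSketch, the label constants and the use of plain double[] arrays are my own assumptions for illustration. It uses a linear scan with Euclidean distance and counts an object as a neighbor when its distance is at most e.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration class, not part of Weka.
final class DbscanSketch {
    static final int UNCLASSIFIED = -2; // not visited yet
    static final int NOISE = -1;        // visited, belongs to no cluster

    // Euclidean distance between two points of equal dimension.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    }

    // Linear scan: indices of all points within distance e of points[q], q included.
    static List<Integer> rangeQuery(double[][] points, int q, double e) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            if (distance(points[q], points[i]) <= e) {
                result.add(i);
            }
        }
        return result;
    }

    // Returns one label per point: a cluster id starting at 0, or NOISE.
    static int[] cluster(double[][] points, double e, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNCLASSIFIED);
        int clusterId = 0;
        for (int k = 0; k < points.length; k++) {
            if (labels[k] != UNCLASSIFIED) continue;
            List<Integer> neighbours = rangeQuery(points, k, e);
            if (neighbours.size() < minPts) {
                labels[k] = NOISE; // may still become a border point later
                continue;
            }
            // k is a core object: expand a new cluster from it.
            labels[k] = clusterId;
            Deque<Integer> seeds = new ArrayDeque<>(neighbours);
            while (!seeds.isEmpty()) {
                int s = seeds.poll();
                if (labels[s] == NOISE) labels[s] = clusterId; // border point
                if (labels[s] != UNCLASSIFIED) continue;       // already handled
                labels[s] = clusterId;
                List<Integer> reachable = rangeQuery(points, s, e);
                if (reachable.size() >= minPts) {
                    seeds.addAll(reachable); // s is a core object too
                }
            }
            clusterId++;
        }
        return labels;
    }
}
```

For the one-dimensional points 0, 0.5, 1.0, 10, 10.5, 11 and 50 with e = 0.6 and MinPts = 2, cluster returns two clusters of three points each and marks the point 50 as noise.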
When analyzing the Weka implementation, besides the code itself, I focus on the following points:
(1) Whether special data structures are used to improve efficiency
(2) Handling of missing values
(3) Handling of noise
(4) Other implementation techniques
(5) Differences from the original DBSCAN
Second, the SequentialDatabase class
Before analyzing the buildClusterer method in detail, let us look at the SequentialDatabase class, an auxiliary class used by DBSCAN. It encapsulates an Instances object and exposes some custom query operations.
(1) epsilonRangeQuery: this function finds all objects whose distance from a given object queryDataObject is less than epsilon.
public List epsilonRangeQuery(double epsilon, DataObject queryDataObject) {
    ArrayList epsilonRange_List = new ArrayList();
    Iterator iterator = dataObjectIterator();
    while (iterator.hasNext()) {
        DataObject dataObject = (DataObject) iterator.next();
        // by default the distance calculator uses the Euclidean distance
        double distance = queryDataObject.distance(dataObject);
        if (distance < epsilon) {
            epsilonRange_List.add(dataObject);
        }
    }
    return epsilonRange_List;
}
As can be seen, the function traverses all objects, so its time complexity is O(n).
(2) k_nextNeighbourQuery: returns a list in which index 0 holds the k nearest objects and index 1 holds the objects within distance epsilon.
public List k_nextNeighbourQuery(int k, double epsilon, DataObject dataObject) {
    Iterator iterator = dataObjectIterator();
    List return_List = new ArrayList();
    List nextNeighbours_List = new ArrayList();
    List epsilonRange_List = new ArrayList();
    PriorityQueue priorityQueue = new PriorityQueue();
    while (iterator.hasNext()) {
        DataObject next_dataObject = (DataObject) iterator.next();
        double dist = dataObject.distance(next_dataObject);
        if (dist <= epsilon) epsilonRange_List.add(new EpsilonRange_ListElement(dist, next_dataObject));
        if (priorityQueue.size() < k) {
            priorityQueue.add(dist, next_dataObject);
        } else {
            if (dist < priorityQueue.getPriority(0)) {
                // remove the element with the largest distance, which makes this a fixed-length queue
                priorityQueue.next();
                priorityQueue.add(dist, next_dataObject);
            }
        }
    }
    while (priorityQueue.hasNext()) {
        // drain the priority queue into the list, always inserting at index 0,
        // so the list ends up in ascending order
        nextNeighbours_List.add(0, priorityQueue.next());
    }
    return_List.add(nextNeighbours_List);
    return_List.add(epsilonRange_List);
    return return_List;
}
The design of this function deserves criticism. First, it relies on programming by convention: the caller has to know what index 0 and index 1 contain, which object types are stored in each list, and that the elements taken from the priority queue end up in ascending order. This makes the function very hard to reuse.
Second, it partially duplicates epsilonRangeQuery (although it cannot simply call epsilonRangeQuery, since that would traverse all objects twice).
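As a rough illustration of what a less convention-dependent design could look like, the sketch below returns a small typed result instead of a list whose indices carry meaning. Everything here is hypothetical (the class NeighbourQuerySketch, the Result type, and working on a plain array of precomputed distances rather than on DataObject); it only shows the bounded max-heap idea using java.util.PriorityQueue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration class, not part of Weka.
final class NeighbourQuerySketch {

    // Typed result: the field names document what the caller gets.
    static final class Result {
        final List<Double> kNearest;      // the k smallest distances, ascending
        final List<Double> withinEpsilon; // all distances <= epsilon, in input order

        Result(List<Double> kNearest, List<Double> withinEpsilon) {
            this.kNearest = kNearest;
            this.withinEpsilon = withinEpsilon;
        }
    }

    // One pass over the distances, as in k_nextNeighbourQuery.
    static Result query(double[] distances, int k, double epsilon) {
        // Max-heap of at most k elements: the head is the largest distance kept so far.
        PriorityQueue<Double> heap = new PriorityQueue<>(Collections.reverseOrder());
        List<Double> within = new ArrayList<>();
        for (double d : distances) {
            if (d <= epsilon) {
                within.add(d);
            }
            if (heap.size() < k) {
                heap.add(d);
            } else if (!heap.isEmpty() && d < heap.peek()) {
                heap.poll(); // drop the current largest to keep the size fixed
                heap.add(d);
            }
        }
        List<Double> nearest = new ArrayList<>(heap);
        Collections.sort(nearest); // explicit ascending order, no insert-at-index-0 trick
        return new Result(nearest, within);
    }
}
```

With named fields the caller no longer needs to remember which index holds which list, and the ordering guarantee is stated in one place.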
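(3) coreDistance: this function returns the list produced by the function above and appends a third element (index 2), the core distance. This is the distance to the farthest of the minPoints nearest objects if that distance does not exceed epsilon; otherwise it is UNDEFINED.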
public List coreDistance(int minPoints, double epsilon, DataObject dataObject) {
    List list = k_nextNeighbourQuery(minPoints, epsilon, dataObject);
    if (((List) list.get(1)).size() < minPoints) {
        list.add(new Double(DataObject.UNDEFINED));
        return list;
    } else {
        List nextNeighbours_List = (List) list.get(0);
        PriorityQueueElement priorityQueueElement =
            (PriorityQueueElement) nextNeighbours_List.get(nextNeighbours_List.size() - 1);
        if (priorityQueueElement.getPriority() <= epsilon) {
            list.add(new Double(priorityQueueElement.getPriority()));
            return list;
        } else {
            list.add(new Double(DataObject.UNDEFINED));
            return list;
        }
    }
}
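The core distance itself is a small computation, and a stripped-down sketch may make it easier to see. The version below is my own simplification, not the Weka method: it works on a plain array of distances (assumed to include the distance 0 from the object to itself, as the Weka database scan does) and uses OptionalDouble in place of the UNDEFINED constant.

```java
import java.util.Arrays;
import java.util.OptionalDouble;

// Hypothetical illustration class, not part of Weka.
final class CoreDistanceSketch {

    // Distance to the minPoints-th nearest object if it is at most epsilon, else empty.
    static OptionalDouble coreDistance(double[] distances, int minPoints, double epsilon) {
        if (minPoints < 1 || distances.length < minPoints) {
            return OptionalDouble.empty();
        }
        double[] sorted = distances.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        double kth = sorted[minPoints - 1];
        // kth <= epsilon holds exactly when at least minPoints objects lie within epsilon
        return kth <= epsilon ? OptionalDouble.of(kth) : OptionalDouble.empty();
    }
}
```

Sorting costs O(n log n) here; the bounded priority queue in the Weka code avoids the full sort, so this form is for readability only.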
Third, buildClusterer
Next comes buildClusterer, the entry point of every clusterer, which trains a clusterer from known samples.
The function itself is relatively simple.
public void buildClusterer(Instances instances) throws Exception {
    // First check whether these instances can be clustered with DBSCAN.
    // DBSCAN can handle all attribute types (nominal, date, numeric, missing values).
    getCapabilities().testWithFail(instances);
    long time_1 = System.currentTimeMillis();
    processed_InstanceID = 0;
    numberOfGeneratedClusters = 0;
    clusterID = 0;
    replaceMissingValues_Filter = new ReplaceMissingValues();
    replaceMissingValues_Filter.setInputFormat(instances);
    Instances filteredInstances = Filter.useFilter(instances, replaceMissingValues_Filter);
    database = databaseForName(getDatabase_Type(), filteredInstances);
    for (int i = 0; i < database.getInstances().numInstances(); i++) {
        DataObject dataObject = dataObjectForName(getDatabase_distanceType(),
            database.getInstances().instance(i), Integer.toString(i), database);
        database.insert(dataObject); // insert into the database
    }
    database.setMinMaxValues();
    Iterator iterator = database.dataObjectIterator();
    while (iterator.hasNext()) {
        // Iterating over every object is not the most efficient approach: if a counter tracked
        // the number of UNCLASSIFIED objects, the loop could exit as soon as it reached 0,
        // although the time complexity would not change.
        DataObject dataObject = (DataObject) iterator.next();
        if (dataObject.getClusterLabel() == DataObject.UNCLASSIFIED) {
            // the object is not marked yet, so try to expand it
            if (expandCluster(dataObject)) {
                clusterID++;
                numberOfGeneratedClusters++;
            }
        }
    }
    long time_2 = System.currentTimeMillis();
    // Oddly, Weka's implementations differ in programming style: the clusterers and classifiers
    // analyzed earlier, at least, do not measure the elapsed time directly in the training function.
    elapsedTime = (double) (time_2 - time_1) / 1000.0;
}
Fourth, expandCluster
Expanding a core object is the main function of the clusterer. It returns true if the expansion succeeds and false otherwise:
private boolean expandCluster(DataObject dataObject) {
    // find the objects within distance epsilon of the given object
    List seedList = database.epsilonRangeQuery(getEpsilon(), dataObject);
    if (seedList.size() < getMinPoints()) {
        // Not a core object: temporarily mark it as noise.
        // If no core object reaches it later, it stays noise.
        dataObject.setClusterLabel(DataObject.NOISE);
        return false;
    }
    // reaching this point means dataObject is a core object
    for (int i = 0; i < seedList.size(); i++) {
        DataObject seedListDataObject = (DataObject) seedList.get(i);
        // every object in seedList belongs to clusterID, which is an auto-incrementing counter
        seedListDataObject.setClusterLabel(clusterID);
        if (seedListDataObject.equals(dataObject)) {
            // note that epsilonRangeQuery also returns the object itself, so remove it here
            seedList.remove(i);
            i--;
        }
    }
    for (int j = 0; j < seedList.size(); j++) {
        DataObject seedListDataObject = (DataObject) seedList.get(j);
        // for each element in seedList, find the elements within its neighborhood
        List seedListDataObject_Neighbourhood = database.epsilonRangeQuery(getEpsilon(), seedListDataObject);
        if (seedListDataObject_Neighbourhood.size() >= getMinPoints()) {
            // entering this loop means seedListDataObject is a core object
            for (int i = 0; i < seedListDataObject_Neighbourhood.size(); i++) {
                DataObject p = (DataObject) seedListDataObject_Neighbourhood.get(i);
                if (p.getClusterLabel() == DataObject.UNCLASSIFIED || p.getClusterLabel() == DataObject.NOISE) {
                    if (p.getClusterLabel() == DataObject.UNCLASSIFIED) {
                        // If p is unclassified, add it to seedList. The UNCLASSIFIED check guarantees
                        // that no object is added twice; NOISE objects are not added because noise is
                        // certainly not a core object (guaranteed by the logic at the start of this
                        // function). This is a trick: a list plus a label gives the effect of a set.
                        // If I were implementing this, I would simply use a Set.
                        seedList.add(p);
                    }
                    p.setClusterLabel(clusterID); // assign p to the current cluster
                }
            }
        }
        // It is not clear why the element is removed here; deleting after traversal is unnecessary.
        // Perhaps it is to save memory, or perhaps the author is simply obsessive. (The author of
        // this code does not seem to like iterators and often uses index-based deletion, which is
        // not very elegant in Java, although I often do it too.)
        seedList.remove(j);
        j--;
    }
    return true;
}
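For comparison, here is how the same expansion might look with a Set for membership and a Deque as the work list, which removes the need for the remove(j) and j-- bookkeeping. This is only a sketch under simplifying assumptions: the class ExpandSketch is hypothetical, neighborhoods are precomputed as index lists (each containing the object itself), and labels assigned by earlier clusters are ignored.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class, not part of Weka.
final class ExpandSketch {

    // neighbourhoods.get(i) lists the indices within epsilon of object i, i included.
    // Returns the members of the cluster grown from start, or an empty set if
    // start is not a core object.
    static Set<Integer> expand(List<List<Integer>> neighbourhoods, int start, int minPoints) {
        Set<Integer> members = new LinkedHashSet<>();
        if (neighbourhoods.get(start).size() < minPoints) {
            return members; // not a core object, nothing to expand
        }
        Deque<Integer> seeds = new ArrayDeque<>();
        members.add(start);
        seeds.add(start);
        while (!seeds.isEmpty()) {
            int current = seeds.poll(); // O(1), unlike ArrayList.remove(0)
            List<Integer> hood = neighbourhoods.get(current);
            if (hood.size() < minPoints) {
                continue; // border object: it is a member but is not expanded further
            }
            for (int p : hood) {
                if (members.add(p)) { // add() returns false for objects already seen
                    seeds.add(p);
                }
            }
        }
        return members;
    }
}
```

The Set makes the "no duplicates" guarantee explicit instead of relying on the UNCLASSIFIED label, at the price of some extra memory.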
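Fifth, time complexity analysis
The main loop of buildClusterer is O(n). expandCluster calls epsilonRangeQuery, itself O(n), for each element of the seed list, which can hold up to n elements, so it is O(n^2). Multiplying these naively gives O(n^3) for the whole algorithm, which is not very efficient. This is a loose upper bound, though: an object is appended to the seed list only while it is still unclassified, so across all expandCluster calls the number of range queries stays roughly proportional to n, and the total cost is closer to O(n^2).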
Optimization points:
buildClusterer cannot be made better than O(n), but a counter recording the number of unmarked objects would improve efficiency somewhat. expandCluster has no obvious optimization points. epsilonRangeQuery, however, can be optimized in at least two ways. The first is to use a KD-tree (as the XMeans algorithm does; see the previous blog post) to find the points within a given distance more efficiently. The second is to cache the distances between pairs of points, since the distance between the same pair is in fact computed several times in the program.
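The caching idea can be sketched in a few lines. The class below is hypothetical and not part of Weka: it assumes points are addressed by non-negative integer indices and exploits the symmetry of the Euclidean distance, so the pair (i, j) and the pair (j, i) share one cache entry. The computations field exists only to show how often the distance is really evaluated.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of Weka.
final class DistanceCacheSketch {
    private final double[][] points;
    private final Map<Long, Double> cache = new HashMap<>();
    int computations = 0; // number of distances actually computed

    DistanceCacheSketch(double[][] points) {
        this.points = points;
    }

    // Euclidean distance between points i and j, computed at most once per unordered pair.
    double distance(int i, int j) {
        if (i == j) {
            return 0.0;
        }
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        long key = ((long) lo << 32) | hi; // pack the ordered pair into one long
        Double cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        computations++;
        double sum = 0.0;
        for (int d = 0; d < points[lo].length; d++) {
            double diff = points[lo][d] - points[hi][d];
            sum += diff * diff;
        }
        double result = Math.sqrt(sum);
        cache.put(key, result);
        return result;
    }
}
```

The trade-off is memory: in the worst case the cache holds an entry for every pair, that is O(n^2) entries, so for large data sets a spatial index such as a KD-tree is probably the better first step.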
Sixth, clusterInstance
This function receives an Instance as a parameter and should return the cluster that the instance belongs to. But DBSCAN does not seem to do so.
public int clusterInstance(Instance instance) throws Exception {
    if (processed_InstanceID >= database.size()) processed_InstanceID = 0;
    int cnum = (database.getDataObject(Integer.toString(processed_InstanceID++))).getClusterLabel();
    if (cnum == DataObject.NOISE)
        throw new Exception();
    else
        return cnum;
}
It returns, in turn, the cluster labels of the training objects with IDs 0, 1, 2, ..., regardless of the instance passed in; it is not clear what this is intended to do.
If the object is noise, it simply throws an exception, without any explanation of why the exception is thrown.
The significance of the whole function is unclear.
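For contrast, one plausible way to assign a new instance to a cluster is to look for the nearest core object within epsilon and reuse its label. To be clear, this is not what Weka does, and DBSCAN itself does not prescribe a prediction rule; the class AssignSketch and its parameters are assumptions made only for this sketch.

```java
// Hypothetical illustration class, not part of Weka.
final class AssignSketch {
    static final int NOISE = -1;

    // corePoints[i] is a core object that belongs to cluster coreLabels[i].
    // Returns the label of the nearest core object within epsilon, or NOISE if there is none.
    static int assign(double[] query, double[][] corePoints, int[] coreLabels, double epsilon) {
        int best = NOISE;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < corePoints.length; i++) {
            double sum = 0.0;
            for (int d = 0; d < query.length; d++) {
                double diff = query[d] - corePoints[i][d];
                sum += diff * diff;
            }
            double dist = Math.sqrt(sum);
            if (dist <= epsilon && dist < bestDistance) {
                bestDistance = dist;
                best = coreLabels[i];
            }
        }
        return best;
    }
}
```

Returning a noise label instead of throwing an exception also lets the caller decide how to treat outliers.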
Seventh, summary
If I have to write a summary: I personally find this code rather disappointing. Whether in the design of the function abstractions, the design of the data structures, or the Java code style, it has a strong "amateur" flavor, quite unlike the neat code of the classifiers analyzed earlier (well, it was not written by the same person).
In addition, the behavior of clusterInstance is completely inconsistent with its comment; I do not know whether this is a bug, a feature, or something else.
Weka algorithm Clusterers-DBSCAN source code analysis