Weka Algorithm Clusterers-DBSCAN Source Code Analysis

Last Update:2016-03-12 Source: Internet

Author: User


If the world could have only one density-based clustering algorithm, it would have to be DBSCAN (Density-Based Spatial Clustering of Applications with Noise). As a typical density-based clustering algorithm, its greatest advantage over KMeans is that it determines the number of clusters by itself, and it can also filter out some noise points. On the other hand, it is quite sensitive to its input parameters, and tuning them depends on experience.

First, the algorithm

This section only gives an intuitive analysis of the algorithm. For the theoretical details and a more precise formal description, see the wiki: http://en.wikipedia.org/wiki/dbscan

The DBSCAN algorithm is relatively simple; once a few concepts are understood, the algorithm itself follows naturally.

(1) Several variables

Neighborhood radius e and minimum number of objects minopt (usually written Eps and MinPts in the literature).

(2) Several terms

Core object: if the number of objects within radius e of an object is greater than or equal to minopt, the object is called a core object.

Directly density-reachable: if p is a core object and q is any point within radius e of p, then q is directly density-reachable from p.

(3) Algorithm flow

Main flow: input e, minopt and the object set N

I. Find an unmarked core object k and mark it. If no such core object can be found, exit

II. Expand this core object: Expand(k)

III. If all objects are marked, exit; otherwise go to I

Expand process: input core object k

I. Initialize a set S and put k into it

II. Iterate over the elements of S. For each core object in S, find all unmarked objects that are directly density-reachable from it, put them into S, and mark them

III. If step II added no new object, exit; otherwise go to II
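The two flows above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not Weka's code: the class name DbscanSketch, the label constants, the use of 2-D points in a double[][] and the "<= e" comparison are all choices made for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Minimal DBSCAN sketch over 2-D points; not Weka's implementation. */
final class DbscanSketch {
    static final int UNCLASSIFIED = -2; // hypothetical label values
    static final int NOISE = -1;

    /** Linear-scan neighbourhood query: indices of all points within e of points[q]. */
    static List<Integer> neighbours(double[][] points, int q, double e) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            double dx = points[i][0] - points[q][0];
            double dy = points[i][1] - points[q][1];
            if (Math.sqrt(dx * dx + dy * dy) <= e) {
                result.add(i);
            }
        }
        return result;
    }

    /** Returns one label per point: a cluster id starting at 0, or NOISE. */
    static int[] cluster(double[][] points, double e, int minopt) {
        int[] label = new int[points.length];
        Arrays.fill(label, UNCLASSIFIED);
        int clusterId = 0;
        for (int k = 0; k < points.length; k++) {
            if (label[k] != UNCLASSIFIED) continue;
            List<Integer> seeds = neighbours(points, k, e);
            if (seeds.size() < minopt) { // k is not a core object
                label[k] = NOISE;        // may be relabelled later as a border point
                continue;
            }
            // Expand(k): the set S is kept as a work queue.
            Deque<Integer> s = new ArrayDeque<>();
            label[k] = clusterId;
            s.add(k);
            while (!s.isEmpty()) {
                int p = s.poll();
                List<Integer> reach = neighbours(points, p, e);
                if (reach.size() < minopt) continue; // only core objects extend the cluster
                for (int q : reach) {
                    if (label[q] == UNCLASSIFIED) s.add(q);
                    if (label[q] == UNCLASSIFIED || label[q] == NOISE) label[q] = clusterId;
                }
            }
            clusterId++;
        }
        return label;
    }
}
```

Calling DbscanSketch.cluster(points, e, minopt) returns one label per input point. The main loop corresponds to steps I to III of the main flow, and the while loop to the Expand process.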

When analyzing Weka's implementation, besides the code itself, I focus on the following points:

(1) Whether special data structures are used to improve efficiency

(2) Handling of missing values

(3) Handling of noise

(4) Other implementation techniques

(5) Differences from the original DBSCAN

Second, the SequentialDatabase class

Before analyzing the buildClusterer method in detail, let us look at the SequentialDatabase class, an auxiliary class used by DBSCAN. It encapsulates an Instances object and exposes some custom query operations.

(1) epsilonRangeQuery: this function finds all objects whose distance from the given object queryDataObject is less than epsilon.

public List epsilonRangeQuery(double epsilon, DataObject queryDataObject) {
    ArrayList epsilonRange_List = new ArrayList();
    Iterator iterator = dataObjectIterator();
    while (iterator.hasNext()) {
        DataObject dataObject = (DataObject) iterator.next();
        double distance = queryDataObject.distance(dataObject); // by default, the distance is Euclidean
        if (distance < epsilon) {
            epsilonRange_List.add(dataObject);
        }
    }

    return epsilonRange_List;
}

As can be seen, the function traverses all objects, so its time complexity is O(n).

(2) k_nextNeighbourQuery returns a list in which index 0 holds the k nearest objects and index 1 holds the objects within distance epsilon.

public List k_nextNeighbourQuery(int k, double epsilon, DataObject dataObject) {
    Iterator iterator = dataObjectIterator();

    List return_List = new ArrayList();
    List nextNeighbours_List = new ArrayList();
    List epsilonRange_List = new ArrayList();

    PriorityQueue priorityQueue = new PriorityQueue();

    while (iterator.hasNext()) {
        DataObject next_dataObject = (DataObject) iterator.next();
        double dist = dataObject.distance(next_dataObject);

        if (dist <= epsilon) epsilonRange_List.add(new EpsilonRange_ListElement(dist, next_dataObject));

        if (priorityQueue.size() < k) {
            priorityQueue.add(dist, next_dataObject);
        } else {
            if (dist < priorityQueue.getPriority(0)) {
                priorityQueue.next(); // remove the element with the maximum distance, which keeps the queue at a fixed length
                priorityQueue.add(dist, next_dataObject);
            }
        }
    }

    while (priorityQueue.hasNext()) {
        // drain the priority queue into the list; each element is inserted at index 0,
        // so the resulting list is in ascending order
        nextNeighbours_List.add(0, priorityQueue.next());
    }

    return_List.add(nextNeighbours_List);
    return_List.add(epsilonRange_List);
    return return_List;
}

The design of this function deserves some criticism. First, it relies on programming by convention: the caller has to know what index 0 and index 1 hold, what type of object each inner list stores, and that the elements taken from the priority queue end up in ascending order. This makes the function very hard to reuse.

Second, it partially duplicates epsilonRangeQuery (although it cannot simply call epsilonRangeQuery, since that would mean traversing all objects twice).
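One way to avoid the positional contract is a small result type with named fields. The sketch below is hypothetical and not part of Weka: NeighbourQuery, Neighbour, Result and the use of 1-D values are all invented for illustration. It uses java.util.PriorityQueue as a bounded max-heap to keep the k nearest candidates in a single pass, the same idea as the fixed-length queue above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical typed alternative to the positional List returned by k_nextNeighbourQuery. */
final class NeighbourQuery {
    /** A point index together with its distance to the query point. */
    static final class Neighbour {
        final int index;
        final double distance;

        Neighbour(int index, double distance) {
            this.index = index;
            this.distance = distance;
        }
    }

    /** Named fields replace the "index 0 / index 1" convention. */
    static final class Result {
        final List<Neighbour> kNearest = new ArrayList<>();      // ascending by distance
        final List<Neighbour> withinEpsilon = new ArrayList<>(); // in scan order
    }

    static Result query(double[] values, int q, int k, double epsilon) {
        if (k < 1) throw new IllegalArgumentException("k must be at least 1");
        Result result = new Result();
        // Max-heap on distance: the head is the farthest of the k candidates kept so far.
        PriorityQueue<Neighbour> heap = new PriorityQueue<>(
                Comparator.comparingDouble((Neighbour n) -> n.distance).reversed());
        for (int i = 0; i < values.length; i++) {
            Neighbour n = new Neighbour(i, Math.abs(values[i] - values[q]));
            if (n.distance <= epsilon) {
                result.withinEpsilon.add(n);
            }
            if (heap.size() < k) {
                heap.add(n);
            } else if (n.distance < heap.peek().distance) {
                heap.poll(); // drop the current farthest candidate
                heap.add(n);
            }
        }
        while (!heap.isEmpty()) {
            result.kNearest.add(heap.poll()); // farthest first
        }
        Collections.reverse(result.kNearest); // nearest first
        return result;
    }
}
```

A caller then reads result.kNearest and result.withinEpsilon by name, and the element type is checked by the compiler instead of being agreed on informally.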

(3) coreDistance. This function returns the list produced by the previous function and appends a third element (index 2): the distance to the farthest of the minPoints nearest objects if that distance does not exceed epsilon, otherwise UNDEFINED.

public List coreDistance(int minPoints, double epsilon, DataObject dataObject) {
    List list = k_nextNeighbourQuery(minPoints, epsilon, dataObject);

    if (((List) list.get(1)).size() < minPoints) {
        list.add(new Double(DataObject.UNDEFINED));
        return list;
    } else {
        List nextNeighbours_List = (List) list.get(0);
        PriorityQueueElement priorityQueueElement =
                (PriorityQueueElement) nextNeighbours_List.get(nextNeighbours_List.size() - 1);
        if (priorityQueueElement.getPriority() <= epsilon) {
            list.add(new Double(priorityQueueElement.getPriority()));
            return list;
        } else {
            list.add(new Double(DataObject.UNDEFINED));
            return list;
        }
    }
}
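The rule coreDistance applies can be shown in isolation. The following is a simplified sketch, not Weka's code: it works on 1-D values, sorts all distances instead of using a bounded queue, and returns an empty OptionalDouble where Weka stores DataObject.UNDEFINED. The class and method names are made up for the example.

```java
import java.util.Arrays;
import java.util.OptionalDouble;

/** Sketch of the core-distance rule on 1-D data; names are illustrative, not Weka's. */
final class CoreDistanceSketch {
    static OptionalDouble coreDistance(double[] values, int q, int minPoints, double epsilon) {
        if (minPoints < 1 || minPoints > values.length) {
            return OptionalDouble.empty();
        }
        double[] dist = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            dist[i] = Math.abs(values[i] - values[q]); // includes the query point itself (distance 0)
        }
        Arrays.sort(dist);
        double kth = dist[minPoints - 1]; // distance to the minPoints-th nearest object
        // Defined only when at least minPoints objects lie within epsilon.
        return kth <= epsilon ? OptionalDouble.of(kth) : OptionalDouble.empty();
    }
}
```

Sorting costs O(n log n) per query, so this form is meant to make the rule readable rather than fast.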

Third, buildClusterer

Next comes buildClusterer, the entry point of every clusterer, which trains the clusterer on a known set of samples.

The function itself is relatively simple.

public void buildClusterer(Instances instances) throws Exception {
    // First check whether these instances can be clustered with DBSCAN.
    // DBSCAN can handle all attribute types (nominal, date, numeric, missing values).
    getCapabilities().testWithFail(instances);

    long time_1 = System.currentTimeMillis();

    processed_InstanceID = 0;
    numberOfGeneratedClusters = 0;
    clusterID = 0;

    replaceMissingValues_Filter = new ReplaceMissingValues();
    replaceMissingValues_Filter.setInputFormat(instances);
    Instances filteredInstances = Filter.useFilter(instances, replaceMissingValues_Filter);

    database = databaseForName(getDatabase_Type(), filteredInstances);
    for (int i = 0; i < database.getInstances().numInstances(); i++) {
        DataObject dataObject = dataObjectForName(getDatabase_distanceType(),
                database.getInstances().instance(i),
                Integer.toString(i),
                database);
        database.insert(dataObject); // insert into the database
    }
    database.setMinMaxValues();

    Iterator iterator = database.dataObjectIterator();
    // Iterating over every node is not the most efficient approach: if a variable tracked
    // the current number of UNCLASSIFIED objects, the loop could exit as soon as it reached 0,
    // although the time complexity would not change.
    while (iterator.hasNext()) {
        DataObject dataObject = (DataObject) iterator.next();
        if (dataObject.getClusterLabel() == DataObject.UNCLASSIFIED) {
            if (expandCluster(dataObject)) { // if a point is not yet marked, try to expand it
                clusterID++;
                numberOfGeneratedClusters++;
            }
        }
    }

    long time_2 = System.currentTimeMillis();
    // Odd: Weka's implementations differ in style. At least the clusterers and classifiers
    // analyzed earlier did not measure elapsed time inside the training function itself.
    elapsedTime = (double) (time_2 - time_1) / 1000.0;
}

Fourth, expandCluster

Expanding a core object is the main function of the clusterer. It returns true if the expansion succeeds and false otherwise:

private boolean expandCluster(DataObject dataObject) {
    // find the objects within distance epsilon of the given object
    List seedList = database.epsilonRangeQuery(getEpsilon(), dataObject);
    if (seedList.size() < getMinPoints()) {
        // Not a core object: temporarily mark it as noise. It stays noise only if
        // no core object later pulls it into a cluster.
        dataObject.setClusterLabel(DataObject.NOISE);
        return false;
    }

    // reaching this point means dataObject is a core object
    for (int i = 0; i < seedList.size(); i++) {
        DataObject seedListDataObject = (DataObject) seedList.get(i);
        // every object in seedList belongs to clusterID, which is an auto-incremented id
        seedListDataObject.setClusterLabel(clusterID);
        if (seedListDataObject.equals(dataObject)) {
            // epsilonRangeQuery also returns the object itself, so remove it here
            seedList.remove(i);
            i--;
        }
    }

    for (int j = 0; j < seedList.size(); j++) {
        DataObject seedListDataObject = (DataObject) seedList.get(j);
        // for each element of seedList, find the elements within its neighborhood
        List seedListDataObject_Neighbourhood = database.epsilonRangeQuery(getEpsilon(), seedListDataObject);

        if (seedListDataObject_Neighbourhood.size() >= getMinPoints()) {
            // entering this loop means seedListDataObject is a core object
            for (int i = 0; i < seedListDataObject_Neighbourhood.size(); i++) {
                DataObject p = (DataObject) seedListDataObject_Neighbourhood.get(i);
                if (p.getClusterLabel() == DataObject.UNCLASSIFIED || p.getClusterLabel() == DataObject.NOISE) {
                    if (p.getClusterLabel() == DataObject.UNCLASSIFIED) {
                        // Only unclassified objects are added to seedList, which prevents duplicates.
                        // Noise is not added because it is certainly not a core object (guaranteed by
                        // the check at the top of this function). This is a trick: a list plus labels
                        // gives the effect of a set. I would probably have used a Set directly.
                        seedList.add(p);
                    }
                    p.setClusterLabel(clusterID); // assign to the current cluster
                }
            }
        }
        // It is not clear why the element is removed here; deleting after traversal is not needed.
        // Perhaps it saves memory, or perhaps the author is just being compulsive. (The author does
        // not seem to like iterators and often deletes by index, which is not very elegant Java,
        // although I often do it too.)
        seedList.remove(j);
        j--;
    }

    return true;
}
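The comments above suggest using a Set instead of the list-plus-label trick, and avoiding index-based removal. A possible shape for that, as a sketch only: SeedExpansion, its expand method and the neighboursOf callback are hypothetical names, and the callback stands in for a range query such as epsilonRangeQuery. A Set cannot be extended while it is being iterated, so the sketch pairs it with a queue.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

/** Sketch of cluster expansion with a Set for de-duplication and a queue for traversal. */
final class SeedExpansion {
    /** Collects every object reachable from start; neighboursOf is a hypothetical range query. */
    static Set<Integer> expand(int start, int minPoints, IntFunction<List<Integer>> neighboursOf) {
        Set<Integer> members = new HashSet<>();    // replaces the label check used for de-duplication
        Deque<Integer> queue = new ArrayDeque<>(); // replaces seedList.remove(j); j--;
        members.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            List<Integer> hood = neighboursOf.apply(queue.poll());
            if (hood.size() < minPoints) continue; // not a core object: do not expand through it
            for (int p : hood) {
                if (members.add(p)) { // add() returns false for objects already collected
                    queue.add(p);
                }
            }
        }
        return members;
    }
}
```

Unlike the Weka code, this version returns the members instead of writing labels, so the handling of noise and of objects that already belong to another cluster is left to the caller.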

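Fifth, time complexity analysis

The main loop of buildClusterer is O(n). Inside expandCluster, epsilonRangeQuery (itself O(n)) is called for each element of the seed list, which at first sight gives O(n^2) per call and O(n^3) for the whole algorithm. That is only a loose upper bound, though: an object is appended to a seed list only while it is still UNCLASSIFIED and is labeled immediately afterwards, so the number of range queries over the whole run grows roughly linearly with n and the total cost is closer to O(n^2), the usual figure for DBSCAN without a spatial index. Either way, it is not very efficient.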

Optimization points:

The main loop of buildClusterer cannot be made better than O(n), but a counter of still-unmarked objects would let it exit early and gain some efficiency; expandCluster itself offers no optimization points. epsilonRangeQuery, however, can be optimized in at least two ways. The first is to use a KDTree (as the XMeans algorithm does, see the previous post) to find nearby points more efficiently. The second is to cache the distances between pairs of points, since the distance between the same pair is actually computed several times in the program.
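The caching idea can be sketched as follows. This is an assumption-laden illustration, not something Weka provides: DistanceCache is a made-up class, points are plain double[] rows, and the cache is keyed on the unordered index pair because the distance is symmetric.

```java
import java.util.HashMap;
import java.util.Map;

/** Memoises symmetric pairwise Euclidean distances; a sketch, not part of Weka. */
final class DistanceCache {
    private final double[][] points;
    private final Map<Long, Double> cache = new HashMap<>();
    private int computed = 0;

    DistanceCache(double[][] points) {
        this.points = points;
    }

    double distance(int i, int j) {
        if (i == j) return 0.0;
        int lo = Math.min(i, j);
        int hi = Math.max(i, j);
        long key = (long) lo * points.length + hi; // d(i, j) == d(j, i), so one entry per pair
        Double cached = cache.get(key);
        if (cached != null) return cached;
        double sum = 0.0;
        for (int d = 0; d < points[i].length; d++) {
            double diff = points[i][d] - points[j][d];
            sum += diff * diff;
        }
        double result = Math.sqrt(sum);
        cache.put(key, result);
        computed++;
        return result;
    }

    /** Number of distances actually computed (cache misses). */
    int computedCount() {
        return computed;
    }
}
```

The trade-off is memory: in the worst case the map holds one entry for every pair of points, so for large data sets a spatial index is likely the better first step.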

Sixth, clusterInstance

This function receives an Instance as a parameter and should return the cluster that instance belongs to. But DBSCAN does not seem to do so.

public int clusterInstance(Instance instance) throws Exception {
    if (processed_InstanceID >= database.size()) processed_InstanceID = 0;
    int cnum = (database.getDataObject(Integer.toString(processed_InstanceID++))).getClusterLabel();
    if (cnum == DataObject.NOISE)
        throw new Exception();
    else
        return cnum;
}

It simply returns, in turn, the cluster labels of the training instances with IDs 0, 1, 2, ..., regardless of the instance passed in; it is not clear what this is meant to achieve.

If the label is noise, it throws an exception directly, with no message explaining why.

The purpose of the whole function is unclear.
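For comparison, here is one way an instance-dependent assignment could look. This is not what Weka does and not the only possible convention: the sketch assumes the core objects and their cluster ids were kept after training, and assigns a new point to the cluster of the nearest core object within epsilon. AssignSketch, assign and the NOISE value are hypothetical.

```java
/** Sketch of assigning a new point to an existing DBSCAN cluster; not Weka's behaviour. */
final class AssignSketch {
    static final int NOISE = -1; // hypothetical label for "no cluster"

    /**
     * corePoints[i] is a core object found during training and coreLabels[i] its cluster id.
     * Returns the label of the nearest core object within epsilon, or NOISE if there is none.
     */
    static int assign(double[] x, double[][] corePoints, int[] coreLabels, double epsilon) {
        int best = NOISE;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < corePoints.length; i++) {
            double sum = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - corePoints[i][d];
                sum += diff * diff;
            }
            double dist = Math.sqrt(sum);
            if (dist <= epsilon && dist < bestDist) {
                bestDist = dist;
                best = coreLabels[i];
            }
        }
        return best;
    }
}
```

Returning a NOISE label instead of throwing a bare Exception also lets the caller tell "noise" apart from a real failure.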

Seventh, summary

If I have to write a summary: personally I find this code rather disappointing. Whether in the abstraction of some functions, the data structure design, or the Java code style, there is a strong "amateur" flavor, completely different from the neat code of the classifiers analyzed earlier (well, they were not written by the same person).

In addition, the behavior of the final clusterInstance is completely inconsistent with its comments; I do not know whether this is a bug, a feature, or something else.


This article is an English version of an article originally written in Chinese on aliyun.com.
