Weka Algorithm: Clusterers XMeans Source Code Analysis (I)


Several earlier posts analyzed classifier algorithms (supervised learning); this time we analyze a clustering algorithm (unsupervised learning).

I. Algorithm

The XMeans algorithm is basically the famous K-means algorithm, with a "small" improvement added on top so that it can determine the number of clusters on its own. So let us start with K-means itself. (Incidentally, Weka's native K-means implementation is the SimpleKMeans clusterer.)

K-means is a typical simple-but-effective algorithm with a very intuitive appeal. The steps are as follows:

Input: the number of clusters K, and the dataset data.

1. Randomly select K points as cluster centers.
2. For each instance in the dataset, find the nearest cluster center i and assign the instance to cluster i.

3. For each cluster, recompute the cluster center.
4. Repeat steps 2 and 3 until the iteration exit condition is met.
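To make these four steps concrete, here is a minimal self-contained sketch in plain Java. This is an illustration on a double[][] matrix, not Weka's implementation; random initialization and squared Euclidean distance are assumed as the usual defaults.

```java
import java.util.Arrays;
import java.util.Random;

/** Minimal k-means sketch mirroring the four steps above (not Weka's code). */
public class KMeansSketch {

  static int[] kmeans(double[][] data, int k, int maxIter, long seed) {
    Random rnd = new Random(seed);
    int n = data.length, d = data[0].length;

    // Step 1: pick k random instances as initial centers.
    double[][] centers = new double[k][];
    for (int j = 0; j < k; j++)
      centers[j] = data[rnd.nextInt(n)].clone();

    int[] assign = new int[n];
    for (int iter = 0; iter < maxIter; iter++) {
      // Step 2: assign each instance to its nearest center.
      boolean changed = false;
      for (int i = 0; i < n; i++) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < k; j++) {
          double dist = 0;
          for (int a = 0; a < d; a++) {
            double diff = data[i][a] - centers[j][a];
            dist += diff * diff;
          }
          if (dist < bestDist) { bestDist = dist; best = j; }
        }
        if (assign[i] != best) { assign[i] = best; changed = true; }
      }
      if (!changed) break; // Step 4: exit once the assignments converge.

      // Step 3: recompute each center as the mean of its members.
      double[][] sum = new double[k][d];
      int[] count = new int[k];
      for (int i = 0; i < n; i++) {
        count[assign[i]]++;
        for (int a = 0; a < d; a++) sum[assign[i]][a] += data[i][a];
      }
      for (int j = 0; j < k; j++)
        if (count[j] > 0)
          for (int a = 0; a < d; a++) centers[j][a] = sum[j][a] / count[j];
    }
    return assign;
  }

  public static void main(String[] args) {
    double[][] data = {{1,1},{1.2,0.9},{8,8},{8.1,7.9},{0.9,1.1},{7.8,8.2}};
    System.out.println(Arrays.toString(kmeans(data, 2, 100, 42)));
  }
}
```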

The time complexity of K-means is O(SNK), where S is the number of iterations (which depends on the choice of exit condition), N is the size of the dataset, and K is the number of clusters. As you can see, when the number of clusters is small, the algorithm is quite efficient.

However, K-means has two main disadvantages:

1. Instability: the final clustering result depends heavily on the initial cluster centers.
2. It can only handle continuous values, not discrete ones.

For problem 1 the K-means extension k-means++ was developed; for problem 2 there are the k-modes and k-prototype algorithms. Interested readers can look these up; they are not covered here.

The key points of the K-means algorithm are the following:

1. How to compute the "distance" between instances.
2. What the so-called "iteration exit condition" actually is.
3. How to determine the cluster centers.
4. What tricks the implementation uses to improve efficiency.

This post will focus on answering these four questions while analyzing the source code.

II. Source Code

weka.clusterers.XMeans inherits from RandomizableClusterer (as the name suggests, a clusterer that can be given a random-number seed), which in turn inherits from AbstractClusterer (which declares the two key abstract methods, buildClusterer and clusterInstance). So we focus on how XMeans implements buildClusterer and clusterInstance.

As can be seen from getCapabilities, XMeans can only handle continuous numeric values, dates, and missing values.
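Before diving into the source, it may help to see how the clusterer is normally driven from user code. A minimal usage sketch, with some assumptions: setMinNumClusters/setMaxNumClusters correspond to XMeans's -L/-H options, "iris.arff" is a placeholder path, and depending on your Weka version XMeans ships either in core or as a separate package.

```java
import weka.clusterers.XMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class XMeansDemo {
  public static void main(String[] args) throws Exception {
    // load a numeric dataset ("iris.arff" is a placeholder path)
    Instances data = new DataSource("iris.arff").getDataSet();
    data.deleteAttributeAt(data.numAttributes() - 1); // drop the nominal class

    XMeans xm = new XMeans();
    xm.setSeed(42);            // random seed, via RandomizableClusterer
    xm.setMinNumClusters(2);   // -L: lower bound on the number of clusters
    xm.setMaxNumClusters(6);   // -H: upper bound on the number of clusters
    xm.buildClusterer(data);   // the method analyzed below

    System.out.println("clusters found: " + xm.numberOfClusters());
    for (int i = 0; i < data.numInstances(); i++)
      System.out.println(i + " -> " + xm.clusterInstance(data.instance(i)));
  }
}
```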

1. buildClusterer

This method takes an Instances object as its parameter; its job is to train the clustering model.

```java
public void buildClusterer(Instances data) throws Exception {

    // first check whether this clusterer can handle the attributes of the data
    getCapabilities().testWithFail(data);

    // m_MinNumClusters and m_MaxNumClusters are the minimum and maximum
    // number of clusters
    if (m_MinNumClusters > m_MaxNumClusters) {
      throw new Exception("XMeans: min number of clusters "
          + "can't be greater than max number of clusters!");
    }

    m_NumSplits = 0;
    m_NumSplitsDone = 0;
    m_NumSplitsStillDone = 0;

    // a small preprocessing trick: replace missing values -- numeric
    // attributes are replaced with the mean, nominal attributes with the
    // most frequent value
    m_ReplaceMissingFilter = new ReplaceMissingValues();
    m_ReplaceMissingFilter.setInputFormat(data);
    m_Instances = Filter.useFilter(data, m_ReplaceMissingFilter);

    // seed the random number generator
    Random random0 = new Random(m_Seed);

    // the number of clusters starts at the minimum; the default is 2
    m_NumClusters = m_MinNumClusters;

    // the distance function defaults to Euclidean distance; a custom
    // function can also be passed in
    if (m_DistanceF == null) {
      m_DistanceF = new EuclideanDistance();
    }
    // these two calls have no real implementation; the intent of placing
    // them here is unclear
    m_DistanceF.setInstances(m_Instances);
    checkInstances();

    // debug-related, ignored for now
    if (m_DebugVectorsFile.exists() && m_DebugVectorsFile.isFile())
      initDebugVectorsInput();

    // allInstList stores the indices of all instances
    int[] allInstList = new int[m_Instances.numInstances()];
    for (int i = 0; i < m_Instances.numInstances(); i++) {
      allInstList[i] = i;
    }

    // copy just the header
    m_Model = new Instances(m_Instances, 0);

    // determine the initial cluster centers
    if (m_CenterInput != null) {
      // centers can be read from a file; note that m_ClusterCenters is
      // itself an Instances object, yet there seems to be no check that
      // it is structurally compatible with m_Model (the training header)
      m_ClusterCenters = new Instances(m_CenterInput);
      // a center file was supplied, so update the number of centers
      m_NumClusters = m_ClusterCenters.numInstances();
    }
    else
      // otherwise select the centers randomly, sampling with replacement
      m_ClusterCenters = makeCentersRandomly(random0, m_Instances,
                                             m_NumClusters);

    // PFD is a debug-print helper; ignored
    PFD(D_FOLLOWSPLIT, "\n*** Starting centers ");
    for (int k = 0; k < m_ClusterCenters.numInstances(); k++) {
      PFD(D_FOLLOWSPLIT, "Center " + k + ": " + m_ClusterCenters.instance(k));
    }
    PrCentersFD(D_PRINTCENTERS); // logging; ignored

    boolean finished = false;
    Instances children;

    // whether to use a KDTree. Briefly: given a set of points X and a
    // point a, finding a's nearest neighbour in X by linear scan costs
    // O(n); after building a KDTree (essentially a spatial index) the
    // cost drops to O(log n)
    if (m_UseKDTree)
      m_KDTree.setInstances(m_Instances);

    // iteration counter
    m_IterationCount = 0;

    /*
     * Training consists of two nested loops. The outer loop splits
     * cluster centers; the inner loop assigns each instance and then
     * recomputes the centers. The outer loop has two exit conditions:
     *   1. finished is true (the condition is explained below), or
     *   2. the maximum number of iterations has been reached.
     * Note that m_ClusterCenters may already exceed m_MaxNumClusters,
     * since the centers may have been read from a file; in that case the
     * loop still runs once, because finished is only evaluated at the
     * end of the loop.
     */
    while (!finished
           && !stopIteration(m_IterationCount, m_MaxIterations)) {

      PFD(D_FOLLOWSPLIT, "\nBeginning of main loop - centers:");
      PrCentersFD(D_FOLLOWSPLIT);
      PFD(D_ITERCOUNT, "\n*** 1. Improve-Params " + m_IterationCount
          + ". time");
      m_IterationCount++;

      // converged records whether two consecutive inner-loop passes
      // produced the same clustering
      boolean converged = false;

      // a one-dimensional array recording which center each instance is
      // assigned to
      m_ClusterAssignments = initAssignments(m_Instances.numInstances());

      // this two-dimensional array holds, for each center, the instances
      // that belong to it. Oddly, Weka uses plain arrays throughout
      // rather than list structures -- presumably for efficiency
      int[][] instOfCent = new int[m_ClusterCenters.numInstances()][];

      // inner-loop iteration counter
      int kMeansIteration = 0;

      PFD(D_FOLLOWSPLIT, "\nConverge in K-Means:"); // logging; ignored

      // the inner loop also has two exit conditions: the maximum number
      // of iterations, or two consecutive passes yielding the same
      // clustering
      while (!converged
             && !stopKMeansIteration(kMeansIteration, m_MaxKMeans)) {

        kMeansIteration++;
        // this assignment is overwritten immediately below, so it has no
        // real effect
        converged = true;

        // assign each instance to its nearest center. The method itself
        // is tedious but contains no algorithmic ideas, so it is not
        // analyzed here; the KDTree structure may be covered in a later
        // post
        converged = assignToCenters(m_UseKDTree ? m_KDTree : null,
                                    m_ClusterCenters, instOfCent,
                                    allInstList, m_ClusterAssignments,
                                    kMeansIteration);

        PFD(D_FOLLOWSPLIT, "\nMain loop - Assign - centers:"); // logging
        PrCentersFD(D_FOLLOWSPLIT);                            // logging

        // recompute the centers as arithmetic means. If the recomputed
        // centers equal the previous ones the method returns true:
        // identical centers twice in a row imply identical clusterings
        converged = recomputeCenters(m_ClusterCenters, // the centers
                                     instOfCent, // instances per center
                                     m_Model);   // the header
        PFD(D_FOLLOWSPLIT, "\nMain loop - Recompute - centers:");
        PrCentersFD(D_FOLLOWSPLIT);
      }
      PFD(D_FOLLOWSPLIT, "");
      PFD(D_FOLLOWSPLIT,
          "End of Part: 1. Improve-Params - conventional K-means");

      // compute the distortion of every center: m_Mle[i] is the sum of
      // the distances between the instances of cluster i and center i
      m_Mle = distortion(instOfCent, m_ClusterCenters);

      // BIC is the Bayesian Information Criterion
      // (http://baike.baidu.com/view/1425589.htm): it rewards fit to the
      // data while penalizing model size. In Weka's formulation, larger
      // is better, as the comparison newBic > m_Bic further down confirms
      m_Bic = calculateBIC(instOfCent, m_ClusterCenters, m_Mle);
      PFD(D_FOLLOWSPLIT, "m_Bic " + m_Bic);

      int currNumCent = m_ClusterCenters.numInstances();
```
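For reference, the quantities computed here correspond, up to implementation details, to the distortion and the BIC score of the original X-means paper (Pelleg & Moore, 2000):

$$D_j = \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad \mathrm{BIC}(M) = \hat{\ell}(D) - \frac{p}{2}\log R,$$

where $\hat{\ell}(D)$ is the log-likelihood of the data under an identical-spherical-Gaussian model, $p$ is the number of free parameters ($K-1$ class probabilities, $K \cdot d$ center coordinates, and one shared variance estimate), and $R$ is the number of data points. Under this formulation a larger BIC indicates a better trade-off between fit and model size.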
The split phase then tries to split every current center:

```java
      // splitCenters will hold the candidate centers; its capacity is
      // currNumCent * 2 because in the worst case every existing center
      // is split
      Instances splitCenters = new Instances(m_ClusterCenters,
                                             currNumCent * 2);

      double[] pbic = new double[currNumCent];
      double[] cbic = new double[currNumCent];

      // split each center
      for (int i = 0; i < currNumCent
             // the original source notes that adding the following
             // condition would speed things up; why it is left commented
             // out is not clear to me
             // && currNumCent + numSplits <= m_MaxNumClusters
             ; i++) {

        PFD(D_FOLLOWSPLIT, "\nsplit center " + i + " "
            + m_ClusterCenters.instance(i));
        Instance currCenter = m_ClusterCenters.instance(i);
        int[] currInstList = instOfCent[i];
        // how many instances this cluster holds
        int currNumInst = instOfCent[i].length;

        // if the cluster holds at most two instances, just duplicate the
        // center itself. With exactly two instances one could also make
        // each point a center, but simply duplicating the parent as a
        // dummy does not affect the final result
        if (currNumInst <= 2) {
          pbic[i] = Double.MAX_VALUE;
          cbic[i] = 0.0;
          // add the center itself as a dummy
          splitCenters.add(currCenter);
          splitCenters.add(currCenter);
          continue;
        }

        // m_Mle[i] is the summed distance error of cluster i; dividing
        // by the number of instances yields an average error. This is
        // not a variance in the strict sense, so the variable name is a
        // little misleading
        double variance = m_Mle[i] / (double) currNumInst;
```
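As the comment notes, "variance" is used loosely here. For comparison, the X-means paper's maximum-likelihood estimate of the shared spherical variance divides by $R - K$ over all clusters,

$$\hat{\sigma}^2 = \frac{1}{R-K} \sum_i \lVert x_i - \mu_{(i)} \rVert^2,$$

whereas the code simply takes the per-cluster average m_Mle[i] / currNumInst as the spread estimate that drives the split below.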

The split itself, and the k-means run on the two children, follow:

```java
        // split the center in two in a certain way -- a rather
        // interesting process that will be analyzed in detail later
        children = splitCenter(random0, currCenter, variance, m_Model);

        // using all instances of this cluster, run another k-means
        // against the two new child centers
        int[] oneCentAssignments = initAssignments(currNumInst);
        int[][] instOfChCent = new int[2][]; // TODO maybe split didn't work

        // the flag records whether two passes produced the same result;
        // the loop below follows essentially the same logic as the main
        // k-means loop above
        converged = false;
        int kMeansForChildrenIteration = 0;
        PFD(D_FOLLOWSPLIT, "\nConverge, K-Means for children: " + i);
        while (!converged
               && !stopKMeansIteration(kMeansForChildrenIteration,
                                       m_MaxKMeansForChildren)) {
          kMeansForChildrenIteration++;

          converged = assignToCenters(children, instOfChCent,
                                      currInstList, oneCentAssignments);
          if (!converged) {
            // the only difference from recomputeCenters is that this
            // variant does not compute convergence
            recomputeCentersFast(children, instOfChCent, m_Model);
          }
        }

        splitCenters.add(children.instance(0));
        splitCenters.add(children.instance(1));

        PFD(D_FOLLOWSPLIT, "\nConverged children ");
        PFD(D_FOLLOWSPLIT, " " + children.instance(0));
        PFD(D_FOLLOWSPLIT, " " + children.instance(1));

        // compute the BIC of the parent center and of the two child
        // centers
        pbic[i] = calculateBIC(currInstList, currCenter, m_Mle[i], m_Model);
        double[] chMLE = distortion(instOfChCent, children);
        cbic[i] = calculateBIC(instOfChCent, children, chMLE);
      } // end of the loop over all centers

      // based on the BIC values computed above, this method decides on
      // the new set of centers; exactly how it chooses will be covered
      // in detail later
      Instances newClusterCenters = null;
      newClusterCenters = newCentersAfterSplit(pbic, cbic, m_CutOffFactor,
                                               splitCenters);

      int newNumClusters = newClusterCenters.numInstances();
      // enter this block if the number of new centers differs from the
      // old one
      if (newNumClusters != m_NumClusters) {

        PFD(D_FOLLOWSPLIT, "compare with non-split");

        int[] newClusterAssignments =
          initAssignments(m_Instances.numInstances());
        int[][] newInstOfCent =
          new int[newClusterCenters.numInstances()][];

        // assign all instances to the new centers
        converged = assignToCenters(m_UseKDTree ? m_KDTree : null,
                                    newClusterCenters, newInstOfCent,
                                    allInstList, newClusterAssignments,
                                    m_IterationCount);

        double[] newMle = distortion(newInstOfCent, newClusterCenters);
        // compute the new BIC
        double newBic = calculateBIC(newInstOfCent, newClusterCenters,
                                     newMle);
        PFD(D_FOLLOWSPLIT, "newBic " + newBic);

        if (newBic > m_Bic) {
          // the new BIC is larger than the old one, meaning the new
          // clustering fits better, so replace the old with the new
          PFD(D_FOLLOWSPLIT, "*** decide for new clusters");
          m_Bic = newBic;
          m_ClusterCenters = newClusterCenters;
          m_ClusterAssignments = newClusterAssignments;
        } else {
          PFD(D_FOLLOWSPLIT, "*** keep old clusters");
        }
      }

      newNumClusters = m_ClusterCenters.numInstances();
      // the finish condition: set finished to true when the maximum
      // number of clusters is reached, or when no split happened at all
      if ((newNumClusters >= m_MaxNumClusters)
          || (newNumClusters == m_NumClusters)) {
        finished = true;
      }
      m_NumClusters = newNumClusters;
    }

    // output the model if an output writer was configured; ignored here
    if (m_ClusterCenters.numInstances() > 0 && m_CenterOutput != null) {
      m_CenterOutput.println(m_ClusterCenters.toString());
      m_CenterOutput.close();
      m_CenterOutput = null;
    }
  }
```
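The splitCenter method is deferred to a later installment of this series. Purely as a hypothetical preview of the standard X-means idea (not Weka's actual code), a center is usually split by displacing it along a random direction by an amount proportional to the cluster's spread, producing two children on opposite sides of the parent:

```java
import java.util.Random;

/** Hypothetical sketch of the usual X-means center-splitting idea:
 *  move the parent along a random unit vector, scaled by the cluster
 *  spread, in both directions. Not Weka's actual splitCenter. */
public class SplitSketch {

  static double[][] split(double[] parent, double variance, Random rnd) {
    int d = parent.length;
    double[] dir = new double[d];
    double norm = 0;
    for (int a = 0; a < d; a++) { // draw a random direction
      dir[a] = rnd.nextGaussian();
      norm += dir[a] * dir[a];
    }
    norm = Math.sqrt(norm);
    double step = Math.sqrt(variance); // scale by the spread estimate
    double[][] children = new double[2][d];
    for (int a = 0; a < d; a++) {
      double offset = step * dir[a] / norm;
      children[0][a] = parent[a] + offset; // child on one side
      children[1][a] = parent[a] - offset; // child on the other side
    }
    return children;
  }

  public static void main(String[] args) {
    double[][] ch = split(new double[]{5.0, 5.0}, 2.0, new Random(42));
    System.out.println(java.util.Arrays.deepToString(ch));
  }
}
```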


(To be continued)

