I do not know why the previous article ended up garbled. I will add a summary at the end of this article to recap the previous part.
First, we will look at two methods. The first is splitCenter, which is used to split an existing center; the second is newCentersAfterSplit, which uses the BIC of the split candidates to compute the new set of cluster centers. This split mechanism is the biggest difference between XMeans and KMeans.
I. splitCenter
protected Instances splitCenter(Random random, Instance center, double variance, Instances model) throws Exception {
    m_NumSplits++;
    AlgVector r = null;
    Instances children = new Instances(model, 2);
    if (m_DebugVectorsFile.exists() && m_DebugVectorsFile.isFile()) {
        Instance nextVector = getNextDebugVectorsInstance(model);
        PFD(D_RANDOMVECTOR, "Random Vector from file " + nextVector);
        r = new AlgVector(nextVector);
    } else {
        // model is only the header; r is a random vector with each dimension between 0 and 1
        r = new AlgVector(model, random);
    }
    // rescale r to length sqrt(variance); variance is the average deviation of the
    // cluster's instances from the cluster center
    r.changeLength(Math.pow(variance, 0.5));
    PFD(D_RANDOMVECTOR, "random vector * variance " + r);

    // first child: center + r
    AlgVector c = new AlgVector(center);
    AlgVector c2 = (AlgVector) c.clone();
    c = c.add(r);
    Instance newCenter = c.getAsInstance(model, random);
    children.add(newCenter);
    PFD(D_FOLLOWSPLIT, "first child " + newCenter);

    // second child: center - r
    c2 = c2.substract(r);
    newCenter = c2.getAsInstance(model, random);
    children.add(newCenter);
    PFD(D_FOLLOWSPLIT, "second child " + newCenter);

    return children;
}
Result after execution: (figure omitted)
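To make the geometry of the split concrete, here is a minimal, WEKA-independent sketch of the same idea (hypothetical helper, plain double[] arrays instead of AlgVector/Instance): draw a random direction, rescale it to length sqrt(variance), and place the two children at center + r and center - r.

import java.util.Random;

public class SplitSketch {
    /** Returns two child centers: center + r and center - r, where |r| = sqrt(variance). */
    static double[][] splitCenter(Random random, double[] center, double variance) {
        int d = center.length;
        double[] r = new double[d];
        double norm = 0.0;
        for (int i = 0; i < d; i++) {              // random direction, each dimension in [0, 1)
            r[i] = random.nextDouble();
            norm += r[i] * r[i];
        }
        norm = Math.sqrt(norm);
        double scale = Math.sqrt(variance) / norm; // rescale the vector to length sqrt(variance)
        double[] child1 = new double[d];
        double[] child2 = new double[d];
        for (int i = 0; i < d; i++) {
            double ri = r[i] * scale;
            child1[i] = center[i] + ri;            // first child:  center + r
            child2[i] = center[i] - ri;            // second child: center - r
        }
        return new double[][] { child1, child2 };
    }

    public static void main(String[] args) {
        double[][] children = splitCenter(new Random(1), new double[] {2.0, 3.0}, 4.0);
        System.out.println(java.util.Arrays.deepToString(children));
    }
}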
II. newCentersAfterSplit
protected Instances newCentersAfterSplit(double[] pBIC, double[] cBIC, double cutoffFactor, Instances splitCenters) {
    boolean splitPerCutoff = false;
    boolean takeSomeAway = false;
    // this array records which centers win the split
    boolean[] splitWon = initBoolArray(m_ClusterCenters.numInstances());
    int numToSplit = 0;
    Instances newCenters = null;

    for (int i = 0; i < cBIC.length; i++) {
        if (cBIC[i] > pBIC[i]) {
            // if the children's BIC is larger than the parent's, split this center.
            // Why is a larger BIC better rather than a smaller one? WEKA's BIC formula
            // does not seem to produce negative values.
            splitWon[i] = true;
            numToSplit++;
            PFD(D_FOLLOWSPLIT, "center " + i + " decide for children");
        } else {
            // splitWon[i] defaults to false, so no assignment is needed here
            PFD(D_FOLLOWSPLIT, "center " + i + " decide for parent");
        }
    }

    if ((numToSplit == 0) && (cutoffFactor > 0)) {
        // if no center needs to be split, cutoffFactor decides how many to split anyway;
        // this is done to avoid getting stuck locally
        splitPerCutoff = true;
        numToSplit = (int) ((double) m_ClusterCenters.numInstances() * m_CutOffFactor);
    }

    // compute pBIC - cBIC and sort, so that the centers whose children improve the BIC
    // the most come first and are split first
    double[] diff = new double[m_NumClusters];
    for (int j = 0; j < diff.length; j++) {
        diff[j] = pBIC[j] - cBIC[j];
    }
    int[] sortOrder = Utils.sort(diff);

    // check the maximum number of splits
    int possibleToSplit = m_MaxNumClusters - m_NumClusters;
    if (possibleToSplit > numToSplit) {
        // more splits are possible than requested, so only split numToSplit centers
        possibleToSplit = numToSplit;
    } else {
        takeSomeAway = true;
    }

    if (splitPerCutoff) {
        // splitPerCutoff means cutoffFactor decides the number of splits; all entries of
        // splitWon are still false, so a certain number must be set to true
        for (int j = 0; (j < possibleToSplit) && (cBIC[sortOrder[j]] > 0.0); j++) {
            splitWon[sortOrder[j]] = true;
        }
        m_NumSplitsStillDone += possibleToSplit;
    } else {
        // take some splits away if the max number of clusters would be exceeded
        if (takeSomeAway) {
            // fewer splits are allowed than are marked in splitWon, so some of the
            // true entries must be reset to false
            int count = 0;
            int j = 0;
            for (; j < splitWon.length && count < possibleToSplit; j++) {
                if (splitWon[sortOrder[j]] == true) count++;
            }
            while (j < splitWon.length) {
                splitWon[sortOrder[j]] = false;
                j++;
            }
        }
    }

    // perform the split: where splitWon is true take the split centers, otherwise keep the original center
    if (possibleToSplit > 0)
        newCenters = newCentersAfterSplit(splitWon, splitCenters);
    else
        newCenters = m_ClusterCenters;
    return newCenters;
}
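On the question raised in the comment above (why a larger BIC is better): in the original X-means paper (Pelleg & Moore, 2000), the BIC score of a model is a penalized log-likelihood, and the model with the higher score wins. A rough transcription taken from the paper, not from the WEKA source, is:

\mathrm{BIC}(M_j) = \hat{l}_j(D) - \frac{p_j}{2}\log R

where \hat{l}_j(D) is the maximized log-likelihood of the data D under model M_j, p_j is the number of free parameters, and R is the number of points. A center is split when the two-child model scores a higher BIC on that cluster's points than the single parent center does.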
III. Summary
First, let's review the entire algorithm flow (a code-style skeleton follows the list):
1. Randomly select the initial cluster centers.
2. Assign each instance to the nearest cluster center.
3. Recalculate the new cluster centers.
4. Try to split the new cluster centers.
5. Return to step 2; if two consecutive iterations produce the same result, the process ends.
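The skeleton below is a pseudocode-style outline of that loop, not the actual WEKA implementation; the helper methods (pickRandomCenters, assignToNearestCenter, recomputeCenters, trySplitCenters) are hypothetical placeholders, and only the control flow matters.

// Outline of the X-means main loop described above.
public abstract class XMeansOutline {
    abstract double[][] pickRandomCenters(double[][] data, int k);                                    // step 1
    abstract int[] assignToNearestCenter(double[][] data, double[][] centers);                        // step 2
    abstract double[][] recomputeCenters(double[][] data, int[] assignment, int k);                   // step 3
    abstract double[][] trySplitCenters(double[][] data, int[] assignment, double[][] centers, int maxK); // step 4

    double[][] cluster(double[][] data, int minK, int maxK, int maxOuterIterations) {
        double[][] centers = pickRandomCenters(data, minK);
        for (int iter = 0; iter < maxOuterIterations; iter++) {                       // outer iteration
            int[] assignment = assignToNearestCenter(data, centers);                  // inner iteration runs in here
            double[][] recomputed = recomputeCenters(data, assignment, centers.length);
            double[][] afterSplit = trySplitCenters(data, assignment, recomputed, maxK);
            boolean noSplit = (afterSplit.length == centers.length);                  // step 5: nothing was split
            centers = afterSplit;
            if (noSplit || centers.length >= maxK) {
                break;                                                                // exit: no split or max clusters reached
            }
        }
        return centers;
    }
}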
It can be seen that, compared with traditional KMeans, the most important improvement of XMeans is that it can automatically determine the number of cluster centers by splitting them intelligently.
Finally, let's answer the questions raised at the beginning of the first article (even though that article is now garbled):
1. How is the "distance" between instances calculated?
A: Euclidean distance is used by default, but a custom distance function can be supplied to compute the distance between any two instances.
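For reference, the default case boils down to plain Euclidean distance over numeric attributes, as in the sketch below; this is a simplified stand-in, not WEKA's EuclideanDistance class, which also normalizes attributes and handles missing values.

public class DistanceSketch {
    // Euclidean distance between two instances represented as plain attribute vectors.
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        System.out.println(euclideanDistance(new double[] {0, 0}, new double[] {3, 4})); // prints 5.0
    }
}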
2. What are the iteration exit conditions?
A: There are two layers of iteration: an outer iteration and an inner iteration. Each outer iteration produces a different set of cluster centers, and each inner iteration assigns the instances to the cluster centers.
The outer iteration has three exit conditions: (1) the maximum number of iterations is reached; (2) two consecutive outer iterations produce the same number of cluster centers, i.e., no cluster center was split; (3) the maximum number of clusters is reached.
The inner iteration has two exit conditions: (1) every instance is assigned to the same cluster center in two consecutive inner iterations; (2) the maximum number of iterations is reached.
3. How is a cluster center determined?
A: The cluster center is the attribute-wise arithmetic mean of the instances assigned to it.
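In code terms, recomputing a center could look like the following simplified sketch (numeric attributes only, hypothetical helper, not WEKA's implementation):

public class CenterSketch {
    // The new center of a cluster is the attribute-wise arithmetic mean of its members.
    static double[] meanCenter(java.util.List<double[]> members, int numAttributes) {
        double[] center = new double[numAttributes];
        for (double[] instance : members) {
            for (int a = 0; a < numAttributes; a++) {
                center[a] += instance[a];
            }
        }
        for (int a = 0; a < numAttributes; a++) {
            center[a] /= members.size();   // arithmetic mean per attribute
        }
        return center;
    }

    public static void main(String[] args) {
        java.util.List<double[]> members = java.util.Arrays.asList(new double[] {1, 2}, new double[] {3, 4});
        System.out.println(java.util.Arrays.toString(meanCenter(members, 2))); // [2.0, 3.0]
    }
}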
4. Are there any tricks used to improve efficiency in the implementation?
A: A KDTree is used to find the nearest center for each instance.
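To see what the KD-tree speeds up, the sketch below shows the naive baseline it replaces: a linear scan over every center for every instance (hypothetical helper, not WEKA's KDTree code).

public class NearestCenterSketch {
    // Linear scan over all centers to find the nearest one for a given instance.
    // Squared Euclidean distance is enough for picking the minimum.
    static int nearestCenterIndex(double[] instance, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int a = 0; a < instance.length; a++) {
                double d = instance[a] - centers[c][a];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}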
WEKA Algorithm Clusterers - XMeans Source Code Analysis (2)