Preface
Among unsupervised machine-learning algorithms, the C-means algorithm is possibly the earliest to appear (I first saw it in Professor Jiawei Han's "Data Mining" book, last century), and together with its several variants it may also be the most widely used type of unsupervised algorithm. At that time the term "machine learning" was not yet widespread and the field was simply called data mining (Data Mining), but whatever the name, the substance is the same. It is also worth noting that in industry this algorithm is usually called the K-means algorithm (i.e. k-means).
In the earlier discussion of the Gaussian-mixture clustering method, I mentioned the intention to elaborate on this algorithm (though perhaps it is too simple, or it has already been done to death, much as the MNIST dataset has been played to shreds). In any case, it has a so-called fuzzy C-means variant, and that model is still worth some discussion.
For clustering algorithms in general, we mainly weigh two principles: the degree of aggregation within a class and the degree of separation between classes. The discussion below revolves around these two principles, as will the related density-based non-parametric clustering algorithms to be written up later, such as the density-peak and mean-shift methods. (In fact, other clustering algorithms also generally follow these two principles; supervised classification is not quite the same, because there the sample points are already identified, i.e. labeled.)
Basic C-means algorithm
By the principle of intra-class aggregation, a good partition of the samples should of course make the differences within a class as small as possible and the differences between classes as large as possible; that is what clustering asks for. Different evaluation functions for the degree of aggregation will lead to different clustering results, and it is customary to use the sample variance (squared distance) as the indicator of the difference between samples and their class.
If we describe the partition with a soft membership matrix $U = [u_{ij}]$, where $u_{ij}$ is the degree to which sample $x_i$ belongs to class $j$, we can write the following formula:
650 "this.width=650;" src= "https://s4.51cto.com/wyfs02/M01/A7/ 6b/wkiol1nmmq7yquzvaaaocbm7ins070.png-wh_500x0-wm_3-wmp_4-s_4185621968.png "title=" QQ20171017082021.png "alt=" Wkiol1nmmq7yquzvaaaocbm7ins070.png-wh_50 "/>
where $c_j$ is the center of class $j$ and $J_i$ is the evaluation function of the difference between the $i$-th sample and the classes. From formula (1) we can obviously obtain the expression of the overall objective function:
$$J \;=\; \sum_{i=1}^{N} J_i \;=\; \sum_{i=1}^{N}\sum_{j=1}^{C} u_{ij}\,\lVert x_i - c_j \rVert^2 \tag{2}$$
For the objective above we need to solve for two groups of parameters, the class centers $c$ and the membership matrix $U$. Following earlier experience in machine learning, we can alternate: fix one group of parameters, solve for the other group, and then optimize the first group in turn.
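To make this concrete, here is a minimal NumPy sketch of evaluating objective (2). The array names X (the N x d sample matrix), U (the N x C membership matrix), and centers (the C x d class centers) are my own choices, not from the original text.

```python
import numpy as np

def objective(X, U, centers):
    """Objective (2): membership-weighted sum of squared distances."""
    # dists[i, j] = ||x_i - c_j||^2 for every sample/center pair
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((U * dists).sum())
```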
First, take the derivative of the objective with respect to each class center $c_j$ and set the result to zero; we have:
$$\frac{\partial J}{\partial c_j} \;=\; -2\sum_{i=1}^{N} u_{ij}\,(x_i - c_j) \;=\; 0 \tag{3}$$
From formula (3) the following result can be obtained:
$$c_j \;=\; \frac{\sum_{i=1}^{N} u_{ij}\,x_i}{\sum_{i=1}^{N} u_{ij}} \tag{4}$$
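In code, update (4) is a one-liner; a sketch under the same array-name assumptions as above:

```python
import numpy as np

def update_centers(X, U):
    """Equation (4): each center is the membership-weighted mean of the samples."""
    return (U.T @ X) / U.sum(axis=0)[:, None]
```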
Then fix the centers $c$ (note that this is a group of $C$ vectors) and solve for the membership matrix $U$ in turn (there is no way to do this by differentiating equation (2), since it is linear in each $u_{ij}$). We can understand equation (4) this way: first randomize the membership matrix, then compute the class centers from it. How, then, do we solve for $U$? Generally, for each sample we just compare its distances to the class centers: the membership entry for the nearest center is set to 1, while all the other components are set to 0, their membership mass being moved onto that nearest component (so in the end each row of the membership matrix becomes a one-hot vector).
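That nearest-center rule can be sketched as follows (same assumed array names; the rows of the returned matrix are one-hot):

```python
import numpy as np

def update_memberships(X, centers):
    """Assign each sample entirely to its nearest center (one-hot rows)."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    U = np.zeros_like(dists)
    U[np.arange(len(X)), dists.argmin(axis=1)] = 1.0
    return U
```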
The cycle then repeats until the whole procedure converges. The steps described above are not quite the same as the usual C-means clustering algorithm introduced online; the traditional method goes as follows (a code sketch follows the list):
1. From the sample set, select C points as the initial class centers;
2. Select one of the remaining sample points, compute its distance to each center point, and assign it to the category of the nearest center;
3. Select the next sample and repeat step 2; once all the samples have been processed, if the class assignments no longer change or the iteration limit has been reached, go to step 5, otherwise go to step 4;
4. According to the current class division, recompute the center points, then repeat from step 2;
5. End the algorithm.
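A minimal sketch of this traditional procedure (function and parameter names are my own choices; convergence is detected by the assignments no longer changing):

```python
import numpy as np

def c_means(X, C, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Step 1: select C samples as the initial class centers.
    centers = X[rng.choice(len(X), size=C, replace=False)]
    labels = None
    for _ in range(max_iter):                  # step 3: iteration limit
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)      # step 2: nearest center wins
        if labels is not None and np.array_equal(new_labels, labels):
            break                              # step 3 -> 5: assignments stable
        labels = new_labels
        for j in range(C):                     # step 4: recompute the centers
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# e.g. labels, centers = c_means(np.random.rand(200, 2), C=3)
```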
It can be seen that the complexity of the above algorithm is O(NCt), where N is the number of samples, C is the number of categories, and t is the number of iterations.
So what is wrong with the C-means algorithm? One problem is that the final partition is still a hard classification; another is that the choice of initial values can lead to different results (the iteration may end up at a local optimum or saddle point) or to too slow a convergence rate.
Fuzzy C-means algorithm
In the C-means algorithm of the previous section, even if the initial elements of the membership matrix are not integers, executing the algorithm still yields a hard classification of the samples, which is not appropriate in many situations. This section therefore discusses the fuzzy C-means algorithm, i.e. a version whose output, once the clustering finishes, is a soft classification of the samples.
To achieve this, we add a constraint to equation (2) and use the method of Lagrange multipliers, so the formula takes the following form (each element of the membership matrix is raised to a power $m$, called the weighting exponent or fuzzifier, generally taken greater than 1; at $m = 1$ the problem degenerates back to the hard C-means):
$$L(U, c, \lambda) \;=\; \sum_{i=1}^{N}\sum_{j=1}^{C} u_{ij}^{\,m}\,\lVert x_i - c_j \rVert^2 \;+\; \sum_{i=1}^{N} \lambda_i \left( \sum_{j=1}^{C} u_{ij} - 1 \right) \tag{5}$$
In the expression above, the latter part is the added constraint; the constraint itself is $\sum_{j=1}^{C} u_{ij} = 1$ for every sample $i$. As in the previous section, we now take the partial derivative of formula (5) with respect to each parameter (the Lagrangian contains three groups of parameters to solve for: the centers $c$, the memberships $U$, and the multipliers $\lambda$). For the centers we have:
$$c_j \;=\; \frac{\sum_{i=1}^{N} u_{ij}^{\,m}\,x_i}{\sum_{i=1}^{N} u_{ij}^{\,m}} \tag{6}$$
This formula differs little from the previous one: the memberships are simply raised to the power $m$. For the membership parameter $u_{ij}$, however, differentiating formula (5) leads to:
$$u_{ij} \;=\; \frac{1}{\displaystyle\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{7}$$
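Update (7) translates directly into NumPy; a sketch with the same assumed names, where a small eps guards against a sample coinciding exactly with a center:

```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0, eps=1e-12):
    """Equation (7): soft memberships from pairwise distance ratios."""
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)) + eps
    # ratio[i, j, k] = (||x_i - c_j|| / ||x_i - c_k||)^(2/(m-1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)
```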
How does formula (7) come about? We have the following derivation (the book "Machine Learning" does not give the relevant steps, so I spell out the process here to avoid misunderstanding). Taking the partial derivative of (5) with respect to the multiplier $\lambda_i$ simply recovers the constraint, and taking it with respect to $u_{ij}$ lets us express $u_{ij}$ in terms of $\lambda_i$:
$$\frac{\partial L}{\partial \lambda_i} \;=\; \sum_{j=1}^{C} u_{ij} - 1 \;=\; 0 \tag{8}$$

$$\frac{\partial L}{\partial u_{ij}} \;=\; m\,u_{ij}^{\,m-1}\,\lVert x_i - c_j \rVert^2 + \lambda_i \;=\; 0 \;\Longrightarrow\; u_{ij} \;=\; \left( \frac{-\lambda_i}{m\,\lVert x_i - c_j \rVert^2} \right)^{\frac{1}{m-1}} \tag{9}$$
Combining formulas (8) and (9), where the summation index $k$ is used instead of $j$ to avoid a clash with the free index, we get:
$$\sum_{k=1}^{C} \left( \frac{-\lambda_i}{m\,\lVert x_i - c_k \rVert^2} \right)^{\frac{1}{m-1}} = 1 \;\Longrightarrow\; \left( \frac{-\lambda_i}{m} \right)^{\frac{1}{m-1}} = \frac{1}{\displaystyle\sum_{k=1}^{C} \lVert x_i - c_k \rVert^{-\frac{2}{m-1}}} \tag{10}$$
Substituting formula (10) back into formula (9) then gives the final result (7), which completes the proof.
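Putting updates (6) and (7) together gives a complete fuzzy C-means loop. A sketch reusing fuzzy_memberships from above; the random initialization, the tolerance, and the fuzzifier default m = 2 are my own choices:

```python
import numpy as np

def fuzzy_c_means(X, C, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Random row-stochastic initial membership matrix (each row sums to 1).
    U = rng.random((len(X), C))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # equation (6)
        U_new = fuzzy_memberships(X, centers, m)        # equation (7)
        converged = np.abs(U_new - U).max() < tol       # memberships stabilized?
        U = U_new
        if converged:
            break
    return U, centers
```

Unlike the hard version, each row of the returned U gives graded memberships across all C classes; taking the argmax of each row recovers a hard partition when one is needed.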