Preface
In my previous articles on clustering algorithms, the content was mainly about parametric approaches, such as C-means (including fuzzy C-means) and the Gaussian mixture model; nonparametric density estimation algorithms were not discussed. Algorithms based on parametric density estimation generally rest on an assumed family of probability distributions (such as the Gaussian distribution or the multinomial distribution). In practice, we often cannot estimate or assume in advance what kind of distribution the samples follow, and in that situation the common methods may be of little help. This article therefore presents multi-peak methods based on nonparametric density estimation, from which clustering algorithms can be derived in theory.
This article discusses clustering algorithms based on nonparametric density estimation, which mainly include the peak algorithms and the mean shift algorithm. We focus on the peak algorithms; mean shift is not discussed in detail, and interested readers can refer to the relevant literature.
The discussion of the peak algorithms covers the peak-cutting method and the peak stroke method. What the two algorithms have in common is that both compute the density of the samples, while the stroke method has some further desirable properties, which I describe in its own section.
Cutting Peak Method
In the previous article on clustering algorithms, I mentioned several problems with traditional clustering algorithms: the choice of the initial cluster centers (generally random), the number of iterations (which depends on convergence, and is problematic if the algorithm gets stuck at a local saddle point), the convergence tolerance, and so on. These introduce a degree of uncertainty, that is, poor robustness: if inappropriate initial cluster centers are selected, clustering fails and the computing resources spent on it are wasted.
In 1994, Yager and Filev proposed a new clustering method that can also be used to select the initial cluster centers. Its basic idea rests on two observations: a point that can become a cluster center must have a high density, and the distance between cluster centers must be large.
So how is the density of a sample point defined? It is given by the following formula:
$$D(x_i) = \sum_{j \neq i} \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$
where $\sigma$ is the width of the Gaussian kernel.
As the formula shows, the density of a sample point is the sum of its Gaussian-weighted distances to all the other points. The larger the density of a point, the more sample points lie around it, and the more likely it is to become a cluster center. The algorithm therefore starts by computing the density of every sample point and selecting the point with the largest density as the first cluster center. Unlike the K-means algorithm, it does not pick C centers at random (assuming the number of clusters is C), so in this respect the algorithm avoids a certain blindness.
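The density computation described above can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel of the form exp(-||x_i - x_j||^2 / (2σ^2)) summed over the other points; the function name `gaussian_density`, the toy data, and the value of `sigma` are hypothetical and not taken from the original article.

```python
import math

def gaussian_density(points, sigma):
    """Return one density per point: the sum of Gaussian kernel values
    between that point and every other point (assumed kernel form)."""
    densities = []
    for i, p in enumerate(points):
        total = 0.0
        for j, q in enumerate(points):
            if i == j:
                continue  # a point does not contribute to its own density
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        densities.append(total)
    return densities

# Toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
density = gaussian_density(points, sigma=0.5)

# The first cluster center is the sample with the largest density.
first_center = max(range(len(points)), key=lambda i: density[i])
```

On this toy data the isolated point receives a density close to zero, while the point in the middle of the small group is selected as the first center.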
So how do you choose the remaining cluster centers? Here we have to use the so-called cutting-peak idea:
$$D_{k+1}(x_i) = D_k(x_i) - D_k^{*} \exp\left(-\frac{\lVert x_i - c_k \rVert^2}{2\sigma^2}\right)$$
In the above formula, $D_1(x_i)$ is the original density of sample $x_i$, $c_k$ is the k-th cluster center, $D_k^{*} = D_k(c_k)$ is its density, and $\sigma$ is the width of the Gaussian kernel. In particular, $D_1^{*}$ is the density of the densest sample point found in the first pass (the first cluster center). The densest sample point under the clipped density is then taken as the second cluster center. The effect of clipping is to subtract a share of the first cluster center's density from every sample point, and the closer a point is to the first cluster center, the stronger the clipping. This embodies the idea of high aggregation within a class and maximal separation between classes. The algorithm keeps selecting centers in this way until the ratio of the current cluster center's density to the first cluster center's density falls below a constant $\varepsilon$, namely:
$$\frac{D_k^{*}}{D_1^{*}} < \varepsilon$$
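The center selection loop with peak cutting might look like the sketch below. It assumes the same Gaussian kernel width for the density and for the clipping step, and a ratio test against the first peak as the stopping rule; the names `peak_cutting_centers`, `max_centers` and `ratio_threshold` are hypothetical, and the toy data are made up for illustration.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def peak_cutting_centers(points, sigma, max_centers, ratio_threshold):
    """Pick cluster centers by repeatedly taking the densest point and
    clipping the density around it (sketch under stated assumptions)."""
    two_sigma_sq = 2.0 * sigma ** 2
    # Initial density: Gaussian kernel summed over the other points.
    density = [
        sum(math.exp(-sq_dist(p, q) / two_sigma_sq)
            for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    centers = []
    first_peak = None
    while len(centers) < max_centers:
        best = max(range(len(points)), key=lambda i: density[i])
        peak = density[best]
        if first_peak is None:
            first_peak = peak
        elif peak / first_peak < ratio_threshold:
            break  # remaining peaks are too small compared with the first
        centers.append(best)
        # Clip: points close to the new center lose the most density.
        density = [
            d - peak * math.exp(-sq_dist(p, points[best]) / two_sigma_sq)
            for p, d in zip(points, density)
        ]
    return centers

# Toy data: a group of three points and a group of five points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9)]
centers = peak_cutting_centers(points, sigma=0.5, max_centers=5,
                               ratio_threshold=0.3)
```

Here the loop stops after two centers even though up to five were allowed, because after two rounds of clipping the largest remaining density is far below the first peak.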
For the above algorithm, once the initial densities have been computed (which requires O(N²) pairwise distance evaluations), finding all the cluster centers has time complexity O(CN). As the description shows, the clustering is not finished at this point: the remaining points still have to be assigned to their respective classes. This is not a major issue, however, because the samples can easily be processed by combining the peak-cutting method with the methods from the previous articles, and the resulting algorithm is more robust. The choice of σ still affects the results: if σ is too large, the kernel covers too much and the clustering may not achieve its purpose; if it is too small, the algorithm may terminate early (before reaching C centers). I therefore suggest adding a routine that searches for a good value of this parameter (the drawback is that trying out candidate values makes the algorithm run longer).
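One simple way to finish the clustering is to assign every sample to its nearest selected center, as in the assignment step of C-means. The sketch below assumes this nearest-center rule; the article does not prescribe a specific rule, and the function name `assign_to_nearest_center` and the example center indices are hypothetical.

```python
def assign_to_nearest_center(points, center_indices):
    """Label each point with the position (0, 1, ...) of its nearest
    center in center_indices, using squared Euclidean distance."""
    labels = []
    for p in points:
        distances = [
            sum((a - b) ** 2 for a, b in zip(p, points[c]))
            for c in center_indices
        ]
        labels.append(distances.index(min(distances)))
    return labels

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9)]
# Assume points 3 and 0 were chosen as centers by the peak-cutting step.
labels = assign_to_nearest_center(points, center_indices=[3, 0])
```

The centers could equally be handed to C-means as its starting centers instead of being used for a single assignment pass.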
Peak Stroke Method
Although the original peak-cutting method is intuitive (meaning simply that the algorithm itself is easy to grasp) and simple, when the sample data are high-dimensional the clustering result is hard to observe directly (this is not a case of the curse of dimensionality). To solve this problem, Alex Rodriguez and Alessandro Laio proposed a new peak-based clustering approach in 2014. This algorithm can display the clustering structure of the sample points visually in two-dimensional coordinates.
How are the two axes of this coordinate system defined? As before, we first define the density of a sample point. Following Rodriguez and Laio:
$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c), \qquad \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$$
where $d_{ij}$ is the distance between samples $x_i$ and $x_j$ and $d_c$ is a cutoff distance, so $\rho_i$ counts the points that lie closer than $d_c$ to point $i$.
The above formula defines the density of a point (in effect its degree of aggregation). It forms one axis; the other axis is the so-called degree of separation:
$$\delta_i = \min_{j:\, \rho_j > \rho_i} d_{ij}$$
that is, the distance from point $i$ to the nearest point of higher density. For the point with the highest density, Rodriguez and Laio take $\delta_i = \max_j d_{ij}$.
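The two coordinates can be computed directly from the pairwise distances. The sketch below assumes the cutoff-kernel density of Rodriguez and Laio (the number of points closer than a cutoff) and Euclidean distance; the function name `density_and_separation`, the cutoff value, and the toy data are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def density_and_separation(points, cutoff):
    """Return (rho, delta): rho[i] counts the points closer than cutoff,
    delta[i] is the distance to the nearest point of higher density."""
    n = len(points)
    dist = [[euclidean(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser point gets its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9)]
rho, delta = density_and_separation(points, cutoff=0.12)
```

Plotting `delta` against `rho` gives the two-dimensional view described in the text: on this toy data, points 3 and 0 are the only ones with both a comparatively high density and a large separation.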
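Points that have both a high density and a high degree of separation are taken as the cluster centers. After the cluster centers are selected, the remaining points are assigned in order of decreasing density: the unlabeled point with the highest density is given the same class as the nearest point that has already been labeled. This is repeated until every point is labeled, like following the slope of a mountain, which is why this algorithm is called the stroke method.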
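The assignment step can be sketched as follows. It assumes the densities and the cluster centers are already known and that at least one center is given; the function name `assign_by_density` and the example inputs are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_by_density(points, rho, center_indices):
    """Label the points in order of decreasing density: each unlabeled
    point takes the label of its nearest already-labeled point."""
    labels = [None] * len(points)
    for k, c in enumerate(center_indices):
        labels[c] = k  # each center starts its own class
    order = sorted(range(len(points)), key=lambda i: rho[i], reverse=True)
    for i in order:
        if labels[i] is not None:
            continue
        labeled = [j for j in range(len(points)) if labels[j] is not None]
        nearest = min(labeled, key=lambda j: euclidean(points[i], points[j]))
        labels[i] = labels[nearest]
    return labels

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.0), (5.0, 5.1), (5.0, 4.9)]
rho = [2, 1, 1, 4, 1, 1, 1, 1]  # example densities for these points
labels = assign_by_density(points, rho, center_indices=[3, 0])
```

Because the points are visited from dense to sparse, a label spreads outward from each center through progressively less dense points, and a single pass over the data is enough.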
A description of peak clustering algorithm in machine learning