This post presents an article published in Science by Alex Rodriguez and Alessandro Laio in 2014 [13]. The basic idea of the paper is simple, yet its clustering behavior combines characteristics of both spectral clustering [11,14,15] and K-means, which really aroused my interest. The clustering algorithm rests on two basic assumptions:
- A cluster center has a higher local density than its neighboring sample points;
- The distance between a cluster center and any point of higher density is relatively large.
Based on these two assumptions, the number of cluster centers can be chosen intuitively during clustering, and outliers can be detected automatically and excluded from the analysis. Regardless of the shape of the clusters or the dimensionality of the sample points, the clustering results are satisfactory. Below I walk through the algorithm in detail, following the article, after a brief review of related clustering algorithms.
Review of clustering algorithms
It is well known that the purpose of cluster analysis is to partition samples into different clusters according to the similarity between them, but a rigorous definition of clustering does not seem to have reached consensus in academia. The survey in [17] summarizes clustering algorithms and shows that there are many I have not yet studied. In K-means and K-medoids, each cluster consists of the set of points closest to its cluster center. The objective function of both is the sum of distances from each sample point to its corresponding cluster center, and both iterate between assigning sample points to the nearest cluster center and updating the cluster centers until convergence, as shown in Figure 1. The difference is that a K-means cluster center is the mean of all sample points belonging to the cluster, whereas a K-medoids cluster center is the sample point of the cluster that minimizes the total distance to all other points in that cluster. Both algorithms are very simple to implement and are well suited to compact, roughly spherical data distributions, but their flaws are also obvious (a minimal K-means sketch is given after the list below):
- The lack of an effective mechanism to determine the number of clusters and the initial division;
- The strategy of iterative optimization cannot guarantee the global optimal solution;
- Very sensitive to outliers and noise.
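To make the K-means iteration concrete, here is a minimal sketch in C++. It is not the code referenced later in this post; the 2-D point type, the naive initialization, and the fixed iteration budget are my own illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Minimal K-means sketch: 2-D points, squared Euclidean distance,
// naive initialization, fixed iteration budget.
struct Point { double x, y; };

static double dist2(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

std::vector<int> kmeans(const std::vector<Point>& data, int k, int iters = 100) {
    std::vector<Point> centers(data.begin(), data.begin() + k);  // first k points as seeds
    std::vector<int> label(data.size(), 0);
    for (int it = 0; it < iters; ++it) {
        // Assignment step: each point goes to its nearest center.
        for (std::size_t i = 0; i < data.size(); ++i) {
            int best = 0;
            for (int j = 1; j < k; ++j)
                if (dist2(data[i], centers[j]) < dist2(data[i], centers[best]))
                    best = j;
            label[i] = best;
        }
        // Update step: each center becomes the mean of its assigned points.
        std::vector<Point> sum(k, Point{0.0, 0.0});
        std::vector<int> count(k, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            sum[label[i]].x += data[i].x;
            sum[label[i]].y += data[i].y;
            ++count[label[i]];
        }
        for (int j = 0; j < k; ++j)
            if (count[j] > 0)
                centers[j] = Point{sum[j].x / count[j], sum[j].y / count[j]};
    }
    return label;
}
```

The naive seeding used here also illustrates the first flaw in the list above: different initial centers can lead the iteration to different local optima.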
In probability-density-based clustering, we assume that the clusters are generated by different probability density functions (Equation 2), and that each sample point is drawn from these distributions with different weights. Unfortunately, solving for the parameters by maximum likelihood is usually not tractable in closed form in this kind of algorithm; only a suboptimal solution can be obtained iteratively, and expectation maximization (EM) is the most common strategy. The most typical algorithm in this family is the Gaussian mixture model (GMM) [12]. The accuracy of such an algorithm depends on how well the pre-defined probability distribution fits the training data, but the problem is that in many cases we cannot know in advance what kind of distribution, globally or locally, the data resembles.
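For concreteness, the mixture density assumed by such models can be written as below; this is the standard GMM formulation rather than an equation taken from [13].
\begin{equation}
p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k),\qquad \sum_{k=1}^{K}\pi_k=1,\ \pi_k\geq 0.
\end{equation}
EM then alternates between the E-step, which computes the posterior responsibility of each Gaussian component for each sample point, and the M-step, which re-estimates \(\pi_k\), \(\mu_k\) and \(\Sigma_k\) from those responsibilities; each iteration does not decrease the likelihood but may converge only to a local optimum.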
Clustering algorithms based on local density can easily detect clusters of arbitrary shape. DBSCAN [10] requires a user-given density threshold and neighborhood radius as parameters; sample points whose neighborhoods within that radius are less dense than the threshold are considered noise, and the remaining non-connected high-density regions are assigned to different clusters, as shown in the pseudo-code below. However, choosing an appropriate density threshold is not an easy task; recommendations for parameter estimation are described in [3]. The advantages of DBSCAN [3] can be summarized as follows:
- There is no need to specify the number of clusters beforehand;
- Clusters of arbitrary shape can be found, as shown in Figure 3;
- Noise points can be detected, and the algorithm is robust to them;
- Except for boundary points, the clustering result (core points and noise points) is independent of the order in which the sample points are visited.
The shortcomings of DBSCAN [3] can be summarized as follows:
- For boundary points, the clustering result of DBSCAN is not fully deterministic. Fortunately, this situation does not arise often and has little effect on the clustering result; if boundary points are also treated as noise, the result becomes deterministic.
- The clustering result depends on the distance measure. In high-dimensional spaces the commonly used Euclidean distance is rendered almost ineffective by the "curse of dimensionality", which makes it even harder to set a suitable search radius.
- It is not suitable for datasets with large density differences, because the appropriate search radius and density threshold differ from cluster to cluster, making parameter selection more difficult.
```
DBSCAN(D, eps, MinPts)
   // eps: search radius; MinPts: density threshold
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      NeighborPts = regionQuery(P, eps)
      if sizeof(NeighborPts) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, NeighborPts, C, eps, MinPts)

expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point Q in NeighborPts
      if Q is not visited
         mark Q as visited
         NeighborPts' = regionQuery(Q, eps)
         if sizeof(NeighborPts') >= MinPts
            NeighborPts = NeighborPts joined with NeighborPts'
      if Q is not yet member of any cluster
         add Q to cluster C

regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
```
Clustering based on mean shift [5,7,9] does not need a search radius or density threshold, but it faces the problem of bandwidth selection instead; research on how to choose the bandwidth can be found in [8,16]. The basic idea of mean shift is to move repeatedly from an initial point toward a local maximum of the kernel density estimate until convergence (a gradient ascent, as shown in Figure 4(a)); these maxima represent the modes of the distribution. In mean-shift clustering, each sample point is taken as a starting point and moved to a stationary point of the kernel density estimate, and all samples that converge to the same stationary point are assigned to the same cluster, as shown in Figure 4(b). In general, in a density-based clustering algorithm a cluster can be defined as the set of sample points that converge to the same local maximum of the density function.
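Concretely, the mean-shift iteration moves the current estimate to the kernel-weighted mean of the samples around it; this is the standard form from the cited mean-shift literature [7], not an equation from [13]. Here \(h\) is the bandwidth and \(g\) is the profile derived from the kernel:
\begin{equation}
x^{(t+1)}=\frac{\sum_{i=1}^{N}x_i\,g\!\left(\left\|\frac{x^{(t)}-x_i}{h}\right\|^2\right)}{\sum_{i=1}^{N}g\!\left(\left\|\frac{x^{(t)}-x_i}{h}\right\|^2\right)}
\end{equation}
The difference \(x^{(t+1)}-x^{(t)}\) is the mean-shift vector, which points in the direction of increasing density, so iterating this update is exactly the gradient-ascent behavior described above.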
Clustering algorithm based on density peak and distance
The assumption of this clustering algorithm is that the local density of the sample points around a cluster center is lower than the local density of the center itself, and that the distance between a cluster center and any point of higher local density is relatively large. The clustering effect is similar to DBSCAN and mean shift in that non-spherical clusters can be detected. The authors claim to find the number of clusters *automatically*; although the paper sketches an idea for doing so, the MATLAB code provided does not implement it and the cluster centers still have to be selected manually, so the word "automatically" was questioned in the comments [2]. As in mean shift, a cluster center is defined as a point of maximal local density; unlike mean shift, a cluster center is a concrete sample point, and there is no need to explicitly solve for the point of maximal density within the region defined by the kernel for every sample point. Given a dataset \(\mathcal{S}=\{x_i\mid x_i\in\mathbb{R}^n,\ i=1,\cdots,N\}\), two quantities are computed for each sample point \(x_i\): its local density \(\rho_i\) and its distance \(\delta_i\) to the nearest sample point of higher density. The local density \(\rho_i\) of \(x_i\) is defined as
\begin{equation}
\rho_i=\sum_{j=1}^{N}\chi(d_{ij}-d_c)
\end{equation}
where \(d_c\) is the cutoff distance, which is in effect a neighborhood search radius; \(d_{ij}\) is the distance between \(x_i\) and \(x_j\); and the function \(\chi(x)\) is defined as
\begin{equation}
\chi(x)=\begin{cases} 1, & \text{if \(x<0\)};\\ 0, & \text{otherwise}. \end{cases}
\end{equation}
According to the comments on the article [2], two other density measures are also worth considering. The first uses the negative mean distance from a sample point to its \(m\) nearest neighbors; the other uses a Gaussian kernel, which is more robust than the truncated distance measure:
\begin{equation}
\rho(x_i)=-\frac{1}{m}\sum_{j:j\in \mathrm{KNN}(x_i)}d_{ij} \label{eq:avg_kernel}
\end{equation}
\begin{equation}
\rho(x_i)=\sum_{j=1}^{N}\exp\left(-\frac{d_{ij}^2}{\sigma}\right) \label{eq:gauss_kernel}
\end{equation}
In fact, the \(\rho_i\) defined above is simply the number of sample points whose distance to \(x_i\) is less than \(d_c\). The distance \(\delta_i\) measures how far \(x_i\) is from the nearest sample point of higher density; if \(\rho_i\) is the global maximum, then \(\delta_i\) is the distance to the sample point farthest from \(x_i\):
\begin{equation}
\delta_i=\begin{cases} \underset{j:\rho_j>\rho_i}{\min}(d_{ij}), & \text{if \(\exists j,\ \rho_j>\rho_i\)};\\ \underset{j}{\max}(d_{ij}), & \text{otherwise}. \end{cases}
\end{equation}
For sample points whose density is a local or global maximum, \(\delta_i\) is much larger than the \(\delta_j\) values of the other sample points (Figure 5), because the former measures the distance between sample points of maximal local density, whereas the latter measures the distance from a sample point to the density peak of its own region. Consequently, sample points with large \(\delta\) values are likely to be cluster centers.
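The following is a minimal sketch of how \(\rho_i\) (with the cutoff kernel) and \(\delta_i\) could be computed from a precomputed distance matrix; the row-major flat matrix layout and the function name are my own illustrative choices, not the authors' MATLAB code.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Sketch: compute rho (cutoff-kernel density) and delta (distance to the
// nearest point of higher density) from an N x N distance matrix stored
// row-major in `dist`. Layout and names are illustrative assumptions.
void computeRhoDelta(const std::vector<double>& dist, int N, double dc,
                     std::vector<double>& rho, std::vector<double>& delta) {
    rho.assign(N, 0.0);
    delta.assign(N, 0.0);
    // rho_i = number of points closer to x_i than the cutoff distance dc.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            if (i != j && dist[i * N + j] < dc)
                rho[i] += 1.0;
    // delta_i = min distance to a point of higher density,
    // or the max distance from x_i if x_i has the highest density.
    for (int i = 0; i < N; ++i) {
        double minHigher = std::numeric_limits<double>::max();
        double maxDist = 0.0;
        bool hasHigher = false;
        for (int j = 0; j < N; ++j) {
            if (i == j) continue;
            maxDist = std::max(maxDist, dist[i * N + j]);
            if (rho[j] > rho[i]) {
                hasHigher = true;
                minHigher = std::min(minHigher, dist[i * N + j]);
            }
        }
        delta[i] = hasHigher ? minHigher : maxDist;
    }
}
```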
The paper gives an example, shown in Figure 6(a), containing 28 sample points numbered in order of decreasing density. Roughly two clusters can be observed, and points 26, 27 and 28 can be regarded as outliers. In Figure 6(b), the decision graph is drawn with the two key quantities \(\rho\) and \(\delta\) as its coordinates; sample points 1 and 10 lie in the upper right corner of the decision graph. Points 9 and 10 have very similar density values \(\rho\), but their \(\delta\) values differ greatly; the isolated points 26, 27 and 28 have large \(\delta\) values but small \(\rho\) values. In summary, only sample points with both a high \(\rho\) value and a relatively large \(\delta\) value become cluster centers.
After the cluster centers are found, the next step is to assign every remaining point to the cluster of its nearest neighbor of higher density; noise points are also temporarily assigned to clusters at this stage. In cluster analysis the reliability of such assignments is often analyzed further. DBSCAN only keeps high-reliability sample points whose density exceeds the density threshold, but this can mistake an entire low-density cluster for noise. This paper instead introduces, for each cluster, the concept of a border region: the points that belong to the cluster but lie within distance \(d_c\) of points belonging to other clusters. The density value \(\rho_b\) is computed from these border-region members. Points of the cluster whose density exceeds \(\rho_b\) form the cluster core, and the rest are regarded as the cluster halo, which includes the noise points. A clustering result from the paper is shown in Figure 7.
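Below is a sketch of this assignment step, assuming \(\rho\), the distance matrix, and the chosen center indices are already available; the names and layout are again my own assumptions. Processing points in order of decreasing density guarantees that a point's nearest higher-density neighbor has already been labeled when that point is reached.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Sketch: assign every non-center point to the cluster of its nearest
// neighbor of strictly higher density (ties ignored for simplicity).
// `dist` is the row-major N x N distance matrix.
std::vector<int> assignClusters(const std::vector<double>& dist, int N,
                                const std::vector<double>& rho,
                                const std::vector<int>& centers) {
    std::vector<int> label(N, -1);
    for (std::size_t c = 0; c < centers.size(); ++c)
        label[centers[c]] = static_cast<int>(c);

    // Visit points in order of decreasing density.
    std::vector<int> order(N);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&rho](int a, int b) { return rho[a] > rho[b]; });

    for (int idx : order) {
        if (label[idx] != -1) continue;           // cluster centers keep their label
        int nearest = -1;
        double best = std::numeric_limits<double>::max();
        for (int j : order) {
            if (rho[j] <= rho[idx]) break;        // only higher-density points qualify
            if (dist[idx * N + j] < best) {
                best = dist[idx * N + j];
                nearest = j;
            }
        }
        if (nearest != -1) label[idx] = label[nearest];
    }
    return label;
}
```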
How exactly should the neighborhood search radius \(d_c\) be chosen? \(d_c\) obviously influences the clustering result; consider the two extreme cases. If \(d_c\) is too large, the density values of all points become approximately equal and all points end up in the same cluster; if \(d_c\) is too small, each cluster contains very few sample points and a single cluster may be split into several pieces. On the other hand, no single \(d_c\) suits all datasets, since how densely the points are packed differs from one dataset to another. The authors suggest choosing \(d_c\) so that the average number of neighbors of a data point is a fraction \(\tau\) of the dataset size (\(\tau=1\%\sim 2\%\)). In this way the parameter \(\tau\) is independent of the particular dataset, and a suitable \(d_c\) can be found for each dataset individually. Combined with the MATLAB code provided by the authors, the detailed computation is as follows (a small sketch is given below): take the \(M=N(N-1)/2\) elements of the upper triangle of the symmetric distance matrix and sort them in ascending order, \(d_1\leq d_2\leq\cdots\leq d_M\). To ensure that the average number of neighbors per point is a fraction \(\tau\) of the dataset, it suffices that the proportion of distances smaller than \(d_c\) is \(\tau\), so we take \(d_c=d_{\mathrm{round}(\tau M)}\).

How is the number of clusters determined? In the MATLAB code given by the authors, the cluster centers have to be selected manually, so many readers questioned the paper's claim that "it is able to detect nonspherical clusters and to automatically find the correct number of clusters"; it does feel a bit like being misled. However, the authors do give a simple method for choosing the number of clusters; although I think the method has some problems, at least a solution is offered. According to the two basic assumptions explained above, both \(\rho\) and \(\delta\) of a cluster center are relatively large. The authors therefore introduce \(\gamma_i=\rho_i\delta_i\) for each sample point \(x_i\) and arrange all the \(\gamma_i\) in decreasing order, as shown in Figure 9(a). It would be more reasonable to normalize \(\rho\) and \(\delta\) first, which would give the two quantities equal weight in the decision; if \(\rho\) and \(\delta\) are not of the same order of magnitude, the smaller one will inevitably have little influence.

What do we do next? The authors still do not give a concrete procedure. On the whole, most \(\gamma\) values are very similar and only the few cluster centers stand out, so I think this jump can be found from an anomaly detection point of view. The simplest method is to fit a Gaussian distribution \(\mathbb{N}(\mu,\sigma^2)\) to the neighboring \(\gamma\) values; its parameters can be obtained by maximum likelihood with only two passes over the \(\gamma\) values, so the model is still very efficient. With this model, we scan the \(\gamma\) values from the tail; when we find a value whose tail probability on the left or right (the blue areas on both sides in Figure 8) falls below a threshold (for example, 0.005), an anomaly has been found, and the number of clusters can be roughly determined at that point.
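Here is the small sketch of the \(d_c\) selection mentioned above, assuming the upper-triangle pairwise distances have already been collected into a vector; the exact rounding convention is my own assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch: choose the cutoff distance d_c so that roughly a fraction tau
// (e.g. 0.02) of all pairwise distances fall below it. `pairDist` holds
// the M = N(N-1)/2 upper-triangle entries of the distance matrix.
double chooseCutoff(std::vector<double> pairDist, double tau) {
    std::sort(pairDist.begin(), pairDist.end());            // d_1 <= d_2 <= ... <= d_M
    std::size_t M = pairDist.size();
    std::size_t pos = static_cast<std::size_t>(std::round(tau * M));
    if (pos == 0) pos = 1;                                   // keep at least one neighbor
    if (pos > M) pos = M;
    return pairDist[pos - 1];                                // d_c = d_round(tau * M)
}
```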
To learn more about using a Gaussian distribution for anomaly detection, see [1]. We all know the probability density function of the Gaussian distribution, but its cumulative distribution function has no closed-form expression in terms of elementary functions, so what should we do? I searched for a long time without finding an explanation of the principles behind the numerical approximation, but I did find Java code for the Hart algorithm, which approximates the cumulative distribution function of the standard normal distribution [4]. It takes only a few lines of Java, though I do not fully understand why it works. I translated it into the following C++ code and compared its output with the Q-function table [6] on Wikipedia (note that \(1-Q(x)=\Phi(x)\)); the results matched exactly, as expected.
```cpp
#include <cmath>

// Hart algorithm: approximation of the standard normal CDF Phi(x).
double cdfOfNormalDistribution(double x) {
    const double p0 = 220.2068679123761,  p1 = 221.2135961699311;
    const double p2 = 112.0792914978709,  p3 = 33.91286607838300;
    const double p4 = 6.373962203531650,  p5 = 0.7003830644436881;
    const double p6 = 0.03326249659989109;
    const double q0 = 440.4137358247552,  q1 = 793.8265125199484;
    const double q2 = 637.3336333788311,  q3 = 296.5642487796737;
    const double q4 = 86.78073220294608,  q5 = 16.06417757920695;
    const double q6 = 1.755667163182642,  q7 = 0.08838834764831844;
    const double cutoff  = 7.071;               // 10/sqrt(2)
    const double root2pi = 2.506628274631001;   // sqrt(2*pi)

    double xabs = std::fabs(x);
    double res  = 0.0;
    if (x > 37.0) {
        res = 1.0;
    } else if (x < -37.0) {
        res = 0.0;
    } else {
        double expntl = std::exp(-0.5 * xabs * xabs);
        double pdf    = expntl / root2pi;
        if (xabs < cutoff) {
            // Rational (polynomial ratio) approximation of the upper tail.
            res = expntl
                * ((((((p6 * xabs + p5) * xabs + p4) * xabs + p3) * xabs + p2) * xabs + p1) * xabs + p0)
                / (((((((q7 * xabs + q6) * xabs + q5) * xabs + q4) * xabs + q3) * xabs + q2) * xabs + q1) * xabs + q0);
        } else {
            // Continued-fraction approximation for the far tail
            // (tail constant 0.65 restored from the standard Hart formulation).
            res = pdf / (xabs + 1.0 / (xabs + 2.0 / (xabs + 3.0 / (xabs + 4.0 / (xabs + 0.65)))));
        }
        if (x >= 0.0)
            res = 1.0 - res;   // res so far is the upper tail Q(|x|)
    }
    return res;
}
```
Furthermore, the authors claim that for data drawn from a random uniform distribution the \(\gamma\) values obey a power-law distribution, whereas for datasets with genuine cluster centers this is not the case. Many phenomena do approximately follow a power law, especially when most events are small and a few are very large, but the authors give no source for this conclusion, so this point has also been questioned by many readers. My guess is that the authors summarized it from a few experiments; it is an empirical observation from incomplete statistics rather than a result with a solid theoretical basis. That is, \(\gamma\approx cr^{-k}+\epsilon\), where \(r\) is the rank of \(\gamma\); therefore \(\log\gamma\) and \(\log r\) should be approximately linear, as shown in Figure 9(b). If the authors' conjecture is correct, we might as well plot \(\log\gamma\) against \(\log r\) before clustering to judge how hard the dataset is to cluster, or how reliable the clustering result on that dataset will be (a small sketch follows).
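If one wanted to check that conjecture numerically, a rough approach is to fit a line to \((\log r,\log\gamma)\) by least squares and look at how well it fits; this is purely my own illustrative check, not a procedure from the paper.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Sketch: fit log(gamma) ~ a + b*log(rank) by ordinary least squares.
// A good linear fit would be consistent with gamma ~ c * r^{-k};
// the returned slope b estimates -k. Assumes at least two positive values.
double powerLawSlope(std::vector<double> gamma) {
    std::sort(gamma.begin(), gamma.end(), std::greater<double>());  // rank 1 = largest
    double sx = 0, sy = 0, sxx = 0, sxy = 0, n = 0;
    for (std::size_t i = 0; i < gamma.size(); ++i) {
        if (gamma[i] <= 0) continue;                       // log undefined otherwise
        double lx = std::log(static_cast<double>(i + 1));  // log(rank)
        double ly = std::log(gamma[i]);                    // log(gamma)
        sx += lx; sy += ly; sxx += lx * lx; sxy += lx * ly;
        n += 1.0;
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);      // OLS slope
}
```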
Finally, based on the idea of this article, I implemented a fairly complete version of the clustering algorithm in C++ and uploaded it to GitHub as my first repository. Please download it from https://github.com/jeromewang-github/cluster-science2014; you are welcome to find bugs and suggest improvements!
References
- [1] Anomaly detection. http://www.holehouse.org/mlclass/15_Anomaly_Detection.html.
- [2] Comments on "Clustering by fast search and find of density peaks". http://comments.sciencemag.org/content/10.1126/science.1242072.
- [3] DBSCAN. http://en.wikipedia.org/wiki/DBSCAN.
- [4] Hart algorithm for normal CDF. http://www.onedigit.org/home/quantitative-finance/hart-algorithm-for-normal-cdf.
- [5] Mean-shift. http://en.wikipedia.org/wiki/Mean-shift.
- [6] Q-function. http://en.wikipedia.org/wiki/Q-function.
- [7] Dorin Comaniciu and Peter Meer. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5): 603–619, 2002.
- [8] Dorin Comaniciu, Visvanathan Ramesh, and Peter Meer. The variable bandwidth mean shift and data-driven scale selection. In Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), volume 1, pages 438–445. IEEE, 2001.
- [9] Konstantinos G. Derpanis. Mean shift clustering. http://www.cse.yorku.ca/~kosta/CompVis_Notes/mean_shift.pdf, 2005.
- [10] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226–231, 1996.
- [11] Andrew Y. Ng, Michael I. Jordan, Yair Weiss, et al. On spectral clustering: analysis and an algorithm. Advances in Neural Information Processing Systems, 2: 849–856, 2002.
- [12] Douglas Reynolds. Gaussian mixture models. Encyclopedia of Biometrics, pages 659–663, 2009.
- [13] Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 344(6191): 1492–1496, 2014.
- [14] Aarti Singh. Spectral clustering. https://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture21_2.pdf.
- [15] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4): 395–416, 2007.
- [16] Jue Wang, Bo Thiesson, Yingqing Xu, and Michael Cohen. Image and video segmentation by anisotropic kernel mean shift. In Computer Vision – ECCV 2004, pages 238–249. Springer, 2004.
- [17] Rui Xu, Donald Wunsch, et al. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3): 645–678, 2005.