When reprinting, please cite the source: http://www.cnblogs.com/tiaozistudy/p/twostep_cluster_algorithm.html
The two-step clustering algorithm is a clustering algorithm used in SPSS Modeler. It is an improved version of the BIRCH hierarchical clustering algorithm: it can cluster datasets with mixed attribute types, and it adds a mechanism for automatically determining the optimal number of clusters, which makes the method much more practical. Based on the study of literature [1] and the "IBM SPSS Modeler Algorithms Guide", this article incorporates my own understanding and describes the process and details of the two-step clustering algorithm in greater depth. Before reading this article, it is recommended to first study the BIRCH hierarchical clustering algorithm and the log-likelihood distance.
The two-step clustering algorithm, as the name implies, consists of two stages:
1) Pre-clustering stage. Using the CF-tree growth idea of the BIRCH algorithm, the data points in the dataset are read in one by one; while the CF tree grows, data points falling in dense regions are grouped together, producing many small sub-clusters.
2) Clustering stage. Taking the sub-clusters produced by the pre-clustering stage as objects, an agglomerative hierarchical clustering method is used to merge the sub-clusters one by one until the desired number of clusters is reached.
The key techniques of the two-step clustering algorithm:
Figure 1: Key technologies and processes for two-step clustering algorithm
Suppose the dataset $\mathfrak D$ contains $N$ data objects/data points $\{\vec x_n : n = 1, \dots, N\}$, each characterized by $D$ attributes, of which $D_1$ are continuous attributes and $D_2$ are categorical attributes. Write $\vec x_n = (\tilde x_{n1}, \dots, \tilde x_{nD_1}, \ddot x_{n1}, \dots, \ddot x_{nD_2})$, where $\tilde x_{ns}$ is the value of the $n$-th data object under the $s$-th continuous attribute and $\ddot x_{nt}$ is its value under the $t$-th categorical attribute; the $t$-th categorical attribute is assumed to have $\epsilon_t$ possible values. Let $\mathbf C_J = \{C_1, \dots, C_J\}$ denote a clustering of the dataset $\mathfrak D$ into $J$ clusters, where $C_j$ is the $j$-th cluster of $\mathbf C_J$. Without loss of generality, suppose cluster $C_j$ contains $N_j$ data objects $\{\vec x_{jn} : n = 1, \dots, N_j\}$.
1. Pre-cluster stage
In this stage, the data points of the dataset $\mathfrak D$ are inserted one by one into a cluster feature tree (CF tree) so that the CF tree grows. When the size of the CF tree exceeds the allowed size, potential outliers on the current CF tree are removed, the space threshold is increased, and the CF tree is rebuilt (slimmed); the removed potential outliers that can now be re-inserted without increasing the size of the CF tree are then inserted back into the tree. When all data points have been traversed, the potential outliers that still cannot be inserted into the CF tree are regarded as true outliers. Finally, the clustering features of the sub-clusters corresponding to the leaf entries of the final CF tree are output to the next stage of the algorithm. The process of this stage is as follows:
Figure 2: Pre-cluster stage process
1.1 Clustering features
First, as in the BIRCH algorithm, define the clustering feature of a cluster in the dataset $\mathfrak D$ (or of any collection of data points):
Definition 1: The clustering feature $\vec{CF}_j$ of cluster $C_j$ is defined as the 4-tuple $\vec{CF}_j = \langle N_j, \vec{\tilde\Lambda}_j, \vec{\tilde\Sigma}_j, \vec{\ddot N}_j \rangle$, where $$ \begin{equation} \vec{\tilde\Lambda}_j = (\tilde\Lambda_{j1}, \dots, \tilde\Lambda_{jD_1})^T = \left( \sum_{n=1}^{N_j} \tilde x_{jn1}, \dots, \sum_{n=1}^{N_j} \tilde x_{jnD_1} \right)^T \end{equation} $$ is the linear sum of the attribute values of the data points in cluster $C_j$ over each continuous attribute, $$ \begin{equation} \vec{\tilde\Sigma}_j = (\tilde\Sigma_{j1}, \dots, \tilde\Sigma_{jD_1})^T = \left( \sum_{n=1}^{N_j} \tilde x_{jn1}^2, \dots, \sum_{n=1}^{N_j} \tilde x_{jnD_1}^2 \right)^T \end{equation} $$ is the sum of squares of the attribute values of the data points in cluster $C_j$ over each continuous attribute, and $$ \begin{equation} \vec{\ddot N}_j = \left( (\vec{\ddot N}_{j1})^T, \dots, (\vec{\ddot N}_{jD_2})^T \right)^T \end{equation} $$ where $\vec{\ddot N}_{jt} = (\ddot N_{jt1}, \dots, \ddot N_{j,t,\epsilon_t-1})^T$ is the vector of counts of data points in cluster $C_j$ taking each possible value of the $t$-th categorical attribute, i.e. $\ddot N_{jtk}\;(k = 1, \dots, \epsilon_t - 1)$ is the number of data points in cluster $C_j$ that take the $k$-th value of the $t$-th categorical attribute. Since $\ddot N_{jt\epsilon_t} = N_j - \sum_{k=1}^{\epsilon_t-1} \ddot N_{jtk}$, the vector $\vec{\ddot N}_{jt}$ has dimension $\epsilon_t - 1$, and the vector $\vec{\ddot N}_j$ has dimension $\sum_{t=1}^{D_2} (\epsilon_t - 1)$.
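To make Definition 1 concrete, here is a minimal Python sketch (my own illustration, not part of the SPSS Modeler implementation) that computes the clustering feature of a small mixed-attribute cluster; for simplicity it stores all $\epsilon_t$ counts per categorical attribute, including the redundant last one, and the names `CF`, `lin_sum`, `sq_sum` and `cat_counts` are assumptions.

```python
import numpy as np

class CF:
    """Clustering feature of one (sub-)cluster, per Definition 1."""
    def __init__(self, n, lin_sum, sq_sum, cat_counts):
        self.n = n                    # N_j: number of data points in the cluster
        self.lin_sum = lin_sum        # Lambda_j: linear sum per continuous attribute (length D1)
        self.sq_sum = sq_sum          # Sigma_j: sum of squares per continuous attribute (length D1)
        self.cat_counts = cat_counts  # one count vector per categorical attribute (length eps_t each)

    @classmethod
    def from_points(cls, cont, cat, n_values):
        """cont: (N_j, D1) continuous values; cat: (N_j, D2) integer codes in 0..eps_t-1;
        n_values: the number of possible values eps_t of each categorical attribute."""
        cont = np.asarray(cont, dtype=float)
        cat = np.asarray(cat, dtype=int)
        counts = [np.bincount(cat[:, t], minlength=eps) for t, eps in enumerate(n_values)]
        return cls(len(cont), cont.sum(axis=0), (cont ** 2).sum(axis=0), counts)

# e.g. two points with D1 = 2 continuous and D2 = 2 categorical attributes (3 and 2 values):
cf_example = CF.from_points(cont=[[1.0, 2.5], [3.0, 4.5]], cat=[[0, 1], [2, 1]], n_values=[3, 2])
```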
Like the clustering feature in the BIRCH algorithm, the clustering feature defined above is additive:
Theorem 1: Let cluster $C_j$ have clustering feature $\vec{CF}_j = \langle N_j, \vec{\tilde\Lambda}_j, \vec{\tilde\Sigma}_j, \vec{\ddot N}_j \rangle$ and cluster $C_{j'}$ have clustering feature $\vec{CF}_{j'} = \langle N_{j'}, \vec{\tilde\Lambda}_{j'}, \vec{\tilde\Sigma}_{j'}, \vec{\ddot N}_{j'} \rangle$. If clusters $C_j$ and $C_{j'}$ are merged into a larger cluster $C_{\langle j, j' \rangle}$, its clustering feature is $$ \begin{equation} \vec{CF}_{\langle j, j' \rangle} = \langle N_j + N_{j'},\; \vec{\tilde\Lambda}_j + \vec{\tilde\Lambda}_{j'},\; \vec{\tilde\Sigma}_j + \vec{\tilde\Sigma}_{j'},\; \vec{\ddot N}_j + \vec{\ddot N}_{j'} \rangle \end{equation} $$
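Continuing the sketch above (still my own illustration, assuming the `CF` class defined earlier), Theorem 1 amounts to component-wise addition of the two tuples:

```python
def merge_cf(a: CF, b: CF) -> CF:
    """Clustering feature of the merged cluster C_<j,j'> (Theorem 1): component-wise sums."""
    return CF(a.n + b.n,
              a.lin_sum + b.lin_sum,
              a.sq_sum + b.sq_sum,
              [ca + cb for ca, cb in zip(a.cat_counts, b.cat_counts)])
```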
From the above definition, the clustering feature in the two-step clustering algorithm differs from the clustering feature defined in the BIRCH algorithm, but its function is the same: the divergence between clusters can be expressed using the statistics stored in the clustering features, so the clustering features can be used to compress the data during the two-step clustering algorithm and reduce the memory it consumes. The difference is that the BIRCH algorithm uses common distances such as the Euclidean distance and the Manhattan distance, whereas the two-step clustering algorithm uses the log-likelihood distance in order to handle mixed attributes. For cluster $C_j$, define the parameter $$ \begin{equation} \zeta_j = -N_j \left( \frac12 \sum_{s=1}^{D_1} \ln (\hat\sigma^2_{js} + \hat\sigma_s^2) + \sum_{t=1}^{D_2} \hat E_{jt} \right) \end{equation} $$ where $\hat\sigma^2_{js} = \frac1{N_j} \sum_{n=1}^{N_j} (\tilde x_{jns} - \bar{\tilde x}_{js})^2$ is the variance of the data points in cluster $C_j$ under the $s$-th continuous attribute (with $\bar{\tilde x}_{js} = \frac1{N_j} \sum_{n=1}^{N_j} \tilde x_{jns}$); $\hat\sigma_s^2 = \frac1N \sum_{n=1}^N (\tilde x_{ns} - \bar{\tilde x}_s)^2$ is the variance under the $s$-th continuous attribute estimated from all data points of the dataset $\mathfrak D$, which can be evaluated once at the start of the algorithm and treated as a constant; and $\hat E_{jt} = -\sum_{k=1}^{\epsilon_t} \frac{\ddot N_{jtk}}{N_j} \ln \frac{\ddot N_{jtk}}{N_j}$ is the information entropy of cluster $C_j$ under the $t$-th categorical attribute.
It is easy to verify that $$ \begin{equation} \begin{split} \hat\sigma^2_{js} & = \frac1{N_j} \sum_{n=1}^{N_j} (\tilde x_{jns} - \bar{\tilde x}_{js})^2 \\ & = \frac1{N_j} \left[ \sum_{n=1}^{N_j} \tilde x_{jns}^2 - \frac1{N_j} \left( \sum_{n=1}^{N_j} \tilde x_{jns} \right)^2 \right] \\ & = \frac{\tilde\Sigma_{js}}{N_j} - \left( \frac{\tilde\Lambda_{js}}{N_j} \right)^2 \end{split} \end{equation} $$ $$ \begin{equation} \begin{split} \hat E_{jt} & = -\sum_{k=1}^{\epsilon_t} \frac{\ddot N_{jtk}}{N_j} \ln \frac{\ddot N_{jtk}}{N_j} \\ & = -\sum_{k=1}^{\epsilon_t-1} \frac{\ddot N_{jtk}}{N_j} \ln \frac{\ddot N_{jtk}}{N_j} - \frac{\ddot N_{jt\epsilon_t}}{N_j} \ln \frac{\ddot N_{jt\epsilon_t}}{N_j} \\ & = -\frac1{N_j} \vec{\ddot N}_{jt}^T \ln \frac{\vec{\ddot N}_{jt}}{N_j} - \left( 1 - \frac{\vec 1^T \vec{\ddot N}_{jt}}{N_j} \right) \ln \left( 1 - \frac{\vec 1^T \vec{\ddot N}_{jt}}{N_j} \right) \end{split} \end{equation} $$ where $\vec 1 = (1, \dots, 1)^T$ is the all-ones vector and the logarithm is applied element-wise. Therefore the parameter $\zeta_j$ in formula (5) can be computed entirely from the clustering feature $\vec{CF}_j = \langle N_j, \vec{\tilde\Lambda}_j, \vec{\tilde\Sigma}_j, \vec{\ddot N}_j \rangle$ of cluster $C_j$.
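Accordingly, $\zeta_j$ can be evaluated from the clustering feature alone. A sketch of this computation (my own, building on the `CF` class above; `global_var` is assumed to hold the dataset-level variances $\hat\sigma_s^2$):

```python
def zeta(cf: CF, global_var: np.ndarray) -> float:
    """Parameter zeta_j of formula (5), computed purely from the clustering feature."""
    mean = cf.lin_sum / cf.n
    var = cf.sq_sum / cf.n - mean ** 2                 # sigma^2_js, formula (6)
    cont_term = 0.5 * np.log(var + global_var).sum()   # continuous-attribute part
    ent_term = 0.0
    for counts in cf.cat_counts:                       # entropy E_jt, formula (7)
        p = counts / cf.n
        p = p[p > 0]                                   # 0 * ln 0 is treated as 0
        ent_term += -(p * np.log(p)).sum()
    return -cf.n * (cont_term + ent_term)
```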
The log-likelihood distance between cluster $C_j$ and cluster $C_{j'}$ is defined as $$ \begin{equation} d(C_j, C_{j'}) = \zeta_j + \zeta_{j'} - \zeta_{\langle j, j' \rangle} \end{equation} $$ so once the clustering features of any two clusters are known, their log-likelihood distance can be computed directly from them.
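With the two helpers above, formula (8) becomes a short function (again a sketch under the same assumptions):

```python
def log_likelihood_distance(cf_a: CF, cf_b: CF, global_var: np.ndarray) -> float:
    """Log-likelihood distance d(C_j, C_j') of formula (8)."""
    merged = merge_cf(cf_a, cf_b)
    return zeta(cf_a, global_var) + zeta(cf_b, global_var) - zeta(merged, global_var)
```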
1.2 Clustering feature tree (CF tree)
The CF tree in the two-step clustering algorithm is almost identical to that in the BIRCH algorithm, and its growth and rebuilding (slimming) processes can be carried over from the BIRCH algorithm directly. This section mainly describes what is unique to the two-step clustering algorithm and different from BIRCH; it is recommended to first study the relevant parts of the BIRCH hierarchical clustering algorithm.
1) The clustering feature of each node entry on the CF tree in the two-step clustering algorithm is the clustering feature defined in this article, which differs from that of the BIRCH algorithm.
2) The CF tree of the two-step clustering algorithm has three parameters: the branch balancing factor $\beta$, the leaf balancing factor $\lambda$ and the threshold $\tau$. The first two parameters are the same as in the BIRCH algorithm. Because of the mixed attributes, it is difficult to define the scatter of a cluster (in the BIRCH algorithm there are the radius and diameter measures), so the two-step clustering algorithm takes a new approach: since in BIRCH the scatter and the space threshold are only used to judge whether a cluster or a data point can be absorbed by a leaf entry of the CF tree, the two-step algorithm instead uses the distance between a cluster $C_{j'}$ (which may be a singleton cluster containing a single data point) and the cluster $C_j$ of a leaf entry to decide whether $C_{j'}$ can be absorbed by $C_j$: if $d(C_j, C_{j'}) \le \tau$, then $C_{j'}$ can be absorbed by $C_j$.
1.3 Outlier Processing (outlier-handling)
Outlier handling is an optional, not a mandatory, part of the two-step clustering algorithm. Once it is decided that outlier handling should be performed, potential outliers are gradually filtered out during the growth of the CF tree, and misjudged outliers are re-inserted into the CF tree.
1) Screening of potential outliers. Before the CF tree is rebuilt (slimmed), potential outliers are screened according to the number of data points contained in each leaf entry; this count is the first component ($N_j$) of the corresponding clustering feature. First, find the entry containing the most data points among all leaf entries of the current CF tree and record that count ($N_{\max}$); then, given a predetermined proportion parameter $\alpha \in [0,1]$, any leaf entry containing fewer than $\alpha N_{\max}$ data points is treated as a potential outlier and removed from the current CF tree (a minimal sketch follows after this list).
2) Re-insertion of false outliers. After the CF tree has been rebuilt (slimmed), the potential outliers are processed one by one: if a potential outlier can be absorbed into the CF tree without increasing the size of the current CF tree, it is considered a false outlier and is inserted back into the current CF tree.
After all data points of the dataset $\mathfrak D$ have been inserted into the CF tree, the entries that are still potential outliers are regarded as final outliers.
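A minimal sketch of the screening rule in step 1) above (the function name and the example value of $\alpha$ are mine, not from the original algorithm description):

```python
def screen_potential_outliers(leaf_cfs, alpha=0.25):
    """Split leaf entries into kept entries and potential outliers using the alpha * N_max rule."""
    n_max = max(cf.n for cf in leaf_cfs)
    kept = [cf for cf in leaf_cfs if cf.n >= alpha * n_max]
    potential_outliers = [cf for cf in leaf_cfs if cf.n < alpha * n_max]
    return kept, potential_outliers
```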
2. Cluster stage
The input to this stage is the set of sub-clusters corresponding to the leaf entries of the final CF tree output by the pre-clustering stage, denoted $C_1, \dots, C_{J_0}$. In fact, the input is not the sub-clusters with their specific data points, but only the clustering feature of each sub-cluster: $\vec{CF}_1, \dots, \vec{CF}_{J_0}$. The work of this stage is therefore to perform a second clustering of the sub-clusters $C_1, \dots, C_{J_0}$ based on the input $\vec{CF}_1, \dots, \vec{CF}_{J_0}$, finally producing a clustering with the desired number of clusters.
The two-step clustering algorithm uses the agglomerative hierarchical clustering idea, recursively merging the closest clusters, to achieve this. First, among the $J_0$ sub-clusters $C_1, \dots, C_{J_0}$, find the two closest and merge them into a single cluster; this completes the first step and leaves a clustering with $J_0 - 1$ clusters. Then again merge the two closest of the remaining clusters, and repeat this operation until all sub-clusters are merged into one large cluster, giving the clustering with a single cluster. In this way the clusterings $\mathbf C_{J_0}, \dots, \mathbf C_2, \mathbf C_1$ are obtained, and the clustering with the desired number of clusters is output from among them; for example, when the desired number of clusters is 5, $\mathbf C_5$ is output.
Since this procedure only involves distances between clusters, it can be carried out entirely on the basis of the clustering features and the log-likelihood distance, for example as sketched below.
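The following sketch illustrates this with a naive $O(J_0^3)$ pairwise search over the clustering features (the actual SPSS Modeler implementation is more efficient; the helper names come from the earlier sketches):

```python
def agglomerate(sub_cfs, global_var):
    """Merge the J_0 sub-clusters step by step and return every intermediate clustering.
    The returned list holds the CF lists of C_{J_0}, C_{J_0 - 1}, ..., C_1 in that order."""
    clusterings = [list(sub_cfs)]
    current = list(sub_cfs)
    while len(current) > 1:
        # find the pair of clusters with the smallest log-likelihood distance
        i, j = min(((a, b) for a in range(len(current)) for b in range(a + 1, len(current))),
                   key=lambda ab: log_likelihood_distance(current[ab[0]], current[ab[1]], global_var))
        merged = merge_cf(current[i], current[j])
        current = [cf for k, cf in enumerate(current) if k not in (i, j)] + [merged]
        clusterings.append(list(current))
    return clusterings
```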
2.1 Automatically determining the optimal number of clusters (Automatic Determination of Number of Clusters)
One of the characteristics of the two-step clustering algorithm is that it automatically determines the optimal number of clusters. The algorithm does this in two moves, a coarse estimation and a fine-tuning step, which together pinpoint the optimal number of clusters.
1) The coarse estimation mainly uses the Bayesian information criterion (BIC) to find the approximate range of the optimal number of clusters.
The Bayesian information criterion, also known as the Schwarz information criterion, is a criterion for model selection proposed by G. Schwarz [2].
Suppose there are $R$ candidate models $M_1, \dots, M_R$, where model $M_i$ is described by the probability function $f_i(x; \theta_i)$ and the parameter space of $\theta_i$ is $k_i$-dimensional, i.e. $\theta_i \in \Theta_i \subset \mathbb R^{k_i}$. Given $n$ observations $x_1, \dots, x_n$, how to select the model that best fits these observations is the question for which Schwarz proposed the BIC. The BIC is computed as $$ \begin{equation} \mathrm{BIC}(M_i) = -2 \ln f_i(x; \hat\theta_i) + k_i \ln n \end{equation} $$ where $\hat\theta_i$ is the maximum likelihood estimate of $\theta_i$ for $f_i(x; \theta_i)$. Among the candidate models, the one with the smallest BIC value is the optimal model for the $n$ observations.
Applying formula (9) to the clustering $\mathbf C_J = \{C_1, \dots, C_J\}$ gives $$ \begin{equation} \mathrm{BIC}(\mathbf C_J) = -2 L_{\mathbf C_J} + K \ln N \end{equation} $$ where, according to formula (16) of the article on the log-likelihood distance, $L_{\mathbf C_J}$ is the log-likelihood of the clustering and $$ L_{\mathbf C_J} = \sum_{j=1}^J L_{C_j} = \sum_{j=1}^J \zeta_j $$ with $\zeta_j$ as in formula (5) above. Since each continuous attribute contributes two parameters (mean and variance) and each categorical attribute contributes $\epsilon_t - 1$ parameters, $$ K = J \left( 2 D_1 + \sum_{t=1}^{D_2} (\epsilon_t - 1) \right) $$ and $N$ is the total number of data points contained in $\mathbf C_J$.
Substituting all the clusterings $\mathbf C_{J_0}, \dots, \mathbf C_2, \mathbf C_1$ into formula (10) yields the BIC value of every clustering, $\mathrm{BIC}(\mathbf C_1), \dots, \mathrm{BIC}(\mathbf C_{J_0})$. Based on the differences between the BIC values of adjacent clusterings, $$ \begin{equation} \Delta_{\mathrm{BIC}}(J) = \mathrm{BIC}(\mathbf C_J) - \mathrm{BIC}(\mathbf C_{J+1}), \quad J = 1, \dots, J_0 - 1, \end{equation} $$ the rate of change of the BIC value is computed: $$ \begin{equation} R_1(J) = \frac{\Delta_{\mathrm{BIC}}(J)}{\Delta_{\mathrm{BIC}}(1)}. \end{equation} $$ A preliminary estimate of the optimal number of clusters is then obtained from (12): if $\Delta_{\mathrm{BIC}}(1) < 0$, the optimal number of clusters is set to 1 and the subsequent fine-tuning step is skipped; otherwise, find all $J$ with $R_1(J) < 0.04$ and take the smallest such $J$ as the initial estimate of the optimal number of clusters, i.e. $J_I = \min \{ J \in \{1, \dots, J_0 - 1\} : R_1(J) < 0.04 \}$, as sketched below.
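A sketch of the coarse estimation (assuming, purely for illustration, that `clusterings[J-1]` holds the CF list of the $J$-cluster solution; if the list comes from the `agglomerate` sketch above it must first be reversed):

```python
def bic(cluster_cfs, global_var, n_values):
    """BIC of one clustering, per formula (10): -2 * sum_j zeta_j + K * ln(N)."""
    log_lik = sum(zeta(cf, global_var) for cf in cluster_cfs)
    d1, d2_params = len(global_var), sum(eps - 1 for eps in n_values)
    k = len(cluster_cfs) * (2 * d1 + d2_params)
    n_total = sum(cf.n for cf in cluster_cfs)
    return -2.0 * log_lik + k * np.log(n_total)

def initial_estimate(clusterings, global_var, n_values):
    """Coarse BIC-based estimate J_I of the number of clusters, formulas (11)-(12)."""
    bics = [bic(c, global_var, n_values) for c in clusterings]     # BIC(C_1), BIC(C_2), ...
    delta_1 = bics[0] - bics[1]
    if delta_1 < 0:
        return 1                                                   # optimal number of clusters is 1
    small = [j for j in range(1, len(bics)) if (bics[j - 1] - bics[j]) / delta_1 < 0.04]
    return min(small) if small else len(bics)
```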
2) Fine-tuning starts from the initial estimate $J_I$ of the optimal number of clusters and, based on the ratio of the closest-cluster distances of two adjacent clusterings, pinpoints the optimal number of clusters.
For a clustering $\mathbf C_J = \{C_1, \dots, C_J\}$, define the distance between the two closest clusters in $\mathbf C_J$ as $d_{\min}(\mathbf C_J) = \min \{ d(C_j, C_{j'}) : C_j \neq C_{j'} \in \mathbf C_J \}$, where $d(C_j, C_{j'})$ is the log-likelihood distance defined in (8).
The ratio of the closest-cluster distances of the clusterings $\mathbf C_J$ and $\mathbf C_{J+1}$ is defined as $$ \begin{equation} R_2(J) = \frac{d_{\min}(\mathbf C_J)}{d_{\min}(\mathbf C_{J+1})}, \quad J = J_I, \dots, 2. \end{equation} $$ From the definition of $R_2(J)$ and the agglomerative clustering process, $\mathbf C_J$ is obtained from $\mathbf C_{J+1}$ by merging its two closest clusters, so in general $d_{\min}(\mathbf C_{J+1}) \le d_{\min}(\mathbf C_J)$, i.e. in general $R_2(J) \ge 1$.
Among all the closest-cluster distance ratios $\{R_2(J) : J = 2, \dots, J_I\}$, find the two largest and record the corresponding numbers of clusters, $J_1 = \arg\max_{J \in \{2, \dots, J_I\}} R_2(J)$ and $J_2 = \arg\max_{J \in \{2, \dots, J_I\} \setminus \{J_1\}} R_2(J)$. The optimal number of clusters is then chosen from $J_1$ and $J_2$: $$ \begin{equation} J^* = \begin{cases} J_1, & \text{if } R_2(J_1) / R_2(J_2) > 1.15 \\ \max \{J_1, J_2\}, & \text{otherwise} \end{cases} \end{equation} $$ This completes the determination of the optimal number of clusters $J^*$, and $\mathbf C_{J^*}$ is output as the final clustering result.
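A sketch of the fine-tuning step under the same indexing assumption as before (`clusterings[J-1]` is the $J$-cluster solution; function and variable names are mine):

```python
def refine_estimate(clusterings, j_initial, global_var):
    """Refine J_I into J* using the closest-cluster distance ratios, formulas (13)-(14)."""
    def d_min(cfs):
        return min(log_likelihood_distance(cfs[a], cfs[b], global_var)
                   for a in range(len(cfs)) for b in range(a + 1, len(cfs)))
    if j_initial <= 2:
        return j_initial
    r2 = {j: d_min(clusterings[j - 1]) / d_min(clusterings[j]) for j in range(2, j_initial + 1)}
    j1, j2 = sorted(r2, key=r2.get, reverse=True)[:2]
    return j1 if r2[j1] / r2[j2] > 1.15 else max(j1, j2)
```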
3. Cluster membership assignment (Cluster Membership Assignment)
After the computation above, the final clustering $\mathbf C_{J^*} = \{C_1, C_2, \dots, C_{J^*}\}$ has been obtained, but it is not yet known which data points each cluster actually contains; only the clustering features of the clusters are known. This step therefore assigns the data points of the dataset $\mathfrak D$ to the corresponding clusters.
If no outlier handling was performed during the algorithm, the assignment is very simple: each data point $\vec x_i$ of the dataset $\mathfrak D$ is treated as a singleton cluster, the log-likelihood distance $d(\{\vec x_i\}, C_j)$ between the data point and each cluster of $\mathbf C_{J^*}$ is computed, and the data point is assigned to the nearest cluster.
However, if outliers are considered during the algorithm, the threshold $$ \begin{equation} \tau_d = \sum_{s=1}^{D_1} \ln \rho_s + \sum_{t=1}^{D_2} \ln \epsilon_t \end{equation} $$ is set, where $\rho_s$ is the range of values of the $s$-th continuous attribute and $\epsilon_t$ is the number of values of the $t$-th categorical attribute. For a data point $\vec x_i$, if $d(\{\vec x_i\}, C_{j^*}) = \min_{C_j \in \mathbf C_{J^*}} \{ d(\{\vec x_i\}, C_j) \}$ and $d(\{\vec x_i\}, C_{j^*}) < \tau_d$, then $\vec x_i$ is assigned to cluster $C_{j^*}$; otherwise the data point $\vec x_i$ is an outlier.
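A sketch of this assignment rule (names are mine; `ranges` is assumed to hold the value ranges $\rho_s$ and `n_values` the numbers of categories $\epsilon_t$):

```python
def assign_point(x_cont, x_cat, final_cfs, global_var, ranges, n_values):
    """Assign one data point to its nearest final cluster, or return -1 if it is an outlier."""
    point_cf = CF.from_points([x_cont], [x_cat], n_values)       # the point as a singleton cluster
    dists = [log_likelihood_distance(point_cf, cf, global_var) for cf in final_cfs]
    tau_d = np.log(ranges).sum() + np.log(n_values).sum()        # threshold of formula (15)
    best = int(np.argmin(dists))
    return best if dists[best] < tau_d else -1
```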
References
[1] Chiu T, Fang D P, Chen J, et al. A robust and scalable clustering algorithm for mixed type attributes in large database environment[C]// Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2001: 263-268.
[2] Schwarz G. Estimating the dimension of a model[J]. Annals of Statistics, 1978, 6(2): 461-464.