Spectral Clustering algorithm

Reproduced from: "Clustering Algorithms": Spectral Clustering

1. Problem description

Spectral clustering (SC) is a clustering method based on graph theory: it partitions a weighted undirected graph into two or more optimal sub-graphs, so that the vertices inside each sub-graph are as similar as possible and the sub-graphs are as far apart (dissimilar) as possible, thereby achieving the common goal of clustering.

The relevant definitions for graphs are as follows:

    • For a graph G = (V, E), V is the set of vertices, i.e. the set of samples (each vertex is a sample), and E is the set of edges.
    • Let the number of samples be n, so the number of vertices is also n.
    • Weight matrix W: an n×n matrix whose entry w_{i,j} is the weight of the edge between vertices i and j, representing the similarity between samples i and j. For any i, j, w_{i,j} = w_{j,i}, and w_{i,i} = 0, i.e. the elements on the diagonal are 0.
    • Typically, two vertices whose similarity is below a certain threshold are not connected; otherwise, the weight of the edge connecting the two vertices is the value of the similarity function evaluated on the two samples.
    • Define the n×n diagonal matrix D: its element in row i, column i (on the diagonal) is the sum of all elements of row i of W, i.e. the sum of the similarities between vertex i and all other vertices.

    • When the graph G is partitioned into sub-graphs G1 and G2, the loss function is the sum of the weights of the edges that are cut: cut(G1, G2) = Σ_{i∈G1, j∈G2} w_{i,j}.

For example, given a graph of six samples (the figure is omitted here), the loss of the partition shown there is w_{1,5} + w_{3,4} = 0.3.
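
As a concrete illustration (the article's figure is not reproduced), the following sketch builds W, D and a cut value for a hypothetical six-sample graph; the weights are made up so that the cut of the partition {1,2,3} | {4,5,6} also comes out to 0.3 (indices in the code are 0-based):

    import numpy as np

    # Hypothetical 6-sample weight matrix W (symmetric, zero diagonal);
    # these numbers are invented, not the weights of the original figure.
    W = np.array([
        [0.0, 0.8, 0.6, 0.0, 0.1, 0.0],
        [0.8, 0.0, 0.8, 0.0, 0.0, 0.0],
        [0.6, 0.8, 0.0, 0.2, 0.0, 0.0],
        [0.0, 0.0, 0.2, 0.0, 0.8, 0.7],
        [0.1, 0.0, 0.0, 0.8, 0.0, 0.8],
        [0.0, 0.0, 0.0, 0.7, 0.8, 0.0],
    ])

    # Diagonal degree matrix D: entry d_ii is the sum of row i of W.
    D = np.diag(W.sum(axis=1))

    def cut(W, G1, G2):
        """Sum of the weights of the edges running between the two sub-graphs."""
        return W[np.ix_(G1, G2)].sum()

    G1, G2 = [0, 1, 2], [3, 4, 5]      # a candidate bipartition
    print(cut(W, G1, G2))              # w_{0,4} + w_{2,3} = 0.1 + 0.2 = 0.3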

The goal of spectral clustering is to find a good partitioning criterion that divides the whole sample space into individual sub-graphs, each sub-graph being one cluster. According to the criterion used to evaluate the partition, spectral clustering methods can be divided into different variants (Minimum cut, Ratio cut, Normalized cut, etc.).

Before the algorithm itself, review a few conclusions from linear algebra (consult a reference if any of them is unfamiliar):

    • Ax = λx: λ is an eigenvalue of A, and x is the eigenvector corresponding to λ.
    • For a real symmetric matrix A, eigenvectors corresponding to different eigenvalues are orthogonal, i.e. when i ≠ j, <x_i, x_j> = 0 (<,> denotes the inner product).
    • For a positive definite matrix, all eigenvalues are greater than 0; for a positive semi-definite matrix, all eigenvalues are greater than or equal to 0.
2. Problem transformation

First, look at this loss function and transform it as follows:

1. Define q_i as follows:

   q_i = c1 when vertex i belongs to sub-graph G1, and q_i = c2 when vertex i belongs to sub-graph G2.

2. Transformation of cut(G1, G2):

   cut(G1, G2) = (1/2) Σ_{i,j} w_{i,j} (q_i - q_j)^2 / (c1 - c2)^2

If and only if i and j belong to different sub-graphs does (q_i - q_j)^2 / (c1 - c2)^2 = 1; otherwise (q_i - q_j)^2 / (c1 - c2)^2 = 0. The constant 1/2 is there because iterating over every i and every j counts the weight of each cut edge twice, so we divide by 2.

3. Transformation of the numerator of cut(G1, G2):

   Σ_{i,j} w_{i,j} (q_i - q_j)^2 = Σ_{i,j} w_{i,j} (q_i^2 - 2 q_i q_j + q_j^2) = 2 Σ_i d_i q_i^2 - 2 Σ_{i,j} w_{i,j} q_i q_j = 2 q^T D q - 2 q^T W q

4. The Laplacian matrix L = D - W satisfies:

   q^T L q = q^T D q - q^T W q = (1/2) Σ_{i,j} w_{i,j} (q_i - q_j)^2
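
As a quick numerical check of this identity, the following sketch builds a random symmetric weight matrix (the weights are arbitrary, purely for illustration) and verifies that q^T L q equals one half of the weighted sum of squared differences:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    # Random symmetric weight matrix with zero diagonal (stand-in for a similarity graph).
    A = rng.random((n, n))
    W = np.triu(A, 1) + np.triu(A, 1).T
    D = np.diag(W.sum(axis=1))
    L = D - W                              # unnormalized graph Laplacian

    q = rng.standard_normal(n)             # any real vector q
    lhs = q @ L @ q
    rhs = 0.5 * sum(W[i, j] * (q[i] - q[j]) ** 2 for i in range(n) for j in range(n))
    print(np.isclose(lhs, rhs))            # True: q^T L q = 1/2 * sum_{i,j} w_ij (q_i - q_j)^2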

5. Problem transformation. From step 3 the numerator equals 2 q^T L q, so substituting into step 2 and summing up the whole deduction gives the following formula:

   cut(G1, G2) = q^T L q / (c1 - c2)^2

Because w_{i,j} ≥ 0, we have q^T L q ≥ 0 for any q ≠ 0, so L is a positive semi-definite matrix; L is also a real symmetric matrix. It has the following three properties:

    1. All eigenvalues of L are ≥ 0, and eigenvectors corresponding to different eigenvalues are orthogonal.
    2. L has an eigenvalue equal to 0, whose corresponding eigenvector is [1, ..., 1]^T; the specific meaning of this is described later.
    3. Every eigenvector with a non-zero eigenvalue has inner product 0 with [1, ..., 1]^T.

Property 1 was already mentioned among the conclusions at the beginning of the article and is not elaborated here. For property 2, let us take a closer look at L. For the article's original sample set, one can write down the corresponding W, D and L matrices.

For the vector e = [1, 1, 1, 1, 1, 1]^T we always have L·e = 0 = 0·e (each row of L sums to zero), so 0 is always an eigenvalue of L, and the eigenvector corresponding to eigenvalue 0 is [1, ..., 1]^T. Once property 2 is understood, property 3 follows naturally.
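
The following sketch (again with arbitrary random weights) checks these properties numerically: the all-ones vector is an eigenvector with eigenvalue 0, all eigenvalues are non-negative, and the eigenvectors are mutually orthogonal:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    A = rng.random((n, n))
    W = np.triu(A, 1) + np.triu(A, 1).T        # symmetric weights, zero diagonal
    L = np.diag(W.sum(axis=1)) - W             # L = D - W

    ones = np.ones(n)
    print(np.allclose(L @ ones, 0.0))          # True: L * [1,...,1]^T = 0, so 0 is an eigenvalue

    vals, vecs = np.linalg.eigh(L)             # eigh: L is real symmetric
    print(vals.min() > -1e-10)                 # all eigenvalues >= 0 (up to round-off)
    print(np.allclose(vecs.T @ vecs, np.eye(n)))  # the eigenvectors are mutually orthogonal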

Therefore, the problem of minimizing the loss function cut(G1, G2) is converted into minimizing the quadratic form q^T L q. The constraints differ for the different criteria, and the problem can be solved using properties of the Rayleigh quotient, which are introduced next.

3. Classification criteria

First, let us look at the problem of optimizing a quadratic form such as q^T L q. Before proceeding, take a look at the Rayleigh quotient (see Wikipedia); only the relevant properties are listed here:

The Rayleigh quotient is defined as R(M, x) = (x^T M x) / (x^T x).

    • For a given M, the minimum value of R(M, x) is λmin (the smallest eigenvalue of M), attained if and only if x = vmin (the corresponding eigenvector); similarly, R(M, x) ≤ λmax, and R(M, vmax) = λmax.

The critical points (extrema) of the quadratic form can be found with the method of Lagrange multipliers (for the detailed derivation see Rayleigh quotient: formulation using Lagrange multipliers):

    • For the quadratic form x^T M x, find the extrema subject to x^T x = 1.
    • After adding the Lagrange multiplier and taking the derivative, we obtain Mx = λx; that is, R(M, x) attains an extremum when x is an eigenvector of M, and substituting back into the formula above gives the extremum R(M, x) = λ, the corresponding eigenvalue.
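
A small numerical illustration of these Rayleigh quotient properties, using an arbitrary random symmetric matrix M:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5
    B = rng.random((n, n))
    M = B + B.T                            # a real symmetric matrix

    def rayleigh(M, x):
        return (x @ M @ x) / (x @ x)

    vals, vecs = np.linalg.eigh(M)         # eigenvalues in ascending order
    x = rng.standard_normal(n)

    # R(M, x) is bounded by the extreme eigenvalues, with equality at the eigenvectors.
    print(vals[0] <= rayleigh(M, x) <= vals[-1])            # True
    print(np.isclose(rayleigh(M, vecs[:, 0]), vals[0]))     # R(M, v_min) = lambda_min
    print(np.isclose(rayleigh(M, vecs[:, -1]), vals[-1]))   # R(M, v_max) = lambda_max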

The final expression of Section 2 is emphasized again for later reading; it is written as equation (1):

   cut(G1, G2) = q^T L q / (c1 - c2)^2     (1)

3.1. Minimum Cut method

The objective function of Minimum Cut is formula (1). The particular values chosen for c1 and c2 do not affect the classification result (they must not be equal, of course: c1 is the label of the samples belonging to G1 and c2 the label of the samples belonging to G2, and equal labels cannot be distinguished), but they do affect the solution process: c1 and c2 determine whether the conditions for applying the Rayleigh quotient are satisfied. To facilitate the solution, we choose the following.

When c1 = -c2 = 1, q becomes: q_i = 1 for i ∈ G1 and q_i = -1 for i ∈ G2.

Minimizing formula (1) then becomes:

   min_q q^T L q,  s.t.  q_i ∈ {1, -1},  q ⟂ e

Regarding the constraints: the first says that the elements of the vector q may only take the values 1 or -1; for the second, e is the all-ones vector mentioned above, it is the eigenvector of the smallest eigenvalue of L, and all eigenvectors of L are mutually orthogonal.

The way to solve this problem was already mentioned at the beginning of Section 3 and in 3.1: the optimal classification vector q is the eigenvector corresponding to the smallest eigenvalue of L. The smallest eigenvalue of L is 0 (which would be the minimum of the objective function), and its corresponding eigenvector is e. In other words, one can always find a scheme that makes the objective function 0 (the sum of the weights of the cut edges is 0): all samples belong to class G1 (since q is then all ones, corresponding to i ∈ G1) and 0 samples belong to class G2. This classification always exists but is meaningless, so it is excluded (which is exactly the effect of the second constraint).

To work around this, we instead compute the eigenvector of the second-smallest eigenvalue of L and cluster using that eigenvector. At this point the problem has changed: the discrete problem is relaxed into a continuous one (the NP-hard problem becomes a P problem), and the result is discretized at the end.

    • Continuous problem: minimizing the quadratic form q^T L q => finding the eigenvalues and eigenvectors of L.
    • Discretization: the original q_i was defined as 1 for i ∈ G1 and -1 for i ∈ G2. The q finally obtained does not take these discrete values; the magnitude of each entry is only an indication. One can easily find a reasonable threshold to split the final q, e.g. q_i > 0 means i belongs to G1 and q_i < 0 means i belongs to G2.
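
Putting the two steps together, a minimal sketch of this relaxed Minimum Cut bipartition might look as follows; the weight matrix is hypothetical and simply contains two well-connected groups of vertices:

    import numpy as np

    # Hypothetical similarity graph with two well-separated groups of vertices.
    W = np.array([
        [0.0, 0.9, 0.8, 0.1, 0.0, 0.0],
        [0.9, 0.0, 0.9, 0.0, 0.0, 0.0],
        [0.8, 0.9, 0.0, 0.0, 0.1, 0.0],
        [0.1, 0.0, 0.0, 0.0, 0.8, 0.9],
        [0.0, 0.0, 0.1, 0.8, 0.0, 0.8],
        [0.0, 0.0, 0.0, 0.9, 0.8, 0.0],
    ])
    L = np.diag(W.sum(axis=1)) - W

    vals, vecs = np.linalg.eigh(L)
    q = vecs[:, 1]                          # eigenvector of the second-smallest eigenvalue
    labels = np.where(q > 0, 1, 2)          # discretize: q_i > 0 -> G1, q_i < 0 -> G2
    print(labels)                           # e.g. [1 1 1 2 2 2] (or with the labels swapped)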

Problem: such an objective function ignores the existence of outliers. For example:

If w_{H,C} < w_{B,D} + w_{C,G}, which in the figure corresponds to 0.3 < 0.2 + 0.2, then vertex H forms one class and all the other points form the other class, because that yields the smallest cut in the diagram. Such a classification is obviously unreasonable; we would prefer the "best cut" result instead. To avoid this situation, the Ratio Cut method was introduced to keep the cluster sizes relatively balanced.

3.2. Ratio Cut method

First look at the objective function of Ratio Cut, formula (2):

   Rcut(G1, G2) = cut(G1, G2) / n1 + cut(G1, G2) / n2     (2)

where n1 is the number of vertices belonging to G1, and n2 likewise. Analyzing the example above: for the smallest cut in the figure, Rcut(G1, G2) = 0.3/1 + 0.3/7 ≈ 0.34; for the best cut in the figure, Rcut(G1, G2) = (0.2 + 0.2)/4 + (0.2 + 0.2)/4 = 0.2. This clearly avoids the problem: not only the weight of the cut edges is considered, but also the balance of the number of samples in each class.

To convert this into a Rayleigh quotient problem, q_i is defined as follows:

   q_i = √(n2/n1) if i ∈ G1,  q_i = -√(n1/n2) if i ∈ G2

Substituting into formula (2) (and remembering that n1 + n2 = n, with n a constant) gives:

   Rcut(G1, G2) = q^T L q / n

At this point the problem is transformed into one that the Rayleigh quotient can solve:

   min_q q^T L q

with the constraint on q^T q:

   q^T q = n1 · (n2/n1) + n2 · (n1/n2) = n1 + n2 = n   (and q ⟂ e)

The remaining work is the same as in Section 3.1.
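
For reference, a small sketch of the Ratio Cut objective as a function, together with the arithmetic used in the comparison above (the 8-vertex weights of the figure are not reproduced, only the resulting cut values):

    import numpy as np

    def cut(W, G1, G2):
        """Sum of the weights of the edges running between the two vertex sets."""
        return W[np.ix_(G1, G2)].sum()

    def ratio_cut(W, G1, G2):
        """Rcut(G1, G2) = cut(G1, G2)/n1 + cut(G1, G2)/n2."""
        c = cut(W, G1, G2)
        return c / len(G1) + c / len(G2)

    # The arithmetic from the text: a cut of weight 0.3 that isolates one vertex
    # out of eight scores 0.3/1 + 0.3/7 ~= 0.34, while a cut of weight 0.2 + 0.2
    # that splits the vertices 4/4 scores 0.4/4 + 0.4/4 = 0.2, so the balanced
    # partition wins even though its raw cut weight is larger.
    print(0.3 / 1 + 0.3 / 7)   # ~= 0.343
    print(0.4 / 4 + 0.4 / 4)   # 0.2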

3.3. Normalized Cut method

None of the methods above takes the edge weights inside the sub-graphs into account. Normalized Cut adds the weights inside each sub-graph. Its objective function is formula (3):

   Ncut(G1, G2) = cut(G1, G2) / d1 + cut(G1, G2) / d2     (3)

where d1 is the sum of all edge weights inside G1 plus cut(G1, G2), d2 is the sum of all edge weights inside G2 plus cut(G1, G2), and d = d1 + d2 - cut(G1, G2). In the notation of the figure: d1 = assoc(A, A) + cut(A, B) and d2 = assoc(B, B) + cut(A, B).

To transform the problem, define q_i as follows:

   q_i = √(d2/d1) if i ∈ G1,  q_i = -√(d1/d2) if i ∈ G2

Substituting into formula (3) gives:

   Ncut(G1, G2) = q^T L q / (q^T D q)

The problem then turns into (in fact this is a generalized Rayleigh quotient):

   min_q q^T L q

where the constraints are:

   q^T D q = d1 + d2 (a constant),  q^T D e = 0

The solution here still proceeds by differentiating the objective together with the first constraint via a Lagrange multiplier; because a D matrix appears in the constraint, the result is slightly different from the previous one (Mx = λx): we obtain the generalized eigenproblem L q = λ D q.

Step 1: from the Lagrangian condition, L q = λ D q.

Step 2: substitute q = D^{-1/2} q', i.e. q' = D^{1/2} q.

Step 3: then L D^{-1/2} q' = λ D^{1/2} q'; left-multiplying by D^{-1/2} gives D^{-1/2} L D^{-1/2} q' = λ q'.

Step 4: therefore q' is an eigenvector of L' = D^{-1/2} L D^{-1/2} with eigenvalue λ, and the original q is recovered as q = D^{-1/2} q'.

At this point, the eigenvalues of the normalized Laplacian matrix (normalized Laplacian, whose diagonal elements are all 1) L' = D^{-1/2} L D^{-1/2} and their corresponding eigenvectors can be computed. The generalized problem L q = λ D q and the standard problem L' q' = λ q' share the same eigenvalues, and the eigenvectors are related by q' = D^{1/2} q; so one computes the eigenvector of L' corresponding to the desired eigenvalue and then obtains q by multiplying by D^{-1/2}. (The eigenvectors mentioned above are those of the second-smallest eigenvalue.)
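
A minimal sketch of this last step, assuming an arbitrary random weight matrix: build L' = D^{-1/2} L D^{-1/2}, check that its diagonal is all ones, take the eigenvector q' of the second-smallest eigenvalue, and recover q = D^{-1/2} q':

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    A = rng.random((n, n))
    W = np.triu(A, 1) + np.triu(A, 1).T
    d = W.sum(axis=1)
    L = np.diag(d) - W

    # Normalized Laplacian L' = D^{-1/2} L D^{-1/2} (unit diagonal).
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt
    print(np.allclose(np.diag(L_sym), 1.0))        # diagonal elements are all 1

    vals, vecs = np.linalg.eigh(L_sym)
    q_prime = vecs[:, 1]                           # eigenvector of the 2nd-smallest eigenvalue of L'
    q = D_inv_sqrt @ q_prime                       # recover q via q = D^{-1/2} q'
    labels = np.where(q > 0, 1, 2)                 # sign-based discretization of q
    print(labels)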

4. Summary

Everything above concerns clustering into two classes. When using spectral clustering for k-way clustering, one can take the eigenvectors (each of size n×1) corresponding to the k smallest eigenvalues excluding 0, and stack them as columns into a feature matrix of size n×k. In this matrix, each row vector is the feature-space representation of the corresponding sample. Finally, another clustering algorithm such as k-means is used to cluster these rows.

According to the classification criterion used, spectral clustering can be divided into two types: unnormalized spectral clustering and normalized spectral clustering. The difference is whether the Laplacian matrix is normalized; Ratio Cut and Minimum Cut are both unnormalized.

1. Unnormalized Spectral Clustering algorithm

Algorithm input: sample similarity matrix S and the number of clusters k.
    • Build the weight matrix W and the diagonal degree matrix D from the matrix S;
    • Build the Laplacian matrix L;
    • Compute the k smallest eigenvalues of L (excluding 0) and their corresponding eigenvectors;
    • Form a new matrix from these k eigenvectors, with as many rows as there are samples and k columns; this is effectively a dimensionality reduction from n dimensions to k dimensions;
    • Use another clustering algorithm such as k-means to obtain the k clusters.
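
A minimal sketch of these steps (the function name and the choice of scikit-learn's KMeans are mine, not from the original article):

    import numpy as np
    from sklearn.cluster import KMeans

    def unnormalized_spectral_clustering(S, k):
        """Sketch of the unnormalized variant: S is an n x n similarity matrix."""
        W = S.copy()
        np.fill_diagonal(W, 0.0)                 # weight matrix: zero diagonal
        D = np.diag(W.sum(axis=1))               # diagonal degree matrix
        L = D - W                                # unnormalized Laplacian
        vals, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
        U = vecs[:, 1:k + 1]                     # k smallest eigenvectors excluding the 0-eigenvector, n x k
        return KMeans(n_clusters=k, n_init=10).fit_predict(U)   # cluster the rows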

2. Normalized Spectral Clustering algorithm

Algorithm input: sample similarity matrix S and the number of clusters k.
    • Build the weight matrix W and the diagonal degree matrix D from the matrix S;
    • Build the Laplacian matrix L and L' = D^{-1/2} L D^{-1/2};
    • Compute the k smallest eigenvalues of L' (excluding 0) and their corresponding eigenvectors q';
    • Use q' = D^{1/2} q to obtain the corresponding k vectors q (q is not an eigenvector of L');
    • Form a new matrix from these k vectors, with n rows (the number of samples) and k columns;
    • Use another clustering algorithm such as k-means to obtain the k clusters.
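
And a corresponding sketch of the normalized variant, under the same assumptions as above:

    import numpy as np
    from sklearn.cluster import KMeans

    def normalized_spectral_clustering(S, k):
        """Sketch of the normalized variant: S is an n x n similarity matrix."""
        W = S.copy()
        np.fill_diagonal(W, 0.0)
        d = W.sum(axis=1)
        L = np.diag(d) - W
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L_sym = D_inv_sqrt @ L @ D_inv_sqrt          # L' = D^{-1/2} L D^{-1/2}
        vals, vecs = np.linalg.eigh(L_sym)
        Q_prime = vecs[:, 1:k + 1]                   # eigenvectors of L' (k smallest, excluding 0)
        Q = D_inv_sqrt @ Q_prime                     # q = D^{-1/2} q'  (q is not an eigenvector of L')
        return KMeans(n_clusters=k, n_init=10).fit_predict(Q)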

The considerations at the various stages of spectral clustering are:

    • Select an appropriate similarity function to compute the similarity matrix and build the weight matrix W;
    • Compute the eigenvalues and eigenvectors of the matrix; for example, the Lanczos iterative algorithm can be used;
    • To choose k, a heuristic can be used: for example, if the 1st through m-th eigenvalues are all very small and the (m+1)-th suddenly jumps to a much larger value, then choose k = m (see the sketch after this list);
    • Use the k-means algorithm for the final clustering; of course, it is not the only choice;
    • Normalized spectral clustering is preferred, because it better achieves minimal similarity between clusters and maximal similarity within each cluster.
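
A possible sketch of the eigengap heuristic mentioned above for choosing k (the function name and the max_k cutoff are assumptions for illustration):

    import numpy as np

    def eigengap_k(L, max_k=10):
        """Heuristic sketch: place k at the largest gap among the smallest eigenvalues of L."""
        vals = np.sort(np.linalg.eigvalsh(L))[:max_k]   # smallest eigenvalues, ascending
        gaps = np.diff(vals)                            # gap between consecutive eigenvalues
        return int(np.argmax(gaps)) + 1                 # if the jump occurs after the m-th value, k = m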

Spectral Clustering Performance:

    • It performs better than traditional k-means because spectral clustering uses the elements of the eigenvectors to represent the original data and runs k-means on this "better representation"; this "better representation" is the result of the dimensionality reduction performed by the Laplacian eigenmap.
    • Its computational cost is smaller than that of k-means, especially for high-dimensional data. For example, text data is usually represented as a very high-dimensional (thousands or tens of thousands of dimensions) sparse matrix. There are very efficient methods for computing the eigenvalues and eigenvectors of a sparse matrix, and the result is a set of k-dimensional vectors (k is usually not large), on which running k-means is very cheap. Running k-means directly on the raw data, however, involves computing centroids, i.e. means: the mean of many sparse vectors is not necessarily sparse. In fact, for text data the centroid vectors are often very dense, so computing distances between vectors becomes very expensive, making plain k-means extremely slow, whereas spectral clustering, despite its extra steps, finishes faster.
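
As an illustration of the sparse case, the following sketch uses SciPy's ARPACK-based eigsh (a Lanczos-type solver) to extract only the few smallest eigenpairs of a large sparse Laplacian instead of a full eigendecomposition; the random sparse graph here is purely synthetic:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    # Synthetic sparse similarity graph, only to illustrate the sparse workflow.
    n, n_eig = 2000, 5
    A = sp.random(n, n, density=0.002, random_state=3)
    B = sp.triu(A, k=1)                       # keep the strict upper triangle
    W = (B + B.T).tocsr()                     # sparse symmetric weights, zero diagonal
    d = np.asarray(W.sum(axis=1)).ravel()
    L = sp.diags(d) - W                       # sparse unnormalized Laplacian

    # Shift-invert ARPACK solve: only the n_eig eigenpairs nearest the shift
    # (i.e. the smallest ones) are computed, not a full n x n eigendecomposition.
    vals, vecs = eigsh(L.tocsc(), k=n_eig, sigma=-1e-5)
    print(vals)                               # the n_eig smallest eigenvalues of L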
