Spectral clustering

1. Spectral clustering

Suppose I give you a number of blog posts from Blog Park (cnblogs) and ask you to divide them into k categories. What would you do? There are many possible approaches; this article introduces one of them: spectral clustering.
The intuitive interpretation of clustering is to divide samples into different groups based on the similarity between them. The idea of spectral clustering is to regard the samples as vertices and the similarities between samples as weighted edges, which turns the clustering problem into a graph-partitioning problem: find a partition of the graph such that the weight of the edges between groups is as low as possible (meaning the similarity between groups is low) and the weight of the edges within each group is as high as possible (meaning the similarity within each group is high). In the example above, each blog post becomes a vertex on the graph, vertices are connected according to their similarity, and the graph is then partitioned; vertices that remain connected after the partition belong to the same class. For a more concrete example, see the figure below:

[Figure: a weighted graph with six vertices; dashed edges mark the edges removed by the cut]
The figure contains six vertices (blog posts), and the lines between vertices indicate the similarity of the two endpoints. Now we want to partition the graph into two halves (two classes). How should it be split, i.e., which edges should be removed? Following the idea of spectral clustering, the edges to be removed are drawn as dashed lines; the two remaining halves correspond to the two classes.
Based on this idea we obtain two algorithms: unnormalized and normalized spectral clustering. Because the former is simpler than the latter, this article introduces the steps of unnormalized spectral clustering (assuming we want k classes):
(A) Build a similarity graph, and let W denote its weighted adjacency matrix.
(B) Compute the unnormalized graph Laplacian matrix L (L = D - W, where D is the degree matrix).
(C) Compute the eigenvectors corresponding to the k smallest eigenvalues of L.
(D) Arrange these k eigenvectors as the columns of an n × k matrix, treat each row as a vector in k-dimensional space, and cluster the rows with the k-means algorithm.

 

2. Analysis of algorithm principles

This section explains where the four steps of unnormalized spectral clustering come from, without going into detailed formula derivations.
(A) The idea of spectral clustering is to convert clustering into graph partitioning, so the first step is to turn the original problem into a graph. Two questions must be answered in this conversion: first, how to define the edge between two vertices; second, which edges should be kept.
For the first question: if two points are sufficiently similar, an edge is added between them, and the degree of similarity becomes the edge weight (the value on the edge). In principle any similarity measure will do, but the most commonly used one is the Gaussian similarity function:

s(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)
The reasons for keeping only some of the edges are: a graph with too many edges is hard to handle, and edges with very low weight carry little information anyway. A common construction is the k-nearest-neighbor graph, in which each vertex is connected only to the k vertices most similar to it (this k is a parameter of the graph construction, not the number of clusters); a sketch of this construction follows.
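As an illustration, here is a minimal MATLAB sketch of the k-nearest-neighbor similarity graph; the function name knn_similarity_graph and the parameters knn and sigma are my own choices, not part of the original post:

function W = knn_similarity_graph(X, knn, sigma)
% Build a symmetric k-nearest-neighbor similarity graph.
% X     : n-by-d data matrix (one sample per row)
% knn   : number of nearest neighbors kept per vertex
% sigma : bandwidth of the Gaussian similarity function
n = size(X, 1);
% pairwise squared Euclidean distances (clamped at 0 for numerical safety)
sq = sum(X.^2, 2);
D2 = max(sq + sq' - 2 * (X * X'), 0);
% Gaussian similarity: s(xi, xj) = exp(-||xi - xj||^2 / (2*sigma^2))
S = exp(-D2 / (2 * sigma^2));
S(1:n+1:end) = 0;                    % no self-loops
% keep only the knn strongest edges per vertex
W = zeros(n);
[~, idx] = sort(S, 2, 'descend');
for i = 1:n
    W(i, idx(i, 1:knn)) = S(i, idx(i, 1:knn));
end
% symmetrize: connect i and j if either is among the other's neighbors
W = max(W, W');
end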

 

(B) The unnormalized graph Laplacian matrix (denoted L below) has many useful properties, which is why such a matrix is computed in step 2. The most important ones are the following (they are proved in [1]):

(1) For every vector f \in \mathbb{R}^n: f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.
(2) L is symmetric and positive semi-definite.
(3) The smallest eigenvalue of L is 0, and the corresponding eigenvector is the constant one vector.
(4) L has n non-negative real eigenvalues 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n.

This group of properties plays a decisive role in the formula derivations that follow.
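A quick numerical check of these properties, on a toy two-component graph of my own choosing (not from the original post):

% Toy graph: two connected components {1,2,3} and {4,5}
W = [0 1 1 0 0;
     1 0 1 0 0;
     1 1 0 0 0;
     0 0 0 0 1;
     0 0 0 1 0];
D = diag(sum(W, 2));
L = D - W;
% Property (1): f'*L*f = (1/2) * sum_ij w_ij * (f_i - f_j)^2 for any f
f = randn(5, 1);
[I, J] = ndgrid(1:5, 1:5);
lhs = f' * L * f;
rhs = 0.5 * sum(sum(W .* (f(I) - f(J)).^2));
fprintf('quadratic form: %.6f vs %.6f\n', lhs, rhs);
% Properties (2)-(4): the eigenvalues are real and non-negative, and
% eigenvalue 0 appears once per connected component (twice here)
ev = sort(eig(L));
disp(ev');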

 

(C) After converting the original problem into a graph, the next step is to decide how to partition it. The graph-partitioning problem is essentially the mincut problem, which can be defined as minimizing the following objective function:

\mathrm{cut}(A_1, \dots, A_k) = \frac{1}{2} \sum_{i=1}^{k} W(A_i, \overline{A_i})

Here k is the number of groups, A_i is the i-th group, \overline{A_i} is the complement of A_i, and W(A, B) is the sum of the weights of all edges between group A and group B.
This formula has an intuitive meaning: if the graph is to be split into k groups, the cost is the sum of the weights of the edges removed by the split. Unfortunately, directly minimizing this objective usually yields poor partitions. Take k = 2 as an example: this objective often separates a single vertex from the graph, making one vertex its own class and all remaining vertices the other class. Such a split is obviously bad, because we expect each class to have a reasonable size. The improved objective, RatioCut, is therefore:

\mathrm{RatioCut}(A_1, \dots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \overline{A_i})}{|A_i|}

Here |A_i| denotes the number of vertices contained in group A_i.
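A small worked example (the numbers are my own, not from the original post) shows why the |A_i| terms matter. Consider five vertices with edges w(1,2) = 1, w(3,4) = 1, w(2,3) = 0.2 and w(4,5) = 0.15, and let k = 2. Plain mincut prefers to split off the single vertex {v5}, since that cut costs only 0.15. Under RatioCut (ignoring the global 1/2 factor, which does not affect the comparison), the singleton split costs 0.15/1 + 0.15/4 = 0.1875, while the balanced split {v1, v2} versus {v3, v4, v5} costs 0.2/2 + 0.2/3 ≈ 0.167, so RatioCut prefers the balanced partition.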
In RatioCut, the fewer vertices a group contains, the larger its term becomes. In a minimization problem this acts as a penalty, i.e., it discourages groups that are too small. Now we only need to find the partition minimizing RatioCut, and the split is complete. Unfortunately, this is an NP-hard problem. To solve it in polynomial time, a transformation is needed, and the transformation uses the properties of L listed above. After some derivation, the following problem is obtained:

\min_{A_1, \dots, A_k} \mathrm{Tr}(H^T L H) \quad \text{subject to} \quad H^T H = I

H is an n × k matrix whose entries are defined (Eq. (5)) as:

h_{i,j} = \begin{cases} 1/\sqrt{|A_j|} & \text{if } v_i \in A_j \\ 0 & \text{otherwise} \end{cases} \qquad (5)

If entry h_{i,j} is nonzero, point i belongs to class j. In other words, once the H matrix is known, the partition is known. Unfortunately, this problem is still NP-hard. However, if we allow the entries of H to take arbitrary real values, it becomes solvable in polynomial time, and the problem becomes:

\min_{H \in \mathbb{R}^{n \times k}} \mathrm{Tr}(H^T L H) \quad \text{subject to} \quad H^T H = I

By the Rayleigh-Ritz theorem, the solution of this relaxed problem is the matrix H whose columns are the eigenvectors corresponding to the k smallest eigenvalues of L, one eigenvector per column.
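A quick sanity check of this fact in MATLAB (the test matrix is my own choice; for a symmetric positive semi-definite matrix, the minimum of the trace objective equals the sum of the k smallest eigenvalues):

% min Tr(H'*A*H) s.t. H'*H = I is attained by the bottom-k eigenvectors
A = gallery('lehmer', 6);        % a symmetric positive definite test matrix
k = 2;
[U, S] = eig(A);
[vals, ord] = sort(diag(S));
H = U(:, ord(1:k));              % eigenvectors of the 2 smallest eigenvalues
fprintf('Tr(H''AH) = %.4f, sum of 2 smallest eigenvalues = %.4f\n', ...
        trace(H' * A * H), sum(vals(1:k)));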

 

(D) In step 3, we allowed the entries of H to take arbitrary values in order to escape the NP-hardness. As a consequence, H loses its original property that a nonzero entry directly indicates which class a point belongs to. Even so, the rows of H can easily be clustered by k-means, each row being treated as a point. Therefore, the k-means clustering of the rows of H is taken as the final result of spectral clustering.

 

3. Implementation of spectral clustering

The following is a MATLAB implementation of unnormalized spectral clustering:

function [C, L, D, Q, V] = SpectralClustering(W, k)
% Unnormalized spectral clustering algorithm
% Input:  adjacency matrix W; number of clusters k
% Return: cluster indicator vectors as columns of C; unnormalized Laplacian L;
%         degree matrix D; eigenvector matrix Q; eigenvalue matrix V

% calculate the degree matrix
degs = sum(W, 2);
D = sparse(1:size(W, 1), 1:size(W, 2), degs);

% compute the unnormalized Laplacian
L = D - W;

% compute the eigenvectors corresponding to the k smallest eigenvalues:
% diagonal matrix V holds the k smallest (algebraic) eigenvalues of L,
% and the columns of Q are the corresponding eigenvectors
[Q, V] = eigs(L, k, 'SA');

% use the k-means algorithm to cluster the rows of Q
% C will be an n-by-1 vector containing the cluster number of each data point
C = kmeans(Q, k);

% convert C to an n-by-k matrix containing the k indicator vectors as columns
C = sparse(1:size(D, 1), C, 1);
end
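A minimal usage sketch (the toy data, sigma = 1, and the fully connected similarity graph are my own assumptions, chosen only for illustration):

% Toy data: two well-separated Gaussian blobs in the plane
rng(0);
X = [randn(50, 2); randn(50, 2) + 5];
% Fully connected Gaussian similarity graph with sigma = 1
sq = sum(X.^2, 2);
D2 = max(sq + sq' - 2 * (X * X'), 0);
W = exp(-D2 / 2);
W(1:size(W, 1)+1:end) = 0;       % remove self-loops
% Cluster into k = 2 groups; column j of C marks the members of cluster j
[C, L, D, Q, V] = SpectralClustering(W, 2);
fprintf('cluster sizes: %d %d\n', full(sum(C)));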

 

4. Related Materials

If you want to understand spectral clustering more deeply, [1] is strongly recommended. If you want a broader understanding of clustering in general, [2] is strongly recommended.
[1] Ulrike von Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17(4), 2007.
[2] The "Talking about Clustering" blog series.

