The Principle of the Spectral Clustering Algorithm: An Introduction to Spectral Clustering

1. Spectral Clustering

Suppose you are given a collection of blogs and asked to divide them into k groups. How would you do it? There are many possible methods; this article introduces one of them: spectral clustering.
The intuitive interpretation of clustering is to divide samples into different groups according to their similarity. The idea of spectral clustering is to treat each sample as a vertex and the similarity between samples as a weighted edge, which turns the clustering problem into a graph-partitioning problem: find a partition such that the total weight of the edges between groups is as low as possible (meaning the similarity between groups is as low as possible), and the total weight of the edges within each group is as high as possible (meaning the similarity within each group is as high as possible). In the example above, each blog is a vertex on the graph, vertices are connected by edges according to their similarity, and the graph is then partitioned; vertices that remain connected after the partition belong to the same class. A more specific example is shown in the following illustration:

In the figure above there are 6 vertices (blogs), and the lines between vertices represent the similarity of the two endpoints. We now want to divide the graph into two parts (two classes): which edges should be removed? Following the idea of spectral clustering, the edges to remove are the ones crossed by the dashed line. The two remaining halves correspond to the two classes.
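To make the picture concrete, here is a toy weighted adjacency matrix for a graph like the one above. The weights are hypothetical (the original figure's exact values are not reproduced here); only the block structure matters:

% Toy 6-vertex similarity graph: vertices 1-3 form one cluster, 4-6 the other.
% The weights are made up for illustration.
W = [0   0.8 0.9 0   0   0  ;
     0.8 0   0.7 0.1 0   0  ;
     0.9 0.7 0   0   0.2 0  ;
     0   0.1 0   0   0.8 0.9;
     0   0   0.2 0.8 0   0.7;
     0   0   0   0.9 0.7 0  ];
% Removing the two weak cross edges (weights 0.1 and 0.2) splits the graph
% into {1,2,3} and {4,5,6}, just like cutting along the dashed line.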
From this idea we can derive unnormalized spectral clustering and normalized spectral clustering. Because the former is simpler than the latter, this article introduces the steps of unnormalized spectral clustering (suppose the data is to be divided into k classes):
(a) Build a similarity graph, and let W denote its weighted adjacency matrix.
(b) Compute the unnormalized graph Laplacian matrix L (L = D - W, where D is the degree matrix; a small numeric example follows this list).
(c) Compute the eigenvectors corresponding to the k smallest eigenvalues of L.
(d) Arrange these k eigenvectors as the columns of an n*k matrix, treat each row as a vector in k-dimensional space, and cluster the rows using the k-means algorithm.
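As a quick, hypothetical numeric illustration of step (b), using the toy W above:

% Degree matrix: d_i = sum of the weights incident to vertex i
D = diag(sum(W, 2));
% Unnormalized graph Laplacian
L = D - W;
% For vertex 1 above, the degree is 0.8 + 0.9 = 1.7, so L(1,1) = 1.7,
% L(1,2) = -0.8, L(1,3) = -0.9, and the remaining entries of row 1 are 0.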

2. Analysis of the Algorithm's Principle

This section explains how the four steps of unnormalized spectral clustering are derived; it does not go into the detailed derivation of each formula.


(a) The idea of spectral clustering is to transform clustering into a graph-partitioning problem, so the first step is to convert the original data into a graph. Two problems must be solved in this conversion:

one is how to define the weight of the edge between two vertices, and the other is which edges to keep.
For the first problem, if two points are similar to some degree, an edge is added between them, and the degree of similarity is represented by the weight of the edge (the values on the edges in the figure above are the weights). Any formula that computes a similarity can be used, but the Gaussian similarity function is the most common choice.
For the second problem, the reason for keeping only some of the edges is that a complete graph has too many edges to handle, and edges with very low weight are superfluous. A common way to prune edges is to build a k-nearest-neighbor graph, in which each vertex is connected only to the k vertices most similar to it, as sketched below.
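Below is a minimal MATLAB sketch of this construction, assuming the data points are the rows of a matrix X. The bandwidth sigma and the neighbor count are illustrative choices, not values from the original post:

% Gaussian similarity: s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2*sigma^2))
sigma = 1.0;                          % illustrative bandwidth
dist2 = pdist2(X, X).^2;              % squared pairwise distances (Statistics Toolbox)
S = exp(-dist2 / (2 * sigma^2));

% k-nearest-neighbor graph: keep each vertex's k strongest edges
knn = 10;                             % illustrative neighbor count
n = size(S, 1);
W = zeros(n);
[~, idx] = sort(S, 2, 'descend');
for i = 1:n
    nbrs = idx(i, 2:knn+1);           % skip idx(i,1), which is the point itself
    W(i, nbrs) = S(i, nbrs);
end
W = max(W, W');                       % symmetrize so the graph is undirected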

(b) The unnormalized graph Laplacian matrix (hereafter denoted L) has many very useful properties, which is why the second step computes this matrix. The most important is the following set of standard properties:

1. For every vector \( f \in \mathbb{R}^n \): \( f^T L f = \frac{1}{2}\sum_{i,j=1}^{n} w_{ij}(f_i - f_j)^2 \)
2. L is symmetric and positive semi-definite.
3. The smallest eigenvalue of L is 0, and a corresponding eigenvector is the constant-one vector.
4. L has n non-negative, real-valued eigenvalues \( 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_n \).

This set of properties will play a decisive role in the derivation of the formulas later.
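As a quick numeric check of the first property, reusing the toy W from Section 1 (this check is illustrative, not part of the original post):

% numeric check of property 1 on the toy graph
D = diag(sum(W, 2));
L = D - W;
f = randn(6, 1);                       % any vector works
lhs = f' * L * f;
[I, J] = ndgrid(1:6, 1:6);
rhs = 0.5 * sum(sum(W .* (f(I) - f(J)).^2));
% lhs and rhs agree up to floating-point rounding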

(c) After converting the original problem into a graph, the next task is to decide how to partition it. The graph-partitioning problem is essentially the minimum-cut problem (mincut problem). The minimum-cut problem can be defined as minimizing the following objective function:

\[ \mathrm{cut}(A_1, \dots, A_k) = \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i) \]

where k is the number of groups, \( A_i \) denotes the i-th group, \( \bar{A}_i \) denotes the complement of \( A_i \), and \( W(A, B) \) denotes the sum of the weights of all edges between group A and group B.
The intuitive meaning of this formula: if the graph is divided into k groups, the price paid is the sum of the weights of the edges removed by the partition. Unfortunately, directly minimizing this formula usually leads to bad partitions. For example, with k = 2 classes, this objective often separates a single vertex from the rest of the graph: one vertex forms one class and everything else forms the other. Obviously such a partition is very bad, because we expect each class to have a reasonable size. So the formula is improved; the improved objective (called RatioCut) is as follows:

\[ \mathrm{RatioCut}(A_1, \dots, A_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(A_i, \bar{A}_i)}{|A_i|} \]

where \( |A_i| \) denotes the number of vertices contained in group \( A_i \).
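A small numeric illustration of the size penalty, again reusing the toy W from Section 1 (illustrative values, not from the original post):

% cut weight between a vertex set A (logical mask) and its complement
cutw = @(A) sum(sum(W(A, ~A)));
ratiocut = @(A) cutw(A)/sum(A) + cutw(A)/sum(~A);

balanced  = logical([1 1 1 0 0 0])';   % {1,2,3} vs {4,5,6}
singleton = logical([0 1 0 0 0 0])';   % {2} vs the rest

cutw(balanced)       % 0.1 + 0.2 = 0.3
cutw(singleton)      % 0.8 + 0.7 + 0.1 = 1.6
ratiocut(balanced)   % 0.3/3 + 0.3/3 = 0.2
ratiocut(singleton)  % 1.6/1 + 1.6/5 = 1.92
% The 1/|A_i| factors make RatioCut penalize the singleton split heavily;
% this is what counters mincut's tendency to isolate weakly connected vertices.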
In RatioCut, if a group contains few vertices, its term becomes large. In a minimization problem this acts as a penalty, discouraging groups that are too small. Now, if we could minimize RatioCut, the partitioning would be complete. Unfortunately, this is an NP-hard problem; to solve it in polynomial time, the problem must be transformed. The transformation uses the set of properties mentioned above, and after some derivation we finally arrive at this problem:

\[ \min_{A_1, \dots, A_k} \operatorname{Tr}(H^T L H) \quad \text{subject to} \quad H^T H = I \]

where H is an \( n \times k \) matrix whose elements are defined as follows:

\[ h_{i,j} = \begin{cases} 1/\sqrt{|A_j|} & \text{if } v_i \in A_j \\ 0 & \text{otherwise} \end{cases} \qquad i = 1,\dots,n;\; j = 1,\dots,k \]
If element \( h_{i,j} \) of the H matrix is nonzero, then the i-th point belongs to the j-th class. In other words, once we have the H matrix, we know how to partition the graph. Unfortunately, the problem is still NP-hard, because the entries of H are constrained to the discrete form above. However, if we allow the elements of H to take arbitrary real values, the problem becomes solvable in polynomial time. The relaxed problem is:

\[ \min_{H \in \mathbb{R}^{n \times k}} \operatorname{Tr}(H^T L H) \quad \text{subject to} \quad H^T H = I \]
According to the Rayleigh-Ritz theorem, the solution of this problem is the matrix H whose columns are the eigenvectors corresponding to the k smallest eigenvalues of L; that is, each column of H is an eigenvector.

(d) In the third step, in order to relax the NP-hard problem, we allowed the H matrix to take arbitrary values. As a result, the solution no longer has its original property that the value of an element indicates which class a point belongs to. Still, it is easy to treat each row of the H matrix as a point and cluster the rows with k-means. Therefore, the k-means result on the rows of H is used as the final result of spectral clustering.

3. Implementation of Spectral Clustering

The following is a MATLAB implementation of unnormalized spectral clustering (the blog platform's code formatter surprisingly has no MATLAB option ... so C++ was selected here):

function [C, L, D, Q, V] = SpectralClustering(W, k)
% Spectral clustering algorithm
% Input:  adjacency matrix W; number of clusters k
% Return: cluster indicator vectors as columns in C; unnormalized Laplacian L;
%         degree matrix D; eigenvector matrix Q; eigenvalue matrix V

% calculate degree matrix
degs = sum(W, 2);
D = sparse(1:size(W, 1), 1:size(W, 2), degs);

% compute unnormalized Laplacian
L = D - W;

% compute the eigenvectors corresponding to the k smallest eigenvalues
% diagonal matrix V holds the k smallest eigenvalues of L;
% the columns of matrix Q are the corresponding eigenvectors
[Q, V] = eigs(L, k, 'SA');

% use the k-means algorithm to cluster the rows of Q
% C will be an n-by-1 vector containing the cluster number of each data point
C = kmeans(Q, k);

% convert C to an n-by-k matrix containing the k indicator vectors as columns
C = sparse(1:size(D, 1), C, 1);

end
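To tie the pieces together, here is a hedged usage sketch; the data, sigma, and neighbor count are made up for illustration:

% Two well-separated Gaussian blobs as toy data
rng(0);
X = [randn(20, 2); randn(20, 2) + 6];

% build the k-nearest-neighbor similarity graph as in Section 2(a)
sigma = 1.0; knn = 10;
S = exp(-pdist2(X, X).^2 / (2 * sigma^2));
n = size(S, 1);
W = zeros(n);
[~, idx] = sort(S, 2, 'descend');
for i = 1:n
    W(i, idx(i, 2:knn+1)) = S(i, idx(i, 2:knn+1));
end
W = max(W, W');

% run spectral clustering with k = 2
[C, L, D, Q, V] = SpectralClustering(W, 2);
% full(C) is a 40-by-2 indicator matrix; rows 1-20 should share one column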
