Talk about Clustering (4): Spectral Clustering (reprint)


Reposted from http://blog.pluskid.org/?p=287

If K-means and GMM are the popular clustering algorithms of the old school, then the spectral clustering discussed this time can be regarded as a popular modern algorithm; in Chinese it is usually called 谱聚类. Because of small differences in the matrix used, spectral clustering is actually a "family" of algorithms rather than a single one.

Spectral clustering has many advantages over traditional clustering methods (such as K-means):

  • Like K-medoids, spectral clustering only needs a similarity matrix between the data points; it does not require the data to be vectors in an n-dimensional Euclidean space, as K-means does.
  • Because it captures the principal structure and ignores secondary details, it is more robust than traditional clustering algorithms: it is less sensitive to irregular, noisy data and performs better. Many experiments bear this out; in fact, K-means usually appears only as a baseline in comparisons of modern clustering algorithms.
  • Its computational complexity is lower than that of K-means, especially when running on very high-dimensional data such as text data or plain image data.

An algorithm that suddenly shows up requiring less than K-means, producing better results than K-means, and running faster than K-means really does sound too good to be true. So, mule or horse, let us take it out for a walk and see. However, in the K-medoids post I actually ran the K-medoids algorithm, and the final result was just an accuracy figure; a single number does not make for an interesting chart, and K-means was too slow to run, so this time I will be a little lazy and simply quote results from someone else's paper.

The results come from the paper Document Clustering Using Locality Preserving Indexing. That paper is actually about another clustering method (which I may talk about some other time if there is a chance), but its experiments also include data for K-means and spectral clustering, excerpted as follows:

k     TDT2               Reuters-21578
      K-means    SC      K-means    SC
2     0.989      0.998   0.871      0.923
3     0.974      0.996   0.775      0.816
4     0.959      0.996   0.732      0.793
...
9     0.852      0.984   0.553      0.625
10    0.835      0.979   0.545      0.615

TDT2 and Reuters-21578 are two widely used standard text datasets. Although results obtained on different datasets cannot be compared directly, the results of K-means and SC (spectral clustering) on the same dataset offer a clear comparison. In this experiment, subsets containing different numbers of classes (from 2 to 10) were extracted from the two datasets and clustered; the resulting accuracies are listed in the table (not all rows are shown). Spectral clustering beats K-means across the board.

Such a powerful algorithm, dressed up with an inscrutable name like "spectral clustering": if its model is not unimaginably complex and all-encompassing, then surely it must be some sect's treasured, never-to-be-revealed secret, right? In fact, that is not the case at all. Spectral clustering is not complicated, either as a model or as an implementation. All you need is the ability to compute eigenvalues and eigenvectors of a matrix, a very basic operation that any library claiming to support linear algebra ought to provide. As for the "secret manual" of spectral clustering, it is sold on every street corner: grab a copy from any stall, open it, and you can see the whole algorithm laid out:

  1. Construct a graph from the data: each node of the graph corresponds to a data point, similar points are connected, and the edge weights represent the similarity between the data points. Write this graph down in adjacency-matrix form and denote it W. One lazy way is simply to reuse the similarity matrix we used for K-medoids (one way to build W is sketched right after this list).
  2. Add up the elements in each column of W to obtain N numbers and place them on the diagonal of a matrix (all other entries are zero) to form an N×N matrix, denoted D. Let L = D - W.
  3. Compute the first k eigenvalues of L (in this article, unless otherwise stated, "first k" means the k smallest) and the corresponding eigenvectors.
  4. Arrange these k eigen(column) vectors side by side to form an N×k matrix, regard each row as a vector in k-dimensional space, and cluster the rows with the K-means algorithm. The cluster that each row is assigned to is the cluster of the corresponding node in the original graph, that is, of the original data point.
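As a concrete illustration of step 1, here is a minimal MATLAB sketch of building the affinity matrix W from a data matrix. It is my own illustration, not from the original post: it assumes a Gaussian similarity with bandwidth sigma and a threshold epsilon below which points are treated as unconnected, and the function name and parameters are just placeholders.

function W = build_affinity(X, sigma, epsilon)
    % X: N-by-d data matrix, one data point per row
    dist2 = squareform(pdist(X)).^2;      % pairwise squared Euclidean distances
    W = exp(-dist2 / (2 * sigma^2));      % Gaussian similarity
    W(W < epsilon) = 0;                   % keep only "local" connections
    W(logical(eye(size(W)))) = 0;         % no self-loops
end

The resulting W can be fed directly into the spectral_clustering function given later in this post.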

And that is the whole algorithm: we apply some odd-looking transformations to the data and then quietly call K-means behind the scenes. With just this much you could already take it out on the street and show off. But if you still feel it is not quite trustworthy, read on and let us talk about the reasoning behind those "odd transformations" in spectral clustering.

In fact, if you are familiar with dimensionality reduction, you have probably already noticed that spectral clustering is simply a process of reducing the dimensionality via Laplacian Eigenmaps and then running K-means, which suddenly sounds much more mundane. But why reduce to exactly k dimensions? The whole model can actually be derived from another angle, so let us first go off on a small tangent.

In image processing (did I not say before that I detest this field?) one problem is image segmentation, that is, partitioning an image into regions of similar pixels. For example, in a photo we usually want the person (the foreground) and the background to end up in different regions. The image-processing field already has many automatic or semi-automatic algorithms for this problem, and many of them are closely related to clustering. For instance, when talking about Vector Quantization we used K-means to cluster pixels of similar colors. That is not real segmentation, however, because if we only consider color similarity, pixels that are far apart in the image may be grouped into the same class, and we do not usually call a set of scattered, disconnected pixels a "region". This particular problem is easy to fix: just add position information to the features used for clustering (for example, a pixel originally represented by its R, G, and B values gets two extra coordinates X and Y).
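For the record, the color-plus-position variant could look like the following sketch. This is my own illustration, not from the original post; the file name 'photo.jpg' and the choice of 4 regions are arbitrary placeholders, and im2double/kmeans come from the usual MATLAB toolboxes.

img = im2double(imread('photo.jpg'));          % hypothetical input image
[h, w, ~] = size(img);
[X, Y] = meshgrid(1:w, 1:h);                   % pixel coordinates
feats = [reshape(img, [], 3), X(:)/w, Y(:)/h]; % R, G, B plus normalized x, y
idx = kmeans(feats, 4);                        % cluster pixels into 4 regions
labels = reshape(idx, h, w);                   % label map over the image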

On the other hand, another frequently studied problem is graph cut. Simply put, it means cutting some edges of a graph so that it splits into several independent connected subgraphs; the sum of the weights of the cut edges is called the cut value. If we build a graph whose nodes are all the pixels of an image, connect nodes whose colors and positions are similar, and let the edge weights represent similarity, then the problem of dividing the image into several regions is equivalent to the problem of dividing the graph into several subgraphs, and we can require the cut value of the split to be as small as possible, that is, the total weight of the cut edges should be minimal. Intuitively, this means edges with large weights are not cut, so similar vertices stay in the same subgraph while vertices that have little to do with each other are separated, which is what we would consider a good segmentation.

Setting the image-segmentation question aside, minimum cut itself is a widely studied member of the family of graph-cut problems, and there are mature algorithms for solving it. However, the plain minimum cut is not particularly useful here: in many cases it simply cuts off the pixel that is most weakly connected to the rest. What we usually want instead is for the regions to be of relatively even size, rather than a few huge blocks plus some nearly isolated points. For this reason a number of alternative objectives have been proposed, such as RatioCut and NormalizedCut.

Before continuing the discussion, though, let us pin down the notation, since it is hard to state things precisely in words alone. The graph is represented by its adjacency matrix W, where w_{ij} is the weight of the edge between node i and node j; if two nodes are not connected, the weight is zero. For two disjoint subsets A and B of the graph's nodes, the cut between them can be formally defined as

\text{cut}(A, B) = \sum_{i \in A, j \in B} w_{ij}

First consider the simplest case: splitting the graph into two parts. Minimum cut then asks to minimize \text{cut}(A, \bar{A}) (where \bar{A} denotes the complement of A). But since this often ends up splitting off isolated nodes, people introduced RatioCut,

\text{RatioCut}(A, \bar{A}) = \frac{\text{cut}(A, \bar{A})}{|A|} + \frac{\text{cut}(A, \bar{A})}{|\bar{A}|}

and NormalizedCut:

\text{NCut}(A, \bar{A}) = \frac{\text{cut}(A, \bar{A})}{\text{vol}(A)} + \frac{\text{cut}(A, \bar{A})}{\text{vol}(\bar{A})}

Here |A| denotes the number of nodes in A, while \text{vol}(A) = \sum_{i \in A} d_i, where d_i = \sum_j w_{ij} is the degree of node i (the i-th diagonal entry of the matrix D defined earlier). Both can be regarded as measures of the "size" of A; by placing such a term in the denominator we effectively prevent isolated points from being split off and obtain a relatively balanced partition. In fact, Jianbo Shi's PAMI paper Normalized Cuts and Image Segmentation uses exactly NormalizedCut for image segmentation.
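To make the definitions concrete, here is a small MATLAB sketch of my own (with an arbitrary toy graph and partition, not taken from the original post) that evaluates cut, RatioCut and NormalizedCut for a given subset A:

W = [0 1 1 0; 1 0 1 0; 1 1 0 0.1; 0 0 0.1 0];    % toy weighted graph on 4 nodes
d = sum(W, 2);                                   % node degrees
A = [1 2 3]; Abar = 4;                           % candidate partition
cutA  = sum(sum(W(A, Abar)));                    % cut(A, Abar)
ratio = cutA / numel(A) + cutA / numel(Abar);    % RatioCut(A, Abar)
ncut  = cutA / sum(d(A)) + cutA / sum(d(Abar));  % NCut(A, Abar)

Note that splitting off the weakly connected node 4 gives a tiny cut value (0.1), yet the second term of NCut is exactly 1 because all of node 4's volume lies on the cut; that is how the denominators penalize splitting off low-volume pieces.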

I bring up RatioCut and NormalizedCut because they are in fact closely related to spectral clustering. Simple as the formulas look, minimizing them is NP-hard. To find a way toward a solution, let us first transform the problem a little, taking RatioCut as the example.

Let V denote the set of all nodes of the graph, and first define a |V|-dimensional vector f by

f_i = \sqrt{|\bar{A}| / |A|}    if v_i \in A
f_i = -\sqrt{|A| / |\bar{A}|}   if v_i \in \bar{A}

Now recall the matrix L = D - W that we defined at the beginning. It actually has a name: the graph Laplacian. As we can see, though, there are several similar matrices that all go by this name:

Usually, every author just calls "his" matrix the graph Laplacian.

That is understandable, I suppose; it is a bit like how every vendor nowadays claims that its technology is "cloud computing". This L has the following property:

f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2

This holds for any vector f and is easy to prove: simply expand both sides according to the definitions. Plugging in the particular f we just defined, we get

f^T L f = |V| \cdot \text{RatioCut}(A, \bar{A})

In addition, if we let \mathbf{1} denote the vector whose elements are all 1, a direct expansion easily shows that f \perp \mathbf{1} (that is, \sum_i f_i = 0) and \|f\|^2 = |V|. Since |V| is a constant, minimizing RatioCut is therefore equivalent to minimizing f^T L f; of course, remember to keep the extra constraints f \perp \mathbf{1} and \|f\| = \sqrt{|V|}.
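As a quick sanity check on these identities (my own sketch, not part of the original derivation), the following MATLAB snippet builds a small random graph, forms the f defined above for an arbitrary split, and compares f^T L f against |V| times RatioCut:

n = 10;
W = rand(n); W = (W + W') / 2; W(logical(eye(n))) = 0;  % random symmetric weights
D = diag(sum(W)); L = D - W;                            % degree matrix and Laplacian
A = 1:4; Abar = 5:n;                                    % an arbitrary two-way split
f = zeros(n, 1);
f(A)    =  sqrt(numel(Abar) / numel(A));
f(Abar) = -sqrt(numel(A) / numel(Abar));
cutA = sum(sum(W(A, Abar)));
ratioCut = cutA / numel(A) + cutA / numel(Abar);
disp([f' * L * f, n * ratioCut, sum(f), f' * f])        % first two agree; sum(f) = 0; ||f||^2 = n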

In this form the problem becomes easy to solve, because there is a thing called the Rayleigh quotient:

R(L, f) = \frac{f^T L f}{f^T f}

Its maximum and minimum values equal the largest and smallest eigenvalues of the matrix, and the extrema are attained exactly when f equals the corresponding eigenvector. Since f^T f = \|f\|^2 = |V| is a constant here, minimizing f^T L f is equivalent to minimizing the Rayleigh quotient. However, the smallest eigenvalue of L is zero, and its eigenvector is exactly \mathbf{1} (we only consider connected graphs here), which is ruled out by the constraint f \perp \mathbf{1}; therefore we take the second-smallest eigenvalue and its corresponding eigenvector instead.

At this point we seem to have solved the previously NP-hard problem with ease, but in fact we have played a trick. The earlier problem is NP-hard because each element of the vector f may take only one of two values, making it a discrete problem, whereas the elements of the eigenvector we compute can be arbitrary real numbers; in other words, we have relaxed the constraints of the original problem. How do we then recover a solution to the original problem? The simplest way is to check whether each element is greater than or less than zero and map it back to the corresponding discrete value, but we can also take a slightly more elaborate approach and use K-means with k = 2 to cluster the elements into two groups.

By now the outline of spectral clustering is already visible: compute eigenvalues, then run K-means on the eigenvectors. In fact, going from the two-class problem to the k-class problem (I will not spell out the mathematical derivation), we obtain exactly the same steps as the spectral clustering algorithm above: compute the eigenvalues, take the k smallest, arrange the corresponding eigenvectors side by side, and run K-means on the rows. Exactly the same!
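To illustrate the two-class case just described (again my own sketch with made-up data, not from the original post), here is how the sign of the second-smallest eigenvector of L recovers the partition of a graph with two obvious groups:

B = ones(5) - eye(5);                              % two dense blocks of 5 nodes each
W = blkdiag(B, B); W(5, 6) = 0.1; W(6, 5) = 0.1;   % one weak edge between the blocks
L = diag(sum(W)) - W;                              % unnormalized graph Laplacian
[V, E] = eig(L);
[~, order] = sort(diag(E));                        % sort eigenvalues ascending
f = V(:, order(2));                                % second-smallest eigenvector
labels = (f > 0) + 1;                              % sign of f separates the two groups

Replacing the sign test with kmeans(f, 2) gives the slightly more elaborate recovery mentioned above.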

Using a similar derivation, NormalizedCut can also be shown to be equivalent to a (slightly different) version of spectral clustering, but I will not go into that this time. For more about graph Laplacians and spectral clustering in general, I recommend this tutorial: A Tutorial on Spectral Clustering.

To lighten the mood, I have decided to paste a simple MATLAB implementation of spectral clustering:

function idx = spectral_clustering(W, k)
    D = diag(sum(W));                             % degree matrix
    L = D - W;                                    % graph Laplacian
    opt = struct('issym', true, 'isreal', true);
    [V dummy] = eigs(L, D, k, 'SM', opt);         % k smallest (generalized) eigenvectors
    idx = kmeans(V, k);                           % cluster the rows of V
end
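(A side note of my own, not from the original post: as far as I can tell, the call eigs(L, D, k, 'SM') computes the k smallest generalized eigenvalues of L v = lambda D v, which matches the NormalizedCut-style relaxation mentioned above, whereas the plain RatioCut derivation in this post would use the ordinary eigenvectors of L itself.)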

Finally, let us revisit the advantages of spectral clustering listed at the beginning of this article:

  • It only needs the similarity matrix of the data. This is obvious, since all the information spectral clustering needs is contained in W. Note, however, that W is not always equal to the original similarity matrix; recall that W is the adjacency matrix of the graph we constructed. When building the graph we usually emphasize "local" connectivity to make the clustering easier, that is, we mainly connect similar points. For example, we can set a threshold and treat two points as unconnected if their similarity falls below it; another way to construct the graph is to connect each node to the n points most similar to it (a small sketch of this construction appears after this list).
  • Its performance is better than traditional K-means. In essence, spectral clustering re-represents the original data by the elements of the eigenvectors and runs K-means on this "better representation". This better representation is in fact the result of dimensionality reduction with Laplacian Eigenmaps; if there is a chance, we will discuss dimensionality reduction in detail another time. The purpose of dimensionality reduction is precisely to "capture the principal structure and ignore the secondary details".
  • Its computational cost is lower than that of K-means. This is especially evident on high-dimensional data. Text data, for instance, is usually stored as a very high-dimensional (say, thousands or tens of thousands of dimensions) sparse matrix. Finding eigenvalues and eigenvectors of a sparse matrix is very efficient, and the result is a set of k-dimensional vectors (k is usually not large), so the K-means step on this low-dimensional data is cheap. If K-means were run directly on the raw data, then even though the input is sparse, K-means must compute centroids by averaging, and the average of many sparse vectors is not necessarily sparse; in fact, for text data the centroid vectors are usually quite dense. Computing distances between vectors then becomes very expensive, which makes K-means extremely slow, so much so that spectral clustering, with all its extra steps, is still much faster.
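The "connect each node to the n points most similar to it" construction from the first bullet could look like the following sketch. This is my own illustration, not from the original post; knn_graph, S and nn are made-up names, and S is assumed to be a precomputed similarity matrix.

function W = knn_graph(S, nn)
    N = size(S, 1);
    W = zeros(N);
    for i = 1:N
        [~, order] = sort(S(i, :), 'descend');    % most similar points first
        neighbors = order(order ~= i);            % exclude the point itself
        W(i, neighbors(1:nn)) = S(i, neighbors(1:nn));
    end
    W = max(W, W');                               % keep an edge if either side chose it
end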

Having said all this, it may seem a bit rambling, but I will stop here. To finish, let me just add that the name spectral clustering comes from spectral theory, the theory of analyzing problems through eigendecomposition.
