Reprint: http://blog.csdn.net/v_july_v/article/details/40738211
0 Introduction
On the morning of November 1, in the 7th session of the machine learning class, Shambo lectured on clustering (PPT). The spectral clustering part aroused my interest: he started from the most basic concepts (unit vectors, orthogonal vectors, matrix eigenvalues and eigenvectors, the similarity graph, the Laplacian matrix) and finally arrived at the objective function of spectral clustering and its algorithm flow.
After the class I pondered clustering and the Laplacian matrix again, and decided to write a blog post to record what I learned. If anything is lacking or you have suggestions, please feel free to point them out. Thanks.
1 Matrix Basics
Before getting into spectral clustering, it is necessary to understand some basic facts about matrices.
1.0 Twelve math notes on understanding matrices
If your notion of a matrix has grown blurry, I recommend the "Understanding the Matrix" series written by Meng Yan, which throws out a lot of interesting ideas. I took some notes while reading it earlier, as follows:
"1, in short: The matrix is the description of the transformation in the linear space, the similarity matrix is a different description of the same linear transformation." So, what is space? Essentially, "Space is a collection of objects that accommodate motion, whereas transformations specify the motion of the corresponding space" by Meng Yan. When a base is selected in a linear space, the vector depicts the motion of the object, and the motion is applied by multiplying the matrix with the vector. But what is a base? coordinate system also.
2. With the notion of a basis in hand, point (1) should read: a matrix is a description of a transformation in a linear space, and similar matrices are different descriptions of the same linear transformation under different bases (coordinate systems). Two questions remain: what exactly is a transformation, and how should we understand "under different bases (coordinate systems)"? In fact, a so-called transformation is a transition in space from one point (element/object) to another, and a matrix is used to describe a linear transformation. What about the basis? From the above, a matrix is nothing but a description of a linear transformation in a linear space: the linear transformation is the noun, and the matrix is an adjective describing it. Just as the same good-looking person can be described with several different adjectives ("handsome", "fine-looking"), the same linear transformation can be described by several different matrices, and which matrix describes it is determined by the basis (coordinate system).
3. The basis mentioned above is the coordinate system; a vivid way to put it is "angle of view". Looking at a question from different angles yields different descriptions, but a description is not the question itself. Likewise, for a linear transformation we may choose one basis and obtain one matrix to describe it, or choose another basis and obtain a different matrix. The matrix merely describes the linear transformation; it is not the linear transformation itself, just as photographing a person from different angles gives different pictures of the same person.
4. We have said that a matrix describes a linear transformation. However, a matrix can be used not only to describe a linear transformation but also to describe a basis (coordinate system/angle of view). The former is easy to understand: the matrix transforms one point of the linear space into another point. But what does it mean to say that a matrix transforms one coordinate system into another? In fact, transforming a point and transforming the coordinate system are the same thing!
(@Karez scarf: Matrices can also be used to describe differential and integral transforms. The key is what the basis represents; a coordinate transformation is a change of basis. If the basis is a wavelet basis or a Fourier basis, the matrix can describe a wavelet transform or a Fourier transform.)
5. A matrix is a description of linear motion (a transformation), and multiplying a matrix by a vector is the process of carrying out that motion (transformation). The same transformation appears as different matrices in different coordinate systems, but its essence/eigenvalues are the same. Motion is relative: transforming the object is equivalent to transforming the coordinate system. For instance, to take a point such as (1, 1) to (2, 3), we can either move the coordinate point, or shrink the unit length of the x-axis to 1/2 of the original and the unit length of the y-axis to 1/3 of the original; either way achieves the goal.
6. Ma = b: the "motion of the coordinate point" reading is that vector a, after the transformation described by matrix M, becomes vector b; the "change of coordinate system" reading is that there is one vector which, measured in coordinate system M, is a, and measured in coordinate system I (I being the identity matrix: 1 on the main diagonal, 0 elsewhere), is b. In essence, moving a point is equivalent to transforming the coordinate system. Why? As described in (5), the same transformation appears as different matrices in different coordinate systems, but in essence it is the same.
7. The I in Ib of (6) is the unit coordinate system, in fact the Cartesian coordinate system we usually speak of. So Ma = Ib says that what is measured as vector a in coordinate system M is measured as vector b in coordinate system I; they are essentially the same vector, so matrix-vector multiplication amounts to an act of identification. Wait, what is a vector? Measure an object in a coordinate system and arrange the measurement results (the projections of the vector onto each axis) in order: that is a vector.
8. b measured in coordinate system I is Ib, and a measured in coordinate system M is Ma. So a matrix product MN is nothing but N measured in coordinate system M, while M itself is measured in coordinate system I. Thus Ma = Ib means: take the quantity that is a in coordinate system M, measure it in coordinate system I, and it turns out to be b. For example, the same quantity that is the vector (x, y) in a Cartesian coordinate system with unit length 1 gets different coordinates when measured in a coordinate system whose x-axis unit length is 2 and whose y-axis unit length is 3.
9. What is an inverse matrix? Ma = Ib: we understood above that moving the point a → b is equivalent to transforming the coordinate system M → I. But how, concretely, does M become I? The answer is to multiply M by its inverse. In terms of coordinate systems, for example, changing the x-axis unit length to 1/2 of the original and the y-axis unit length to 1/3 of the original, i.e. multiplying by the matrix
$$\begin{pmatrix} 1/2 & 0 \\ 0 & 1/3 \end{pmatrix},$$
turns the coordinate system into the Cartesian coordinate system I. In other words, a transformation is applied to a coordinate system by multiplying the coordinate system by the transformation matrix."
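To make notes 5 through 9 concrete, here is a small NumPy sketch of my own (not part of the original notes), using the point (1, 1) and the diagonal scaling matrix implied by the 1/2 and 1/3 example, together with its inverse:

```python
import numpy as np

# The transformation from note 5: scale x by 2 and y by 3.
M = np.diag([2.0, 3.0])

a = np.array([1.0, 1.0])
b = M @ a                      # "moving the point": (1, 1) becomes (2, 3)
print(b)                       # [2. 3.]

# Note 9: multiplying M by its inverse turns coordinate system M into I.
M_inv = np.linalg.inv(M)       # diag(1/2, 1/3): x unit -> 1/2, y unit -> 1/3
print(M_inv @ M)               # the identity matrix I
print(M_inv @ b)               # recovers a: the same vector measured in coordinate system M
```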
1.1 A bunch of basic concepts
According to Wikipedia, the identity matrix of order n is the square matrix whose main diagonal elements are all 1 and whose remaining elements are all 0. The identity matrix is written $I_n$; if the order is immaterial or can be determined from context, it can be abbreviated to $I$ (or $E$). For example, the first few identity matrices are:
$$I_1 = \begin{pmatrix} 1 \end{pmatrix},\quad I_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},\quad I_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
The columns of the identity matrix are the standard unit vectors. These unit vectors are also eigenvectors of the identity matrix, all with eigenvalue 1, so 1 is its only eigenvalue and has multiplicity n. It follows that the determinant of the identity matrix is 1 and its trace is n.
What is a unit vector? Mathematically, a unit vector in a normed vector space is a vector of length 1. In Euclidean space, the dot product of two unit vectors is simply the cosine of the angle between them (because both lengths are 1).
Normalizing a non-zero vector $x$ yields the unit vector parallel to it, written $\hat{x} = \frac{x}{\|x\|}$, where $\|x\|$ is the norm (length) of $x$.
What is the dot product? The dot product, also called the inner product, of two vectors $a = [a_1, a_2, \ldots, a_n]$ and $b = [b_1, b_2, \ldots, b_n]$ is defined as
$$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n,$$
where $\Sigma$ denotes the summation symbol.
For example, the dot product of the two three-dimensional vectors [1, 3, -5] and [4, -2, -1] is
$$[1, 3, -5] \cdot [4, -2, -1] = 1 \times 4 + 3 \times (-2) + (-5) \times (-1) = 4 - 6 + 5 = 3.$$
Using matrix multiplication and treating (column) vectors as $n \times 1$ matrices, the dot product can also be written as $a \cdot b = a^{T} b$, where $a^{T}$ denotes the transpose of $a$. With the example above, multiplying a $1 \times 3$ matrix (a row vector) by a $3 \times 1$ matrix (a column vector) gives the result (a $1 \times 1$ matrix, i.e. a scalar, by the rules of matrix multiplication):
$$\begin{pmatrix} 1 & 3 & -5 \end{pmatrix} \begin{pmatrix} 4 \\ -2 \\ -1 \end{pmatrix} = 3.$$
Besides the algebraic definition above, the dot product has a second, geometric definition. In Euclidean space, the dot product can be intuitively defined as
$$a \cdot b = \|a\|\,\|b\|\cos\theta,$$
where $\|\cdot\|$ denotes the modulus (length) of a vector and $\theta$ is the angle between the two vectors. By this definition, the dot product of two mutually perpendicular vectors is always zero. If $a$ and $b$ are both unit vectors (length 1), their dot product is the cosine of the angle between them.
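As a quick numerical check of the two definitions of the dot product (a sketch I added, using the [1, 3, -5] and [4, -2, -1] example from the text):

```python
import numpy as np

a = np.array([1.0, 3.0, -5.0])
b = np.array([4.0, -2.0, -1.0])

# Algebraic definition: sum of element-wise products.
print(np.dot(a, b))            # 3.0
print(a @ b)                   # the same thing, written as the matrix product a^T b

# Geometric definition: |a| |b| cos(theta).
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.linalg.norm(a) * np.linalg.norm(b) * cos_theta)   # 3.0 again
```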
Orthogonality is the generalization of the intuitive notion of perpendicularity. Two vectors of an inner product space are called orthogonal if their inner product (i.e. dot product) is 0, which is equivalent to the two vectors being perpendicular; in other words, wherever the angle between vectors can be defined, orthogonality can be understood intuitively as perpendicularity. An orthogonal matrix is a square matrix (a matrix with equally many rows and columns) whose entries are real and whose rows and columns are mutually orthogonal unit vectors.
If a number $\lambda$ and a non-zero vector $v$ satisfy $A v = \lambda v$, then $v$ is an eigenvector of $A$ and $\lambda$ is the corresponding eigenvalue. In other words, what $A$ does to $v$ is simply to stretch or shrink it along its own direction (rather than apply some irregular multi-dimensional transformation), and $\lambda$ is the factor by which it stretches in that direction. In short, the transformation makes the eigenvector longer or shorter, but its direction is unchanged.
The trace of a matrix is the sum of its diagonal elements, which also equals the sum of its eigenvalues.
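A minimal sketch of my own illustrating the eigenvalue definition $Av = \lambda v$ and the fact that the trace equals the sum of the eigenvalues, on an arbitrary symmetric matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # an arbitrary symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)     # eigh: for symmetric (Hermitian) matrices
v = eigvecs[:, 0]                        # one eigenvector
lam = eigvals[0]                         # its eigenvalue

print(np.allclose(A @ v, lam * v))       # True: A only stretches v, it does not change its direction
print(np.isclose(np.trace(A), eigvals.sum()))   # True: trace = sum of eigenvalues
```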
More matrix-related concepts can be found in the relevant Wikipedia entries or in a textbook on matrix analysis and applications.
2 The Laplacian Matrix
2.1 Definition of the Laplacian matrix
The Laplacian matrix, also called the Kirchhoff matrix, is a matrix representation of a graph. Given a graph $G$ with $n$ vertices, its Laplacian matrix is defined as
$$L = D - W,$$
where $D$ is the degree matrix of the graph and $W$ is the adjacency (weight) matrix of the graph.
Let me give an example. Consider the following simple graph:
Converting this graph into the form of an adjacency matrix, denoted $W$, gives:
Adding up the elements of each column gives $n$ numbers; putting them on the diagonal (with 0 everywhere else) forms an $n \times n$ diagonal matrix, called the degree matrix $D$:
According to the definition of the Laplacian matrix, we obtain the Laplacian matrix $L = D - W$:
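Since the figures for this example are not reproduced here, the following sketch builds the three matrices for a small hypothetical undirected graph of my own choosing (4 vertices, unit edge weights), just to show how $W$, $D$ and $L = D - W$ fit together:

```python
import numpy as np

# A hypothetical 4-vertex undirected graph with edges (0,1), (1,2), (2,3), (3,0), (0,2).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4

W = np.zeros((n, n))                 # adjacency (weight) matrix
for i, j in edges:
    W[i, j] = W[j, i] = 1.0          # unit weights, symmetric

D = np.diag(W.sum(axis=0))           # degree matrix: column sums on the diagonal
L = D - W                            # Laplacian matrix

print(W)
print(D)
print(L)
```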
2.2 Properties of the Laplacian matrix
Before introducing the properties of the Laplacian matrix, let us first define two concepts:
① For the adjacency matrix $W$, the sum of the weights of all edges between two subgraphs A and B of the graph is defined as
$$W(A, B) = \sum_{i \in A,\, j \in B} w_{ij},$$
where $w_{ij}$ is the weight between node $i$ and node $j$; if two nodes are not connected, the weight is zero.
② The degree $d_i$ of a vertex is defined as the sum of the weights of all edges adjacent to it, $d_i = \sum_{j=1}^{n} w_{ij}$; the degrees $d_i$ together form the degree matrix $D$ (a diagonal matrix).
The Laplacian matrix $L$ has the following properties:
- $L$ is a symmetric, positive semi-definite matrix;
- The smallest eigenvalue of $L$ is 0, and the corresponding eigenvector is the constant vector $\mathbb{1}$ whose entries are all 1. Proof: $L \cdot \mathbb{1} = (D - W) \cdot \mathbb{1} = D\mathbb{1} - W\mathbb{1} = 0 = 0 \cdot \mathbb{1}$. (Also, do not forget the definition of eigenvalues and eigenvectors: if a number $\lambda$ and a non-zero vector $v$ satisfy $L v = \lambda v$, then $v$ is an eigenvector and $\lambda$ is the corresponding eigenvalue.)
- $L$ has $n$ non-negative real eigenvalues $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$;
- For any real vector $f \in \mathbb{R}^n$, the following identity holds:
$$f^{T} L f = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} (f_i - f_j)^2,$$
where $L = D - W$, $D = \mathrm{diag}(d_1, \ldots, d_n)$ and $d_i = \sum_{j=1}^{n} w_{ij}$.
Below we prove this last conclusion (a numerical check follows after the proof):
$$f^{T} L f = f^{T} D f - f^{T} W f = \sum_{i=1}^{n} d_i f_i^2 - \sum_{i,j=1}^{n} w_{ij} f_i f_j = \frac{1}{2} \left( \sum_{i=1}^{n} d_i f_i^2 - 2 \sum_{i,j=1}^{n} w_{ij} f_i f_j + \sum_{j=1}^{n} d_j f_j^2 \right) = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2.$$
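Here is the numerical check promised above (a sketch using the same hypothetical 4-vertex graph as in the earlier snippet, not taken from the original post):

```python
import numpy as np

# The same hypothetical graph as in the earlier sketch.
W = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=0))
L = D - W

eigvals = np.linalg.eigvalsh(L)
print(np.isclose(eigvals.min(), 0.0))      # the smallest eigenvalue is 0
print(np.all(eigvals >= -1e-10))           # all eigenvalues are non-negative (L is PSD)
print(np.allclose(L @ np.ones(4), 0.0))    # the all-ones vector is the corresponding eigenvector

# f^T L f = 1/2 * sum_ij w_ij (f_i - f_j)^2 for an arbitrary real vector f.
f = np.random.randn(4)
lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2 for i in range(4) for j in range(4))
print(np.isclose(lhs, rhs))                # True
```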
3 Spectral Clustering
So-called clustering is to divide a set of samples reasonably into two (or k) parts. From the perspective of graph theory, the clustering problem is equivalent to a graph partitioning problem. That is, given a graph G = (V, E), where the vertex set V represents the samples and the weighted edges represent the similarity between samples, the goal of spectral clustering is to find a reasonable way of partitioning the graph so that, among the resulting subgraphs, the weights (similarities) of the edges connecting different subgraphs are as low as possible and the weights (similarities) of the edges within the same subgraph are as high as possible. Birds of a feather flock together: similar points stay together, dissimilar points stay apart.
There are several ways to divide/cut the vertex set of a graph into disjoint sub-graphs, such as
- Cut/ratio Cut
- Normalized Cut
- Not based on graph cuts, but instead converting the problem into one that SVD can solve.
The goal is to make the sum of the weights of the cut edges as small as possible, because the smaller the total weight of the cut edges, the lower the similarity between the subgraphs they connect and the farther apart those subgraphs are; it is the subgraphs with low mutual similarity that should be cut apart.
This article focuses on the first method above and briefly mentions the second; the third is not covered here, and interested readers can refer to reference 13 at the end of the article.
3.1 Related definitions
To better cast the spectral clustering problem in graph-theoretic terms, we define the following concepts (some were defined before; consider this a review):
- An undirected graph $G = (V, E)$, where the vertex set $V$ represents the individual samples and the weighted edges represent the similarity between samples;
- The degree $d_i$ of a vertex, defined as the sum of the weights of all edges adjacent to it; the degrees together form the degree matrix $D$ (a diagonal matrix);
- The similarity matrix $W$, given by the weight matrix; in practice the similarity is usually computed with a Gaussian kernel function (also called the radial basis function kernel), e.g. $w_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$, so the greater the distance, the smaller the similarity (a code sketch follows after this list);
- The indicator vector of a subgraph A, $\mathbb{1}_A = (f_1, \ldots, f_n)^{T}$, where $f_i = 1$ if vertex $v_i \in A$ and $f_i = 0$ otherwise.
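Here is the promised sketch of how the similarity matrix might be built with a Gaussian (RBF) kernel; the bandwidth sigma and the placeholder samples are my own choices, not prescribed by the original post:

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Similarity matrix w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) with a zero diagonal."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)        # no self-loops; the larger the distance, the smaller w_ij
    return W

X = np.random.randn(10, 2)           # 10 placeholder samples in 2-D
W = gaussian_similarity(X, sigma=1.0)
print(W.shape)                       # (10, 10), symmetric
```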
3.2 Objective function
Therefore, how to cut the graph becomes the key to the problem. In other words, how should we cut so as to get the best result?
For example, suppose all the pixels of a picture form a graph, with nodes connected when they are similar to each other (for example, in color and position) and edge weights indicating similarity. Now we want to divide the image into several regions (or groups) with the smallest possible cut, which means that the sum of the weights of the cut edges is smallest, while edges with large weights are not cut. Only in this way can the more similar points stay in the same subgraph, while points with little connection to each other are separated.
Let $A_1, A_2, \ldots, A_k$ be subsets of the vertex set of the graph (mutually disjoint). To make the cut value of the partition as small as possible, spectral clustering minimizes the following objective function:
$$\mathrm{cut}(A_1, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A_i}),$$
where $k$ means the graph is divided into $k$ groups, $A_i$ denotes the $i$-th group, $\bar{A_i}$ is its complement, and $W(A_i, \bar{A_i})$ denotes the sum of the weights of all edges between group $A_i$ and the rest of the graph (in other words, when the graph is divided into $k$ groups, the cost of the partition is the sum of the weights of the edges severed by the split).
Minimizing this objective function minimizes the total weight of the severed edges. But in many cases, minimizing cut leads to bad partitions. In the 2-class case, this criterion usually separates a single point from the remaining n-1 points. As the figure shows, the minimum cut is clearly not the best cut; putting {A, B, C, H} on one side and {D, E, F, G} on the other is probably the best cut:
For each class to have a reasonable size, the objective function should try to make the sets $A_1, A_2, \ldots, A_k$ all "big enough". The improved objective function is:
$$\mathrm{RatioCut}(A_1, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{|A_i|},$$
where $|A_i|$ denotes the number of vertices contained in group $A_i$.
Or:
$$\mathrm{NCut}(A_1, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A_i})}{\mathrm{vol}(A_i)},$$
where $\mathrm{vol}(A_i) = \sum_{j \in A_i} d_j$ is the sum of the degrees of the vertices in $A_i$.
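To make these objectives concrete, here are small helper functions of my own (for the two-group case) that compute cut, RatioCut and NCut from a weight matrix W and a subset A of vertex indices:

```python
import numpy as np

def cut_value(W, A):
    """Sum of the weights of the edges between A and its complement."""
    A = np.asarray(sorted(A))
    B = np.setdiff1d(np.arange(W.shape[0]), A)      # complement of A
    return W[np.ix_(A, B)].sum()

def ratio_cut(W, A):
    c = cut_value(W, A)
    n = W.shape[0]
    return c / len(A) + c / (n - len(A))            # cut(A, A_bar) * (1/|A| + 1/|A_bar|)

def n_cut(W, A):
    d = W.sum(axis=1)                               # vertex degrees
    B = np.setdiff1d(np.arange(W.shape[0]), np.asarray(sorted(A)))
    c = cut_value(W, A)
    return c / d[list(A)].sum() + c / d[B].sum()    # cut weighted by vol(A) and vol(A_bar)

# A tiny example: two tightly connected pairs {0,1} and {2,3}, weakly linked across.
W_example = np.array([[0.0, 1.0, 0.1, 0.0],
                      [1.0, 0.0, 0.0, 0.1],
                      [0.1, 0.0, 0.0, 1.0],
                      [0.0, 0.1, 1.0, 0.0]])
print(cut_value(W_example, [0, 1]))                 # 0.2
print(ratio_cut(W_example, [0, 1]))                 # 0.2/2 + 0.2/2 = 0.2
```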
3.3 The equivalence between minimizing RatioCut and minimizing $f^{T} L f$
Now, let's focus on the RatioCut function.
The objective function (here we consider the two-cluster case, dividing V into A and its complement $\bar{A}$):
$$\mathrm{RatioCut}(A, \bar{A}) = \frac{\mathrm{cut}(A, \bar{A})}{|A|} + \frac{\mathrm{cut}(A, \bar{A})}{|\bar{A}|}.$$
Define a vector $f = (f_1, \ldots, f_n)^{T}$ with:
$$f_i = \begin{cases} \sqrt{|\bar{A}| / |A|} & \text{if } v_i \in A, \\ -\sqrt{|A| / |\bar{A}|} & \text{if } v_i \in \bar{A}. \end{cases}$$
Recall the property of the Laplacian matrix obtained earlier: $f^{T} L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$.
Now substitute the definition of $f$ into this formula and we get a very interesting conclusion! The derivation is as follows:
$$\begin{aligned} f^{T} L f &= \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 \\ &= \frac{1}{2} \sum_{i \in A,\, j \in \bar{A}} w_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^2 + \frac{1}{2} \sum_{i \in \bar{A},\, j \in A} w_{ij} \left( -\sqrt{\frac{|A|}{|\bar{A}|}} - \sqrt{\frac{|\bar{A}|}{|A|}} \right)^2 \\ &= \mathrm{cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right) \\ &= \mathrm{cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right) \\ &= |V| \cdot \mathrm{RatioCut}(A, \bar{A}). \end{aligned}$$
Yes, we actually recovered RatioCut from $f^{T} L f$. In other words, the Laplacian matrix $L$ is closely related to the objective function RatioCut that we want to optimize. Further, because $|V| = n$ is a constant, minimizing RatioCut is equivalent to minimizing $f^{T} L f$.
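The identity just derived, $f^{T} L f = |V| \cdot \mathrm{RatioCut}(A, \bar{A})$, can also be checked numerically; the following verification sketch (a random weight matrix and an arbitrary 2-way split, both my own choices) is not from the original post:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)   # random symmetric weights
D = np.diag(W.sum(axis=1))
L = D - W

A = np.array([0, 1, 2])                      # an arbitrary 2-way split: A vs. its complement
B = np.setdiff1d(np.arange(n), A)

# The vector f defined above: sqrt(|A_bar|/|A|) on A, -sqrt(|A|/|A_bar|) on A_bar.
f = np.empty(n)
f[A] = np.sqrt(len(B) / len(A))
f[B] = -np.sqrt(len(A) / len(B))

cut = W[np.ix_(A, B)].sum()
ratio_cut = cut / len(A) + cut / len(B)

print(np.isclose(f @ L @ f, n * ratio_cut))  # True: f'Lf = |V| * RatioCut(A, A_bar)
print(np.isclose(f @ np.ones(n), 0.0))       # f is orthogonal to the all-ones vector
print(np.isclose(f @ f, n))                  # ||f||^2 = n
```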
At the same time, every element of the all-ones vector $\mathbb{1}$ is 1, so expanding $f \cdot \mathbb{1}$ directly gives the constraint $f \perp \mathbb{1}$, and in addition $\|f\| = \sqrt{n}$. The derivation is as follows:
$$\sum_{i=1}^{n} f_i = \sum_{i \in A} \sqrt{\frac{|\bar{A}|}{|A|}} - \sum_{i \in \bar{A}} \sqrt{\frac{|A|}{|\bar{A}|}} = |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = \sqrt{|A|\,|\bar{A}|} - \sqrt{|A|\,|\bar{A}|} = 0,$$
$$f^{T} f = \sum_{i=1}^{n} f_i^2 = |A| \cdot \frac{|\bar{A}|}{|A|} + |\bar{A}| \cdot \frac{|A|}{|\bar{A}|} = |\bar{A}| + |A| = n.$$
Finally, combined with the above, our new objective function can be written as:
$$\min_{f \in \mathbb{R}^n} f^{T} L f \quad \text{subject to } f \perp \mathbb{1},\ \|f\| = \sqrt{n},$$
where $f^{T} f = n$ (note: given that $f$ is a column vector, $f^{T} f$ is a value, a real number, whereas $f f^{T}$ is an $n \times n$ matrix).
Before continuing the derivation, recall once more the definition of eigenvectors and eigenvalues:
- If a number $\lambda$ and a non-zero vector $v$ satisfy $A v = \lambda v$, then $v$ is an eigenvector of $A$ and $\lambda$ is the corresponding eigenvalue.
Suppose $L f = \lambda f$; then $\lambda$ is an eigenvalue and $f$ is the corresponding eigenvector. Left-multiplying both sides by $f^{T}$ gives $f^{T} L f = \lambda f^{T} f$, and since $f^{T} f = n$, where $n$ is the number of vertices in the graph, we get $f^{T} L f = \lambda n$. Because $n$ is a fixed value, minimizing $f^{T} L f$ is equivalent to minimizing $\lambda$. So next we only need to find the smallest eigenvalue of $L$ and its corresponding eigenvector.
But at this crucial last step we run into a tricky problem: from the properties of the Laplacian matrix obtained earlier, "the smallest eigenvalue of $L$ is 0, and the corresponding eigenvector is exactly $\mathbb{1}$", and this eigenvector does not satisfy the constraint $f \perp \mathbb{1}$. So what do we do? According to the Rayleigh-Ritz theory cited in the paper "A Tutorial on Spectral Clustering", we can take the second smallest eigenvalue and its corresponding eigenvector.
Further, because in practice the elements of this eigenvector are continuous, arbitrary real numbers, we map them back to the discrete case according to whether they are greater than or less than 0: if $f_i \ge 0$ we assign vertex $v_i$ to $A$, otherwise to $\bar{A}$. And if we take the first $k$ eigenvectors and run K-means on them to obtain $K$ clusters, the two-cluster problem is extended to the $K$-cluster problem.
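A sketch of the relaxed two-cluster solution described here; the small graph with two obvious groups is a hypothetical example of mine, and the sign-based rounding is the simple rule mentioned in the text:

```python
import numpy as np

# A hypothetical graph with two dense groups {0,1,2} and {3,4,5}, weakly connected.
W = np.array([[0.00, 1.00, 1.00, 0.05, 0.00, 0.00],
              [1.00, 0.00, 1.00, 0.00, 0.00, 0.00],
              [1.00, 1.00, 0.00, 0.00, 0.00, 0.05],
              [0.05, 0.00, 0.00, 0.00, 1.00, 1.00],
              [0.00, 0.00, 0.00, 1.00, 0.00, 1.00],
              [0.00, 0.00, 0.05, 1.00, 1.00, 0.00]])
L = np.diag(W.sum(axis=1)) - W

eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues returned in ascending order
fiedler = eigvecs[:, 1]                # eigenvector of the 2nd smallest eigenvalue

labels = (fiedler >= 0).astype(int)    # round the continuous solution by its sign
print(labels)                          # e.g. [0 0 0 1 1 1] (or with the two labels swapped)
```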
The required $k$ eigenvectors are eigenvectors of the Laplacian matrix: compute the eigenvalues of the Laplacian matrix, sort them from small to large, arrange the corresponding eigenvectors in the same ascending order, and the first $k$ of them are the $k$ eigenvectors we require!
Therefore, the problem is transformed into: find the first $k$ smallest eigenvalues of the Laplacian matrix, and then run K-means on the eigenvectors corresponding to those eigenvalues. The two-class problem is thus easily extended to the $k$-class problem: find the eigenvalues, take the $k$ smallest ones, line up the corresponding eigenvectors, and then run K-means clustering. Two-class classification and multi-class classification are handled in the same way.
In this way, because the discrete solution is very hard to obtain, RatioCut cleverly turns an NP-hard problem into an eigenvalue (eigenvector) problem of the Laplacian matrix: the discrete clustering problem is relaxed into a continuous eigenvector problem, and the eigenvectors corresponding to the smallest few eigenvalues correspond to the best few ways of partitioning the graph. What remains is to re-discretize the relaxed problem, partitioning the eigenvectors once more, which yields the corresponding clusters. I cannot help but call it wonderful!
3.4 Spectral Clustering algorithm process
The complete algorithm flow of spectral clustering is as follows:
- Construct a graph from the data: each node of the graph corresponds to a data point, similar points are connected by edges (and edges between points that are connected but not very similar are then removed by Cut/RatioCut/NCut), and the weight of an edge represents the similarity between the data points. Write this graph in the form of an adjacency matrix, denoted $W$.
- Add up the elements of each column of $W$ to get $n$ numbers and put them on the diagonal (with 0 everywhere else) to form an $n \times n$ diagonal matrix, the degree matrix $D$; the difference $D - W$ gives the Laplacian matrix $L$.
- Compute the first $k$ eigenvalues of $L$ ("first $k$" referring to the eigenvalues sorted from small to large) and their corresponding eigenvectors.
- Arrange these $k$ eigen(column)vectors side by side to form an $n \times k$ matrix, regard each of its rows as a vector in $k$-dimensional space, and cluster the rows with the K-means algorithm. The cluster that each row is assigned to in the clustering result is the cluster of the corresponding node in the original graph, i.e. of the original data point.
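Putting the steps together, here is a compact sketch of the whole pipeline (unnormalized spectral clustering); the Gaussian-kernel similarity, the bandwidth sigma and the use of scikit-learn's KMeans are my own implementation choices, not prescribed by the original post:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Step 1: build the similarity (adjacency) matrix W with a Gaussian kernel.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Step 2: degree matrix D and Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Step 3: eigenvectors of the k smallest eigenvalues (eigh sorts them ascending).
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]                 # n x k matrix; row i is the new representation of point i

    # Step 4: K-means on the rows of U; each row's cluster is the original point's cluster.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

# Usage: two well-separated Gaussian blobs as placeholder data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])
print(spectral_clustering(X, k=2))     # 30 points in one cluster, 30 in the other
```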
Perhaps you have already seen it: the basic idea of spectral clustering is to perform an eigendecomposition of the Laplacian matrix built from the similarity matrix of the sample data (dimensionality reduction via Laplacian eigenmaps), and then run K-means clustering on the resulting eigenvectors.
In addition, compared with traditional clustering methods (such as K-means), spectral clustering only needs a similarity matrix between the data points; it does not require the data to be vectors in an N-dimensional Euclidean space, as K-means does.
4 References and recommended readings
- Meng Yan's "Understanding the Matrix" series: http://blog.csdn.net/myan/article/details/1865397;
- 12-point Math Note for understanding matrices: http://www.51weixue.com/thread-476-1-1.html;
- Various Wikipedia entries, such as the one on eigenvectors: https://zh.wikipedia.org/wiki/%E7%89%B9%E5%BE%81%E5%90%91%E9%87%8F;
- The Wikipedia introduction to the Laplacian matrix: http://en.wikipedia.org/wiki/Laplacian_matrix;
- The clustering PPT from the class: HTTP://PAN.BAIDU.COM/S/1I3GOYJR;
- A very good English-language tutorial on spectral clustering, "A Tutorial on Spectral Clustering": http://engr.case.edu/ray_soumya/mlrg/Luxburg07_Tutorial_spectral_clustering.pdf;
- Two discussions on matrices and eigenvalues: http://www.zhihu.com/question/21082351, http://www.zhihu.com/question/21874816;
- Spectral clustering: http://www.cnblogs.com/fengyan/archive/2012/06/21/2553999.html;
- Spectral clustering algorithm: http://www.cnblogs.com/sparkwen/p/3155850.html;
- Random Talk Clustering series: http://blog.pluskid.org/?page_id=78;
- "Mining of Massive Datasets" 10th chapter: Http://infolab.stanford.edu/~ullman/mmds/book.pdf;
- Tydsh: Spectral Clustering: ① http://blog.sina.com.cn/s/blog_53a8a4710100g2rt.html, ② http://blog.sina.com.cn/s/blog_53a8a4710100g2rv.html, ③ http://blog.sina.com.cn/s/blog_53a8a4710100g2ry.html, ④ http://blog.sina.com.cn/s/blog_53a8a4710100g2rz.html;
- H. Zha, C. Ding, M. Gu, X. He, and H. D. Simon. Spectral relaxation for K-means clustering. Advances in Neural Information Processing Systems (NIPS 2001), pp. 1057-1064, Vancouver, Canada, Dec. 2001;
- A study of spectral clustering methods in machine learning: http://lamda.nju.edu.cn/conf/MLA07/files/YuJ.pdf;
- Algorithm implementation of spectral clustering: http://liuzhiqiangruc.iteye.com/blog/2117144.