Clustering Series: Spectral Clustering


  This is the last post in my clustering series. In this final one, we will talk about spectral clustering.

Spectral clustering is a clustering method based on graph theory. The main idea is to treat all data points as vertices in space that can be connected by edges. The weight of an edge between two points that are far apart (i.e., less similar) is low, while the weight between two points that are close (i.e., more similar) is high. The graph built over all the data points is then cut so that the total weight of the edges between different subgraphs is as low as possible, while the total weight of the edges within each subgraph is as high as possible, thereby achieving the goal of clustering.

Compared with the classic k-means (which is, of course, also a good clustering method; we all use it as a baseline), spectral clustering has several advantages:

1. It only needs the similarity matrix between the points to be clustered in order to do the clustering.

2. It is not so sensitive to irregular data (or outliers). Personally, I feel this is mainly reflected in the graph-cut objective being minimized: in the RatioCut strategy each cut term is divided by the denominator |A|, the number of points in the current cluster, while in the Ncut strategy it is divided by vol(A), the total degree of the points in the current cluster. With these two denominators, if an outlier is split off into its own cluster, the value of either objective function becomes larger, so such cuts are penalized (see the formulas after this list).

3. The k-means algorithm is better suited to convex data sets (a set in which the line segment between any two points stays inside the set; roughly speaking, blob-shaped, though this description may not be precise), while spectral clustering is more general.
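
For reference, the two objectives mentioned in point 2 are usually written as follows (this is the standard formulation; here $\bar{A}_i$ is the complement of cluster $A_i$, $W(A,B)=\sum_{u\in A,v\in B}w_{uv}$ is the total weight of the cut edges, and $\mathrm{vol}(A)=\sum_{u\in A}d_u$):

$$\mathrm{RatioCut}(A_1,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{|A_i|}, \qquad \mathrm{Ncut}(A_1,\dots,A_k)=\frac{1}{2}\sum_{i=1}^{k}\frac{W(A_i,\bar{A}_i)}{\mathrm{vol}(A_i)}$$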

From this idea we can see that the principle of the algorithm is simple, but to fully understand it you need some background in the graph theory of undirected graphs, linear algebra, and matrix analysis. I will not cover that mathematical background here, because there are already many good articles introducing it; I give links in the references at the end, so go have a look.

The specific steps of spectral clustering:

1. Construct the similarity matrix S. When constructing the similarity matrix, you can use Euclidean distance, cosine similarity, Gaussian similarity, and so on to compute the similarity between data points; which one to choose depends on your actual situation. Gaussian similarity is usually recommended for spectral clustering, but I use cosine similarity in my own project.
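
As a minimal sketch of this step (the helper name and the bandwidth sigma are illustrative, not from the original post), here is a vectorized way to build S with the Gaussian similarity:

import numpy as np

def gaussian_similarity_matrix(X, sigma=1.0):
    # X is an (n, d) data matrix; returns the (n, n) matrix S with
    # S[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))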

2. Construct the similarity graph G. There are mainly three methods:

ε-neighborhood graph: in this graph, we connect every pair of points whose distance is below a certain threshold ε.

K-nearest-neighbor graph: in this graph, taking points i and j as an example, if i is one of j's k nearest neighbors, then i and j are connected. Of course, this makes the similarity graph G a directed graph, because i being one of the k nearest neighbors of j does not mean j is one of the k nearest neighbors of i. It is much like our lives: you may be my best friend, but I am not necessarily yours. There are two ways to turn this into an undirected graph: (1) ignore the direction of the edges, that is, connect i and j if i is one of j's k nearest neighbors or j is one of i's k nearest neighbors; (2) connect i and j only if i is one of j's k nearest neighbors and j is one of i's k nearest neighbors. The graph obtained by method (1) is called the k-nearest-neighbor graph, and the graph obtained by method (2) is called the mutual k-nearest-neighbor graph (a code sketch of variant (1) appears below).

Fully connected graph: in this graph, we connect all pairs of points, and the weight of each edge is set to the similarity between its endpoints. The key to this kind of graph is the choice of similarity function, which should reflect the actual neighborhood relations well. A common example is the Gaussian similarity function s(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)). Choosing the parameter σ is the hard part; it plays a role similar to the threshold ε in the ε-neighborhood graph.

By constructing the similarity graph G in one of the ways above, we obtain the adjacency matrix W.
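
Here is a sketch of variant (1) of the k-nearest-neighbor construction (the function name and the value of k are illustrative): given the similarity matrix S, keep each point's k most similar neighbors and symmetrize.

import numpy as np

def knn_graph(S, k=10):
    # S is an (n, n) similarity matrix; W keeps, for each point,
    # only the edges to its k most similar neighbors.
    n = S.shape[0]
    W = np.zeros_like(S)
    for i in range(n):
        idx = np.argsort(S[i])[::-1]   # neighbors sorted by similarity, descending
        idx = idx[idx != i][:k]        # drop the point itself, keep the top k
        W[i, idx] = S[i, idx]
    # variant (1): connect i and j if either one is a k-nearest neighbor of the other
    return np.maximum(W, W.T)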

3. The Laplacian Matrix

Its definition is very simple: the Laplacian matrix is L = D − W. Here D is the degree matrix, the diagonal matrix obtained by summing each row (or, equivalently, each column) of the adjacency matrix, and W is the adjacency matrix of the graph.
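
In code this step is essentially a one-liner (the helper name is mine; W is the adjacency matrix from the previous step):

import numpy as np

def laplacian(W):
    # D[i, i] is the sum of row i of W, i.e., the degree of point i
    D = np.diag(W.sum(axis=1))
    return D - W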

4. Cut the undirected graph (in practice, compute the eigenvalues and eigenvectors of L), then run k-means on the resulting eigenvector matrix.

A lot is involved here: graph-cut theory, how to optimize the cut, how the problem of cutting the graph turns into finding the eigenvalues and eigenvectors of the Laplacian, and so on. I will not introduce it at length here; the references at the end explain it completely.

Algorithm Flow:

Input: sample set D = (x1, x2, ..., xn), the way to generate the similarity matrix, the dimension k1 after dimensionality reduction, the clustering method to use, and the number of clusters k2.

Output: cluster partition C = (c1, c2, ..., ck2).


1) Construct the similarity matrix S of the samples according to the input way of generating the similarity matrix.

2) Construct the adjacency matrix W from the similarity matrix S, and construct the degree matrix D.

3) Compute the Laplacian matrix L.

4) Find the eigenvectors f corresponding to the k1 smallest eigenvalues of L.

5) Stack these eigenvectors into the n x k1 feature matrix F.

6) Treat each row of F as a k1-dimensional sample (n samples in all) and cluster them with the input clustering method into k2 clusters.

7) Obtain the cluster partition C = (c1, c2, ..., ck2).
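
Putting the steps together, here is a compact sketch of this flow (illustrative only: the Gaussian similarity, sigma, and sklearn's KMeans standing in for "the input clustering method" are my assumptions, not the original author's choices):

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, sigma=1.0, k1=3, k2=3):
    # steps 1)-2): fully connected graph with Gaussian similarity as W
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    # steps 2)-3): degree matrix D and Laplacian L
    D = np.diag(W.sum(axis=1))
    L = D - W
    # steps 4)-5): eigenvectors of the k1 smallest eigenvalues form F (n x k1)
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
    F = eigvecs[:, :k1]
    # step 6): cluster the rows of F into k2 clusters
    return KMeans(n_clusters=k2, n_init=10).fit_predict(F)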

Code:

Because the code I wrote myself ran too slowly, I borrowed the spectral clustering implementation from the sklearn package instead. The specific code is as follows:


# coding: utf-8
import numpy as np
from numpy.linalg import norm
from sklearn.cluster import SpectralClustering

# Euclidean distance
def ou_dis(v1, v2):
    return norm(v1 - v2)

# cosine similarity
def cosin_sim(v1, v2):
    product = v1.dot(v2)
    norma = np.sqrt(v1.dot(v1))
    normb = np.sqrt(v2.dot(v2))
    return product / (norma * normb)

# Gaussian similarity
def gaussian_simfunc(v1, v2, sigma=1):
    tee = -norm(v1 - v2) ** 2 / (2 * sigma ** 2)
    return np.exp(tee)

# build the similarity matrix W (cosine similarity is used here;
# the Gaussian function above is an alternative)
def construct_w(vec):
    n = len(vec)
    w = np.zeros((n, n))
    for i in range(n):
        if i % 1000 == 0:
            print('c--' + str(i))
        for j in range(i, n):
            w[i, j] = w[j, i] = cosin_sim(vec[i], vec[j])
    return w

if __name__ == '__main__':
    f = open(r'E:\test_sc\vec_30.txt')      # open the file of points to cluster
    lines = f.readlines()                   # read all the data
    row = len(lines)                        # number of points to cluster
    col = len(lines[0].split()) - 1         # dimension of each point (minus one for the label)
    dict_my = {}                            # key: index, value: word, so the words can be grouped by cluster
    flag = 0                                # counter
    vec = np.zeros((row, col))              # the word vectors
    for line in lines:                      # save each label into dict_my and each vector into vec
        if flag % 1000 == 0:
            print('read--' + str(flag))
        s = line.split()
        dict_my[flag] = s[0]
        del s[0]
        vec[flag] = [float(x) for x in s]
        flag += 1
    f.close()

    w = construct_w(vec)                    # build the similarity matrix
    # note: with affinity='nearest_neighbors', sklearn treats each row of w
    # as a feature vector and builds its own k-nearest-neighbor graph from them
    n_clusters = 6
    labels = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors',
                                n_neighbors=4, eigen_solver='arpack',
                                n_jobs=20).fit_predict(w)

    reldict = {}
    for i in range(n_clusters):
        reldict[i] = []
    num = 0
    for i in labels:
        reldict[i].append(dict_my[num])
        num += 1

    file_w = open(r'E:\test_sc\rel_200.txt', 'w')
    for key, value in reldict.items():
        file_w.write(str(key) + ':')
        for word in value:
            file_w.write(word + ' ')
        file_w.write('\n')
    file_w.close()
    print('over-----')
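
One design note on the sklearn call above: since w is already a precomputed similarity matrix, SpectralClustering also accepts affinity='precomputed', in which case the library uses w directly as the graph instead of rebuilding a nearest-neighbor graph from its rows; the matrix should then contain nonnegative similarities.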
Cluster effect display: (result screenshot omitted)


References:

1. http://www.cnblogs.com/pinard/p/6221564.html

2. http://www.cnblogs.com/fengyan/archive/2012/06/21/2553999.html

3. http://blog.csdn.net/zhangyi880405/article/details/39781817

4. http://blog.csdn.net/u014568921/article/details/49287565#t8

5. http://www.cnblogs.com/pinard/p/6235920.html

6. http://scikit-learn.org/dev/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering
