Summary of clustering algorithms in complex networks


A network is known mathematically as a graph. Its study began in 1736 with Euler's Seven Bridges of Königsberg problem, but graph theory developed slowly afterward: the first book on the subject did not appear until 1936. In the 1960s, two Hungarian mathematicians, Erdős and Rényi, established the theory of random graphs, which is recognized as the beginning of the systematic mathematical study of complex networks. For the next 40 years, random graph theory served as the basic theory of complex network research. However, the vast majority of real-world networks are not completely random. In 1998, Watts and his advisor Strogatz published "Collective Dynamics of Small-World Networks" in Nature, revealing the small-world property of complex networks. Then, in 1999, Barabási and his doctoral student Albert published "Emergence of Scaling in Random Networks" in Science, revealing the scale-free property of complex networks (degrees follow a power-law distribution). This opened a new era of complex network research.
As research progressed, more and more properties of complex networks were uncovered. One of the most important findings was the 2002 PNAS article by Girvan and Newman, "Community structure in social and biological networks", which pointed out that clustering is ubiquitous in complex networks, called each such cluster a community, and proposed an algorithm to discover these communities. Since then, a great deal of research on community detection in complex networks has appeared and many algorithms have been produced. This article attempts a simple summary of clustering algorithms in complex networks, in the hope of helping readers who want to understand this area quickly. The term "community" used here is consistent with the concept of a class (cluster) in the clustering algorithms we normally use.

0. Preliminaries

For the completeness of this article, we first give some basic concepts.
A graph is usually written G = (V, E), where V is the set of vertices and E the set of edges; we use n for the number of vertices and m for the number of edges. The number of edges incident to a vertex is called the degree of that vertex, and the sum of all degrees in a graph equals exactly twice the number of edges. A graph is usually represented by its adjacency matrix A, whose (i, j) entry is 1 if there is an edge between vertex i and vertex j, and 0 otherwise.
In this article we will also use the concept of a random graph: a graph in which any two vertices are connected with equal probability. A random graph is generated by first fixing n vertices and then connecting each pair of vertices independently with a fixed probability p. In research, the random graph is commonly used as a null model: comparing it with a real network reveals properties of the latter.
When studying community division, one question to answer is how to measure whether a given division is good or bad. A simple and intuitive principle is that there should be as many edges as possible within communities and as few as possible between them. A slightly more involved but more widely used measure is the modularity proposed by Newman and others. The basic idea is this: a random graph is assumed to have no community structure, so we compare the actual network with its corresponding random network; the greater the difference between the two, the more pronounced the community structure. Concretely, we compute a "density" for each subnetwork in the division, compute the "density" that subnetwork would have in the random case, and take the difference between the two; this difference measures how far the subnetwork deviates from randomness, and the larger it is, the denser the subnetwork is relative to the random network. Adding up the differences over all subnetworks gives the modularity of the network:

Q = (1/2m) Σ_ij [A_ij − k_i·k_j/(2m)] δ(c_i, c_j) = Σ_{c=1}^{n_c} [l_c/m − (d_c/(2m))²]

Here A_ij is the adjacency matrix of the graph, k_i is the degree of vertex i, m is the number of edges, k_i·k_j/(2m) is the expected number of edges between vertices i and j, and δ(c_i, c_j) is 1 when i and j are in the same community and 0 otherwise. The modularity can further be written in the form on the right, where n_c is the total number of communities, l_c is the number of edges inside community c, and d_c is the sum of the degrees of the vertices in community c (note: a vertex inside a community may have edges both to other vertices of the community and to other communities, so d_c = 2·l_c + the number of edges leaving c, hence d_c ≥ 2·l_c).
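As a concrete illustration, the community-sum form of modularity can be computed directly from an edge list; this is a minimal sketch (the two-triangle toy graph and function name are assumptions for demonstration, not from the text):

```python
# Modularity Q = sum over communities c of [ l_c/m - (d_c/(2m))^2 ],
# where l_c = edges inside c, d_c = total degree of c, m = total edges.
def modularity(edges, communities):
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    q = 0.0
    for c in communities:
        l_c = sum(1 for u, v in edges if u in c and v in c)
        d_c = sum(deg[u] for u in c)
        q += l_c / m - (d_c / (2 * m)) ** 2
    return q

# Toy example: two triangles joined by the bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
print(modularity(edges, [{0, 1, 2}, {3, 4, 5}]))  # 5/14 ≈ 0.357
```

Splitting at the bridge gives Q = 5/14, while taking the whole graph as one community gives Q = 0, matching the intuition that modularity rewards divisions denser than random.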

With this knowledge, let's take a look at the various algorithms of complex network community division.

1. Graph partitioning

Dividing a network into communities means partitioning a graph into several subgraphs. Graph partitioning is a classic and much-studied problem in graph theory, and it is NP-hard in general. People therefore usually study the simpler case of graph bisection: splitting a graph into two subgraphs (generally required to be of equal size). The best-known algorithms are the Kernighan–Lin algorithm and spectral bisection.

1.1 Kernighan–Lin algorithm [1]

Idea: keep swapping vertices between the two subgraphs so that the number of edges between them becomes as small as possible.
Define the gain function: Q = (number of edges within the two communities) − (number of edges between them).
Algorithm steps:
Step 1. Randomly divide the graph into two communities of the given sizes.
Step 2. For pairs of vertices taken from the two communities, tentatively swap them and compute ΔQ = Q_after − Q_before; select the pair with the largest ΔQ and swap it.
Step 3. With the restriction that each vertex may be swapped only once, repeat Step 2 on the remaining vertices until ΔQ < 0, or until every vertex of one of the subgraphs has been swapped once.
Step 4. Allow every vertex to be swapped again and start a new round of iteration, until no vertex pair can be swapped.
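The steps above can be sketched as a single greedy pass; this is a simplification (true Kernighan–Lin also explores swap sequences with negative interim gain and keeps the best prefix), and the toy graph and starting split are assumptions:

```python
# Simplified Kernighan-Lin pass: repeatedly swap the cross-partition pair
# with the largest gain D_a + D_b - 2*c_ab, locking swapped vertices,
# until no strictly positive gain remains.
def kl_pass(adj, part_a, part_b):
    a, b = set(part_a), set(part_b)
    locked = set()
    while True:
        best, best_gain = None, 0
        for u in a - locked:
            for v in b - locked:
                # D_x = external neighbours minus internal neighbours of x
                d_u = sum(1 if n in b else -1 for n in adj[u])
                d_v = sum(1 if n in a else -1 for n in adj[v])
                gain = d_u + d_v - 2 * (v in adj[u])
                if gain > best_gain:
                    best, best_gain = (u, v), gain
        if best is None:
            return a, b
        u, v = best
        a.remove(u); a.add(v)
        b.remove(v); b.add(u)
        locked.update(best)

def cut_size(adj, a):
    return sum(1 for u in adj for v in adj[u] if u < v and ((u in a) != (v in a)))

# Two triangles {0,1,2} and {3,4,5} joined by edge (2,3), from a bad split.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
a, b = kl_pass(adj, {0, 1, 5}, {2, 3, 4})
print(sorted(a), sorted(b), cut_size(adj, a))  # [0, 1, 2] [3, 4, 5] 1
```

Starting from the cut of size 4, a single swap of vertices 5 and 2 (gain 3) recovers the two triangles with a cut of 1.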

1.2 Spectral bisection

The "spectrum" is the set of eigenvalues of a matrix; "bisection" means dividing a graph into two subgraphs of equal size. Spectral bisection clusters using the eigenvector of the second-smallest eigenvalue of the graph's Laplacian matrix. We set the details aside for now and return to them after introducing spectral algorithms in general (see Section 4).

1.3 Summary

Partitioning a graph generally requires specifying the number of subgraphs in advance (otherwise the whole graph counts as one subgraph), and often even the size of each subgraph (otherwise a single vertex tends to be split off as its own subgraph). However, if a graph is known to consist of two communities, spectral bisection often gives a good result.

2. Agglomerative methods

This is a large class of algorithms. The overall idea is bottom-up: keep putting the most similar pairs of points together, so that points first gather into small groups and the groups then merge into larger communities. Different definitions of the similarity between two points produce different algorithms. Some algorithms do not compute a similarity directly; instead they look at the change in modularity when two communities are merged, and choose which pair to merge according to the size of that change. A representative algorithm is Newman's 2004 algorithm [2].

2.1 Newman algorithm

Idea: repeatedly merge the two communities whose merger increases the modularity the most.
Algorithm steps:
Step 1. Treat each point as its own community.
Step 2. For each pair of communities, compute the change (increase) in modularity ΔQ after merging them, and merge the pair with the largest ΔQ.
Step 3. Repeat Step 2 until everything has been merged into one community.
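These steps can be sketched by brute force, recomputing modularity for every candidate merge and stopping when no merge increases Q; this is far slower than Newman's incremental ΔQ bookkeeping but faithful to the idea (the toy graph and stopping rule are assumptions of this sketch):

```python
def modularity(edges, comms):
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sum(
        sum(1 for u, v in edges if u in c and v in c) / m
        - (sum(deg[u] for u in c) / (2 * m)) ** 2
        for c in comms
    )

def greedy_merge(edges, nodes):
    # Start from singletons; repeatedly merge the pair of communities with
    # the largest modularity increase; stop when no merge increases Q.
    comms = [{u} for u in nodes]
    while len(comms) > 1:
        best, best_q = None, modularity(edges, comms)
        for i in range(len(comms)):
            for j in range(i + 1, len(comms)):
                trial = [c for k, c in enumerate(comms) if k not in (i, j)]
                trial.append(comms[i] | comms[j])
                q = modularity(edges, trial)
                if q > best_q:
                    best, best_q = trial, q
        if best is None:
            break
        comms = best
    return comms

# Two triangles joined by the bridge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
comms = greedy_merge(edges, range(6))
print(sorted(sorted(c) for c in comms))  # [[0, 1, 2], [3, 4, 5]]
```

On the toy graph the greedy merges assemble each triangle first, and merging the two triangles would make ΔQ negative, so the process stops at the correct division.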

2.2 Optimizations of the Newman algorithm

The complexity of the Newman algorithm is O((m+n)n). Clauset, Newman and Moore reduced it to O(n (log n)²) using a max-heap [3].
Many further improvements have been devised, for example the following two strategies:
1. Multistep greedy: merge several pairs of communities in each iteration.
2. Normalize ΔQ to eliminate the influence of community size.

2.3 Summary

Agglomerative methods have the advantages of being simple, not requiring the number of communities to be specified in advance, and revealing the hierarchical structure of the communities. Their disadvantages are the lack of a global objective function; that once two points are merged they remain in the same community forever, and the merge cannot be undone; and that they tend to be rather sensitive to individual points.

3. Divisive methods

The basic idea of divisive methods is to find the edges most likely to lie between communities and remove them, so that the communities naturally fall apart. The representative algorithm is the 2002 algorithm of Girvan and Newman [4].

3.1 GN algorithm [4]

First, define edge betweenness: the number of shortest paths that pass through an edge.
Intuitively, edges between communities have high betweenness, while edges inside a community have relatively low betweenness; so by removing the edges with the highest betweenness, the community structure gradually emerges.
Algorithm steps:
Step 1. Compute the betweenness of every edge (O(nm) using BFS).
Step 2. Remove the edge with the highest betweenness.
Step 3. If the desired community division has been reached, stop; otherwise go to Step 1.
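A compact sketch of this loop, using Brandes-style BFS accumulation for the edge betweenness and stopping once the graph splits into the requested number of components (the toy graph and helper names are assumptions of this sketch):

```python
from collections import deque

def edge_betweenness(adj):
    # Brandes' algorithm (unweighted): one BFS per source, then back-propagate
    # path-count-weighted dependencies along predecessor edges.
    bet = {}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                e = (min(v, w), max(v, w))
                bet[e] = bet.get(e, 0.0) + c
                delta[v] += c
    return {e: b / 2 for e, b in bet.items()}  # each pair counted from both ends

def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v])
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman(adj, k):
    adj = {u: set(vs) for u, vs in adj.items()}
    while len(components(adj)) < k:
        bet = edge_betweenness(adj)
        u, v = max(bet, key=bet.get)   # remove the highest-betweenness edge
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Two triangles joined by the bridge (2, 3): the bridge is removed first.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print([sorted(c) for c in girvan_newman(adj, 2)])  # [[0, 1, 2], [3, 4, 5]]
```

On the toy graph the bridge carries all 9 cross-triangle shortest paths, so it has the highest betweenness and is removed first.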

Replacing edge betweenness with other measures (such as the edge clustering coefficient) yields various variants of the algorithm.

3.2 Other splitting methods

There are many other divisive methods, such as MST-based methods and the JP algorithm, which are not described here.

3.3 Summary

Divisive methods are the counterpart of agglomerative methods: one is top-down and the other bottom-up; one keeps removing edges to pull points apart, the other keeps putting points together. Their advantages and disadvantages are similar to those of agglomerative methods.

4. Spectral algorithm

First, a concept: the Laplacian matrix (L matrix) of a graph. Let A be the adjacency matrix of the graph and D the diagonal matrix whose i-th diagonal element is the degree of vertex i; then the Laplacian is L = D − A. The L matrix has many useful properties, for example: (1) every row sums to 0, so L has at least one zero eigenvalue, whose eigenvector is the all-ones vector (1, 1, ..., 1); (2) the multiplicity of the zero eigenvalue equals the number of connected components, so if the graph is connected, L has exactly one zero eigenvalue and all remaining eigenvalues are positive; (3) eigenvectors of distinct eigenvalues are orthogonal.
The "spectrum" refers to the eigenvalues of a matrix. Spectral algorithms use the eigenvectors of the adjacency matrix or the Laplacian to project the points into a new space, and then cluster them in that space with a traditional method such as k-means.

The general steps of a spectral algorithm are:
Step 1. Compute the first s eigenvectors of a similarity matrix (such as the adjacency matrix A).
Step 2. Form an n×s matrix U whose columns are these eigenvectors.
Step 3. Treat the i-th row of U as the coordinates of point i, and obtain the final communities with hierarchical clustering or k-means.
Note that if the eigenvectors of the adjacency matrix are used, one generally takes the s largest eigenvalues; if those of the Laplacian are used, one takes the s smallest eigenvalues (excluding 0).

The spectral bisection method mentioned earlier clusters using the eigenvector of the second-smallest eigenvalue of the Laplacian (called the Fiedler vector). Since the graph is to be divided into two subgraphs, the points whose components in the Fiedler vector are positive are placed in one class, and those with negative components in the other.
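A dependency-free sketch of spectral bisection: the smallest Laplacian eigenvalue is 0 with the all-ones eigenvector, so power iteration on c·I − L, kept orthogonal to the ones vector, converges to the Fiedler vector (the shift constant, iteration count and toy graph are assumptions of this sketch; a real implementation would use a Lanczos-based eigensolver):

```python
def fiedler_sign_split(adj):
    nodes = sorted(adj)
    n = len(nodes)
    idx = {u: i for i, u in enumerate(nodes)}
    deg = [len(adj[u]) for u in nodes]
    c = 2 * max(deg) + 1  # shift so that M = c*I - L has positive spectrum

    def mul(x):  # y = (c*I - L) x = (c*I - D + A) x
        return [(c - deg[i]) * x[i] + sum(x[idx[v]] for v in adj[nodes[i]])
                for i in range(n)]

    x = [(i % 3) - 1.0 for i in range(n)]  # arbitrary start vector
    for _ in range(2000):
        x = mul(x)
        mean = sum(x) / n              # project out the all-ones eigenvector
        x = [v - mean for v in x]
        norm = sum(v * v for v in x) ** 0.5
        x = [v / norm for v in x]
    return ({nodes[i] for i in range(n) if x[i] >= 0},
            {nodes[i] for i in range(n) if x[i] < 0})

# Two triangles joined by the bridge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
a, b = fiedler_sign_split(adj)
print(sorted(a), sorted(b))  # the two triangles, in either order
```

The sign pattern of the Fiedler vector separates the two triangles exactly, since the bridge is the sparsest place to cut.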

The computational bottleneck of spectral algorithms is computing the eigenvalues of the matrix. Since a few eigenvectors already give a good clustering, only a few extreme eigenvalues and their eigenvectors are needed, and these can be computed with the Lanczos method.

5. Matrix decomposition

The essence of the spectral algorithm is matrix decomposition; other matrix decomposition methods include SVD and NMF. The overall idea of matrix decomposition is to map the points from one space into another, and then cluster in the new space with a traditional clustering method.

6. Label Propagation algorithm

The label propagation algorithm (LPA) was proposed by Zhu et al. in 2002 [5] as a graph-based semi-supervised learning method; its basic idea is to use the label information of labeled nodes to predict the labels of unlabeled nodes. In 2007, Raghavan et al. first applied label propagation to community detection; that algorithm is referred to as the RAK algorithm [6].

Idea: each node's label assigns it to a community; in each iteration, every node updates its label to the one carried by the majority of its neighbors; after convergence, nodes with the same label belong to the same community.
Algorithm steps:
Step 1. Assign each node a unique label.
Step 2. Generate a random ordering of all nodes and, in that order, set each node's label to the most frequent label among its neighbors.
Step 3. Repeat Step 2 until no node's label changes; nodes sharing a label then form a community.
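The steps above can be sketched as follows; for reproducibility this version visits nodes in a fixed order and breaks ties deterministically, whereas the RAK algorithm randomizes both (the two-4-clique toy graph is an assumption of this sketch):

```python
from collections import Counter

def label_propagation(adj, max_iter=100):
    labels = {u: u for u in adj}            # each node starts with its own label
    for _ in range(max_iter):
        changed = False
        for u in sorted(adj):               # fixed order (RAK uses a random order)
            counts = Counter(labels[v] for v in adj[u])
            top = max(counts.values())
            # deterministic tie-break: largest label among the most frequent
            new = max(l for l, c in counts.items() if c == top)
            if new != labels[u]:
                labels[u] = new
                changed = True
        if not changed:
            break
    return labels

# Two 4-cliques joined by the single edge (3, 4).
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
       4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6}}
labels = label_propagation(adj)
print(labels)  # nodes 0-3 share one label, nodes 4-7 another
```

Within each clique the majority label wins quickly, and the single bridge edge is never a majority, so the two cliques keep distinct labels.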

7. Random Walk

Random walk: when moving from one vertex to the next, a neighbor of the current vertex is chosen uniformly at random as the next vertex.
Basic idea: a community is a comparatively dense subgraph, so a random walk on the graph easily gets "trapped" inside a community.

The random walk process forms a Markov chain, each vertex of the graph corresponding to a state. The transition probability from state i to state j is P_ij = A_ij / k_i, where k_i is the degree of vertex i.
The probability that a t-step random walk starting at i ends at j is the (i, j) entry of the t-th power of P, that is, (P^t)_ij.
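In matrix form P = D⁻¹A, so a t-step probability is just a matrix power; a minimal pure-Python sketch (the toy graph is an assumption):

```python
def transition_matrix(adj, nodes):
    # P[i][j] = A_ij / k_i : uniform choice among the neighbours of i
    return [[(1.0 / len(adj[u]) if v in adj[u] else 0.0) for v in nodes]
            for u in nodes]

def mat_pow(p, t):
    n = len(p)
    r = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(t):
        r = [[sum(r[i][k] * p[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return r

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
nodes = sorted(adj)
p3 = mat_pow(transition_matrix(adj, nodes), 3)
# each row of P^t is a probability distribution over end vertices
print([round(sum(row), 6) for row in p3])  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Rows of P^t always sum to 1; Walktrap builds its vertex distances from exactly these rows.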

Here we introduce a representative random-walk algorithm, the Walktrap algorithm [7].

Walktrap algorithm

Define the following distances (following Pons and Latapy [7], where P^t_ik is the probability that a t-step walk from i ends at k):
Distance between vertices i and j: r_ij = sqrt( Σ_{k=1}^{n} (P^t_ik − P^t_jk)² / d(k) ), where d(k) is the degree of vertex k.
Distance from a community C to a vertex k: defined the same way, with P^t_ik replaced by P^t_Ck = (1/|C|) Σ_{i∈C} P^t_ik.
Distance between communities C1 and C2: r_{C1,C2}, defined analogously from P^t_{C1,k} and P^t_{C2,k}.

Algorithm steps:
Step 1. Treat each point as a community and compute the distances between adjacent points (communities).
Step 2. Merge the two adjacent communities C1 and C2 that minimize the Ward-style criterion Δσ(C1, C2) = (1/n) · |C1||C2|/(|C1|+|C2|) · r_{C1,C2}².

Repeat Step 2 until all points are merged into one community.

As an aside, the label propagation algorithm, though fast, is often not very effective.

8. Louvain (BGLL) algorithm

The Louvain (BGLL) algorithm [8] is a heuristic based on modularity optimization. It iterates on two levels: the outer iteration is a bottom-up agglomeration, while the inner iteration is agglomeration plus a move strategy, which avoids the big drawback of plain agglomerative methods (once two nodes are merged, they can never be separated).

Algorithm steps:
Step 1. Initially treat each point as its own community and traverse the vertices in some order. For each vertex i, consider the change in modularity ΔQ when i is moved into the community of one of its neighbors j. If ΔQ > 0, move vertex i into the neighboring community that maximizes ΔQ; otherwise leave i where it is. Repeat this process until no single vertex move can increase the modularity.
Step 2. Treat each community obtained in Step 1 as a new vertex and start a new iteration, until the modularity no longer changes.
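The inner, local-moving phase (Step 1) can be sketched by brute force, recomputing modularity for each candidate move; the real algorithm uses a constant-time ΔQ update and then aggregates communities into super-vertices, neither of which is shown here (the toy graph is an assumption of this sketch):

```python
def modularity(adj, labels, m):
    deg = {u: len(adj[u]) for u in adj}
    comms = {}
    for u in adj:
        comms.setdefault(labels[u], set()).add(u)
    q = 0.0
    for c in comms.values():
        l_c = sum(1 for u in c for v in adj[u] if v in c) / 2  # internal edges
        d_c = sum(deg[u] for u in c)                           # total degree
        q += l_c / m - (d_c / (2 * m)) ** 2
    return q

def local_moving(adj):
    m = sum(len(adj[u]) for u in adj) // 2
    labels = {u: u for u in adj}      # every node starts in its own community
    moved = True
    while moved:
        moved = False
        for u in sorted(adj):
            best_lab, best_q = labels[u], modularity(adj, labels, m)
            for lab in {labels[v] for v in adj[u]} - {labels[u]}:
                old, labels[u] = labels[u], lab   # tentatively move u
                q = modularity(adj, labels, m)
                if q > best_q + 1e-12:
                    best_lab, best_q = lab, q
                labels[u] = old
            if best_lab != labels[u]:
                labels[u] = best_lab
                moved = True
    return labels

# Two triangles joined by the bridge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = local_moving(adj)
groups = {}
for u, l in labels.items():
    groups.setdefault(l, set()).add(u)
print(sorted(sorted(g) for g in groups.values()))  # [[0, 1, 2], [3, 4, 5]]
```

On the toy graph, repeated sweeps of single-vertex moves alone already reach the two triangles; the full Louvain algorithm would then collapse each community into one vertex and repeat.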

The algorithm is simple, intuitive, easy to implement, and fast, and its results are very good. Considering both efficiency and quality, it is arguably one of the best methods currently available.

9. Canopy algorithm + k-means

9.1 Canopy algorithm

Idea: use a computationally cheap similarity measure to place similar objects into subsets called canopies; different canopies may overlap.
Algorithm steps:
Step 1. Let the point set be S, and fix two distance thresholds T1 and T2 (T1 > T2).
Step 2. Take a point p from S and, with the cheap method, quickly compute the distance from p to each existing canopy center; add p to every canopy whose center is within distance T1. If there is no such canopy, make p the center of a new canopy, and remove from S all points within distance T2 of p.
Step 3. Repeat Step 2 until S is empty.
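A sketch of the classic (McCallum-style) formulation, in which each picked point becomes a canopy center; scalar points with absolute difference as the cheap distance, and the thresholds, are assumptions of this sketch:

```python
def canopy(points, t1, t2):
    # t1 > t2: t1 is the loose "membership" radius, t2 the tight "removal" radius
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        canopies.append([p for p in remaining if abs(p - center) < t1])
        remaining = [p for p in remaining if abs(p - center) > t2]
    return canopies

print(canopy([0, 1, 2, 10, 11, 12], t1=3, t2=2.5))
# [[0, 1, 2], [10, 11, 12]]
```

Because membership uses the loose radius T1 while removal uses the tight radius T2, points near a canopy boundary can end up in more than one canopy, which is exactly the intended overlap.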

The algorithm has low precision but is fast, so it is often used for "coarse" clustering: it yields a value of k, after which k-means refines the clustering, and no similarity computation is done between objects that do not share a canopy.

9.2 K-means

K-means is familiar to most people. The basic idea is to find the "center" of each community and then assign every point to the community of the nearest center.
Algorithm steps:
Step 0. Choose k points as the initial centers of the k communities.
Step 1. Assign each point to the community of the nearest center.
Step 2. Recompute the centers; if they no longer change, stop; otherwise go to Step 1.
The computation of k-means is comparatively heavy and the results are often good, but check one thing before using it: averaging the points componentwise to obtain a center must be meaningful, that is, Euclidean distance must make sense for your problem.
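A minimal 1-D sketch of the steps above, with fixed initial centers for determinism (the data and initialization are assumptions; real k-means works in higher dimensions and randomizes initialization):

```python
def kmeans(xs, centers, max_iter=100):
    centers = list(centers)
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest centre
        clusters = [[] for _ in centers]
        for x in xs:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # Step 2: recompute centres; stop when they no longer move
        new = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:
            return centers, clusters
        centers = new
    return centers, clusters

centers, clusters = kmeans([1.0, 1.1, 0.9, 10.0, 10.1, 9.9], [0.0, 5.0])
print(centers)  # roughly [1.0, 10.0]
```

Even from poor initial centers, one assignment/update round pulls the centers onto the two groups, and a second round confirms convergence.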

10. Density-based fast clustering

A 2014 clustering paper in Science [9] proposes a fast clustering method whose basic idea is to find the center of each class and then assign the remaining points to the classes by a simple rule. The idea is very simple, but the paper's way of finding the class centers is quite novel.
Algorithm steps:
Step 1. For each point i, compute two quantities: the local density ρ_i of point i, and the minimum distance δ_i from i to any point of higher density.
Step 2. Take the points for which both quantities are large as the centers of the communities (the idea behind this: class centers should be dense, and the centers of different classes should be far from each other).
Step 3. Assign each remaining, non-center point to the community of its nearest neighbor of higher density.
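A sketch of these steps using a Gaussian kernel for ρ and the product γ = ρ·δ to pick centers automatically; the paper itself selects centers from a ρ-δ decision graph by inspection, so the kernel, the γ rule, the data, and d_c here are all assumptions of this sketch:

```python
import math

def density_peaks(xs, dc, k):
    n = len(xs)
    d = [[abs(xs[i] - xs[j]) for j in range(n)] for i in range(n)]
    # Step 1a: local density (Gaussian kernel; a hard cutoff also works)
    rho = [sum(math.exp(-(d[i][j] / dc) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    # Step 1b: delta_i = distance to the nearest point of higher density;
    # for the densest point, use the maximum distance instead.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta, parent = [0.0] * n, [None] * n
    delta[order[0]] = max(d[order[0]])
    for rank in range(1, n):
        i = order[rank]
        parent[i] = min(order[:rank], key=lambda j: d[i][j])
        delta[i] = d[i][parent[i]]
    # Step 2: centres = points where both rho and delta are large
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:k]
    labels = [None] * n
    for c, i in enumerate(centers):
        labels[i] = c
    # Step 3: assign the rest, in decreasing density, to their parent's community
    # (assumes the densest point is a centre, which holds here)
    for i in order:
        if labels[i] is None:
            labels[i] = labels[parent[i]]
    return labels

print(density_peaks([0, 1, 2, 10, 11, 12], dc=2.0, k=2))
```

The two interior points 1 and 11 are both dense and far from any denser point, so they are picked as centers, and the chain of nearest-denser-neighbor assignments attaches each remaining point to its own group.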

Reference documents

[1] Kernighan & Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal 49: 291–307, 1970.
[2] Newman. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 2004.
[3] Clauset, Newman & Moore. Finding community structure in very large networks. Phys. Rev. E, 2004.
[4] Girvan & Newman. Community structure in social and biological networks. PNAS, 2002.
[5] Zhu & Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, 2002.
[6] Raghavan et al. Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 2007.
[7] Pons & Latapy. Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 2006.
[8] Blondel et al. Fast unfolding of communities in large networks. J. Stat. Mech., 2008.
[9] Rodriguez & Laio. Clustering by fast search and find of density peaks. Science, 2014.
