Community Partitioning Algorithms

Introduction

Using large amounts of Internet data, we can build a network whose nodes are information resources of some kind, such as pictures, videos, posts, and news items, and whose edges are the flows of users between those resources. For such a network, community partitioning algorithms can reveal the correlations between information resources. Because the discovered correlations exploit information about how users process the resources, this is a deeper form of knowledge discovery than clustering based only on the information carried by the resources themselves (for example, clustering news items by the keywords they contain).

Two main ways of dividing network communities

There are many community partitioning algorithms, and they can be divided into two categories: topological analysis and flow analysis. The former generally applies to unweighted networks; its idea is that the edge density within a community is higher than the edge density between communities. The latter applies to weighted networks; its idea is to find the community structure formed by some kind of flow (of material, energy, or information) over the network. Each of the two approaches has its own characteristics, and the choice between them depends on what the network data describe and what information the researcher wants to obtain.

We can classify some of the known algorithms into these two categories:

Topological analysis

Computing the degree of modularization of a network: Q-modularity

Q-modularity is an indicator defined on the interval [-0.5, 1]. For a given community structure, it considers, within each community, the difference between the actual number of edges and the number expected at random. The more the actual number of edges exceeds the random expectation, the stronger the tendency of nodes to concentrate in certain communities, that is, the more pronounced the modular structure of the network. Newman introduced the concept in 2004 to evaluate a community partitioning approach of his own, but because the indicator is scientifically sound and filled a gap, it quickly became a common standard for evaluating community partitioning algorithms in general. The specific formula for Q is as follows:

Q = (1/2m) * sum_{ij} [ A_ij - k_i * k_j / (2m) ] * delta(c_i, c_j)

Here A is the adjacency matrix corresponding to the network G: A_ij = 1 if there is an edge from i to j, and 0 otherwise. m is the total number of edges, so 2m is the total degree, and A_ij/(2m) is the actual probability of a connection between the two nodes. k_i and k_j are the degrees of i and j, respectively. If we keep the degree distribution of the network but randomly shuffle its edges, the probability that any given pair of nodes is connected after the shuffle is k_i*k_j/(2m)^2. The bracketed term is therefore the degree to which the actual connection probability between the nodes exceeds its expected value. It is followed by a binary function delta(c_i, c_j), which equals 1 if nodes i and j belong to the same community and 0 otherwise; this ensures that we only count edges inside communities. This definition takes the node as the unit of analysis. In fact, if Q is viewed with the community as the unit of analysis, it can be reduced to Q = sum_i (e_ii - a_i^2), where e_ii is the fraction of the network's edges that lie inside the i-th community and a_i is the fraction of edge ends attached to nodes of the i-th community.
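
As an illustration, here is a minimal NumPy sketch (not from the original article) that evaluates this node-level formula for an undirected, unweighted network; the toy adjacency matrix and community assignment are assumed examples:

    import numpy as np

    def q_modularity(A, membership):
        """Node-level Q = (1/2m) * sum_ij [A_ij - k_i*k_j/(2m)] * delta(c_i, c_j)."""
        A = np.asarray(A, dtype=float)
        k = A.sum(axis=1)                    # node degrees
        two_m = A.sum()                      # 2m: total degree (each edge counted twice)
        c = np.asarray(membership)
        delta = (c[:, None] == c[None, :])   # 1 if same community, else 0
        return ((A - np.outer(k, k) / two_m) * delta).sum() / two_m

    # Toy example: two triangles joined by a single edge
    A = np.array([[0,1,1,0,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,1,0,0],
                  [0,0,1,0,1,1],
                  [0,0,0,1,0,1],
                  [0,0,0,1,1,0]])
    print(q_modularity(A, [0,0,0,1,1,1]))    # approx 0.357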

The formula above defines Q clearly, but computing it directly requires traversing the communities and the nodes inside them, which is computationally expensive. Newman (2006) simplified the formula and expressed it in matrix form as follows. We define S as an n x r matrix, where n is the number of nodes and r is the number of communities; its element S_ir is 1 if node i belongs to community r, and 0 otherwise.

We then have

Q = (1/2m) * Tr(S^T * B * S)

where B is the modularity matrix, with elements

B_ij = A_ij - k_i * k_j / (2m)

Every row and every column of this matrix sums to 0, because the degree distribution is the same in the actual network and in the randomly shuffled network. In the special case of only two communities (r = 2), s can instead be defined as a vector of length n whose entries are 1 for nodes belonging to one community and -1 for nodes belonging to the other, and Q can be written in a simpler form:

Q = (1/4m) * s^T * B * s
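
A minimal NumPy sketch of this matrix form (reusing the same assumed two-triangle toy network and encoding the two-community split as a +/-1 vector s; not code from the original article):

    import numpy as np

    def q_two_communities(A, s):
        """Q = (1/4m) * s^T B s, with B_ij = A_ij - k_i*k_j/(2m)."""
        A = np.asarray(A, dtype=float)
        k = A.sum(axis=1)
        two_m = A.sum()                      # 2m
        B = A - np.outer(k, k) / two_m       # modularity matrix
        s = np.asarray(s, dtype=float)       # +1 / -1 community labels
        return s @ B @ s / (2 * two_m)       # divide by 4m

    A = np.array([[0,1,1,0,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,1,0,0],
                  [0,0,1,0,1,1],
                  [0,0,0,1,0,1],
                  [0,0,0,1,1,0]])
    print(q_two_communities(A, [1,1,1,-1,-1,-1]))   # approx 0.357, same as before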

By searching the space of possible community partitions, we can find a partition that maximizes the value of Q. This search involves numerical optimization; fast greedy and multilevel, listed in table one, are examples of fast searches carried out in different ways. Take fast greedy (Newman, 2006) as an example: it tracks the increase of Q while continuously merging communities, and achieves a worst-case complexity of about O(|E|*log(|V|)) and close to linear complexity in the best case.
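
For reference, a quick sketch of running these modularity-maximizing searches with the python-igraph library (the function names below are igraph's, not the original article's, and the example graph is just an assumed test network):

    import igraph as ig

    # Zachary's karate club, a standard small test network
    g = ig.Graph.Famous("Zachary")

    # Fast greedy: agglomerative merging guided by the increase in Q
    dendrogram = g.community_fastgreedy()
    clusters_fg = dendrogram.as_clustering()     # cut the dendrogram at max Q

    # Multilevel (Louvain): another fast heuristic for maximizing Q
    clusters_ml = g.community_multilevel()

    print("fast greedy  Q =", g.modularity(clusters_fg.membership))
    print("multilevel   Q =", g.modularity(clusters_ml.membership))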

Computing the tightness of the network's edges: edge betweenness

This idea appeared earlier (Newman, 2001). Freeman (1975) proposed an indicator called betweenness, which measures the extent to which a node occupies the shortcuts between the other n-1 nodes in the network. Specifically, we find the shortest path for every pair of nodes, obtaining a set S of n*(n-1)/2 shortest paths, and then count how many paths in this set pass through a particular node. Newman borrowed this measure but applied it to edges instead of nodes: the betweenness of an edge is the number of shortest paths in S that pass through that edge. Once edge betweenness is defined, communities can be partitioned by an iterative algorithm: first compute the betweenness of all edges, remove the edge with the highest value, recompute, remove the edge with the new highest value, and repeat until all edges in the network have been removed. In this process, the network is gradually cut into smaller and smaller components, and Q-modularity can again be used to evaluate the resulting community partitions. The algorithm is clearly defined and involves no matrix arithmetic, but its computational complexity is very high.
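
A brief python-igraph sketch of this procedure (igraph's built-in routine handles the remove-and-recompute loop and uses Q to choose where to cut the resulting dendrogram; the example graph is again assumed):

    import igraph as ig

    g = ig.Graph.Famous("Zachary")

    # Iteratively remove the edge with the highest betweenness and recompute;
    # the result is a dendrogram of progressively smaller components.
    dendrogram = g.community_edge_betweenness()

    # Cut the dendrogram at the level with the highest Q-modularity
    clusters = dendrogram.as_clustering()
    print(len(clusters), "communities, Q =", g.modularity(clusters.membership))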

Computing the eigenvectors of the network's Laplacian matrix: leading eigenvector

A network G with n nodes can be expressed as an n x n adjacency matrix A, in which A_ij = 1 if there is an edge between nodes i and j and 0 otherwise. When the network is undirected, A_ij = A_ji. In addition, we can construct the n x n degree matrix D: the elements on the diagonal of D are the node degrees, for example D_ii is the degree of node i, and all off-diagonal elements are 0. For an undirected network there is no choice to make about which degree to use, but for a directed network one should decide between in-degree and out-degree according to the goal of the analysis. Subtracting the adjacency matrix from the degree matrix gives the Laplacian matrix, that is, L = D - A.

The eigenvalues of L, lambda_0 <= lambda_1 <= ... <= lambda_{n-1}, have some interesting properties. First, the smallest eigenvalue is always equal to 0: multiplying L by an all-ones vector of length n amounts to summing each row, and in every row the node's degree on the diagonal exactly cancels its adjacency entries, so the result is 0. Second, the number of zero eigenvalues equals the number of connected components of the network G. This means that if no eigenvalue other than the smallest one is 0, the entire network forms a single connected whole.

Among these eigenvalues, the second smallest (that is, the smallest non-zero) eigenvalue lambda_1 is also called the algebraic connectivity, and its corresponding eigenvector is called the Fiedler vector. When lambda_1 > 0, the network is connected as a whole, and the larger lambda_1 is, the more tightly the network is linked together. This definition looks very much like the Q-modularity discussed above; in fact, Newman's 2006 article discusses the mathematical correspondence between the two. As an example, for the sample network, the Laplacian matrix can be obtained as follows:

The eigenvalues of this matrix are {5.5, 4.5, 4.0, 3.4, 2.2, 1.3, 1.0, 0}. Taking lambda_1 = 1.0, the Fiedler vector is {0.29, 0.00, 0.29, 0.29, 0.29, -0.58, -0.58, 0.00}. Because the entries of the Fiedler vector correspond to the nodes in the diagram, it can be written as {a: 0.29, b: 0.00, c: 0.29, d: 0.29, e: 0.29, f: -0.58, g: -0.58, h: 0.00}. Just from the signs of the entries, the analysis suggests separating nodes f and g from the other nodes; examining the values in more detail suggests dividing the network into three communities, {{a, c, d, e}, {b, h}, {f, g}}. Going back to the picture, we find that this community classification is basically reasonable.
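
A minimal NumPy sketch of this spectral step (the adjacency matrix below is just an assumed stand-in, since the original figure of the sample network is not reproduced here):

    import numpy as np

    # Stand-in adjacency matrix for a small undirected network
    A = np.array([[0,1,1,0,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,1,0,0],
                  [0,0,1,0,1,1],
                  [0,0,0,1,0,1],
                  [0,0,0,1,1,0]], dtype=float)

    D = np.diag(A.sum(axis=1))                      # degree matrix
    L = D - A                                       # Laplacian matrix

    eigenvalues, eigenvectors = np.linalg.eigh(L)   # eigh returns ascending eigenvalues
    print("eigenvalues:", np.round(eigenvalues, 2)) # smallest is ~0

    fiedler = eigenvectors[:, 1]                    # eigenvector of the second smallest eigenvalue
    print("Fiedler vector:", np.round(fiedler, 2))
    print("split by sign:", np.where(fiedler >= 0)[0], np.where(fiedler < 0)[0])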

Flow analysis

Random walk algorithm: Walktrap

P. Pons and M. Latapy proposed a network community partitioning algorithm based on random walks in 2005. Their idea is that the difference between the flow distances from two nodes to third-party nodes can be used to measure the similarity of the two nodes, and thus serve community partitioning. The concrete procedure is as follows. First, normalize each row of the adjacency matrix A corresponding to the network G to obtain the transition matrix P. Expressed in matrix form, this normalization can be written as

P = D^{-1} * A

where A is the adjacency matrix and D is the degree matrix. Using the Markov property of P, the element P^t_ij of its t-th power gives the probability that a random walker starting from node i reaches node j in t steps. Next, the distance between two nodes i and j is defined as follows:

r_ij = sqrt( sum_k (P^t_ik - P^t_jk)^2 / d(k) )

where t is the number of steps of the walk. The step length must be chosen appropriately: if t is too small, it cannot reflect the structural characteristics of the network, while if t is too large, P^t_ij approaches a value proportional to the degree d(j) of node j, and the topological information about the starting point i is washed out. The authors suggest an empirical value of t between 3 and 5. k is a target node, so the formula compares the probabilities of flowing from i and from j to the target node k after t steps (because this probability is proportional to the degree d(k) of the target node k, it is divided by d(k) to remove that effect). The smaller the distances between i, j, and all the other nodes of the network, the more likely it is that i and j occupy similar positions and are close to each other. It is worth noting that this idea would be inappropriate if only one or a few target nodes were considered, because then r_ij would capture mere structural symmetry: i and j could sit at opposite ends of the network, far apart, yet be equally distant from some node in the middle. But since the formula requires k to traverse all the other nodes of the network, if i and j are similarly close to all other nodes, it is more likely that they are actual neighbors rather than merely symmetric in structure. As the formula shows, r_ij can also be written in matrix form, where P^t_{i.} denotes the i-th row of P^t:

r_ij = || D^{-1/2} * P^t_{i.} - D^{-1/2} * P^t_{j.} ||
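
A small NumPy sketch of this distance computation (a generic implementation of the formula above, assuming an undirected, unweighted adjacency matrix A; not code from the Walktrap authors):

    import numpy as np

    def walktrap_distance(A, i, j, t=4):
        """Flow distance r_ij = sqrt(sum_k (P^t_ik - P^t_jk)^2 / d(k))."""
        A = np.asarray(A, dtype=float)
        d = A.sum(axis=1)                       # node degrees
        P = A / d[:, None]                      # row-normalized transition matrix
        Pt = np.linalg.matrix_power(P, t)       # t-step transition probabilities
        return np.sqrt(np.sum((Pt[i] - Pt[j]) ** 2 / d))

    # Two triangles joined by one edge: nodes 0 and 1 share a triangle,
    # nodes 0 and 5 sit in different triangles, so their distance is larger.
    A = np.array([[0,1,1,0,0,0],
                  [1,0,1,0,0,0],
                  [1,1,0,1,0,0],
                  [0,0,1,0,1,1],
                  [0,0,0,1,0,1],
                  [0,0,0,1,1,0]])
    print(walktrap_distance(A, 0, 1), walktrap_distance(A, 0, 5))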

After defining the distance r_ij between any two nodes, we can generalize it to obtain the distance between two communities, r_{C1C2}:

r_{C1C2} = sqrt( sum_k (P^t_{C1,k} - P^t_{C2,k})^2 / d(k) ),  where P^t_{C,k} = (1/|C|) * sum_{i in C} P^t_ik

It is easy to see that this distance is analogous to the distance between nodes, except that this time we compute the distance from two communities to the target node k, and the flow probability from a single community C to node k is obtained by averaging, over all nodes in C, their t-step flow probabilities to k.

Once the similarity between nodes has been extracted from the flow structure, community partitioning becomes a relatively simple clustering problem. For example, an agglomerative clustering method can be used as follows: first treat each node as its own community, then compute the flow distance between every pair of communities connected by an edge. Next, merge the two connected communities with the smallest flow distance, recompute the distances between communities, and keep iterating until all nodes have been merged into a single community. The number of communities keeps decreasing during this process, producing a hierarchical tree (dendrogram) structure, and the change in Q-modularity can be used to guide the direction of the search.
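
This whole procedure is available in python-igraph as community_walktrap; a short sketch (the example graph is again an assumed stand-in):

    import igraph as ig

    g = ig.Graph.Famous("Zachary")

    # Walktrap: agglomerative merging driven by t-step random-walk distances
    dendrogram = g.community_walktrap(steps=4)   # t = 4 random-walk steps

    # Cut the dendrogram where Q-modularity is highest
    clusters = dendrogram.as_clustering()
    print(len(clusters), "communities, Q =", g.modularity(clusters.membership))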

Label diffusion algorithm: label propagation

The idea of this algorithm stems from the cellular automaton model proposed by von Neumann in the 1950s and the sandpile model studied by Bak and others around 2002. The basic principle is as follows: first, assign a distinct label to each node in the network; second, at every step of the iteration, let each node adopt the most popular label among its neighbors (if several labels are tied, choose one of them at random); finally, when the iteration converges, nodes carrying the same label are grouped into the same community. The core of the algorithm is to simulate the diffusion of some kind of flow over the network through the diffusion of labels. Its advantage is simplicity: it is particularly suitable for analyzing networks shaped by flows, and in most cases it converges quickly. Its drawback is that the result of the iteration may be unstable, especially when edge weights are not taken into account; if the community structure is not pronounced, or if the network is small, it is possible for all nodes to end up in the same community.
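
A quick python-igraph sketch of label propagation (again on an assumed example graph; because of the random tie-breaking described above, repeated runs may give different partitions):

    import igraph as ig

    g = ig.Graph.Famous("Zachary")

    # Each node starts with its own label and repeatedly adopts the most
    # common label among its neighbors until the labels stop changing.
    clusters = g.community_label_propagation()
    print(len(clusters), "communities, Q =", g.modularity(clusters.membership))

    # The result can vary between runs, so it is common to repeat it:
    sizes = [len(g.community_label_propagation()) for _ in range(5)]
    print("community counts over 5 runs:", sizes)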

Summary

In this article, we have summarized different ideas for clustering nodes with community partitioning algorithms once a clickstream network has been constructed. They fall into two kinds, topological analysis and flow analysis; from a mathematical point of view, the former is mainly based on spectral analysis, while the latter mainly uses Markov chains as its modeling tool.
