Community Discovery
Community structure is closely related to graph partitioning in computer science and hierarchical clustering in sociology.
Hierarchical clustering is a traditional approach to finding community structure in a network. It comes in two kinds, depending on whether edges are added to or removed from the network: agglomerative methods and divisive methods.
### 1. Modularity-based methods
**Unweighted graphs:**
Modularity has become a common standard in recent years for measuring the quality of a community partition. The basic idea is to compare the network with a corresponding null model, measuring how much the actual network differs from a random network.
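Number of edges inside communities in the actual network:

$$Q_{\text{real}} = \frac{1}{2}\sum_{i,j} A_{ij}\,\delta(c_i, c_j)$$

Expected number of edges inside communities under the null model:

$$Q_{\text{null}} = \frac{1}{2}\sum_{i,j} P_{ij}\,\delta(c_i, c_j)$$

Here $A_{ij}$ is the adjacency matrix, $P_{ij}$ is the expected number of edges between nodes $i$ and $j$ under the null model, $c_i$ is the community of node $i$, and $\delta$ equals 1 when its two arguments are equal and 0 otherwise.
Modularity: the difference between the number of intra-community edges in the network and the expected number in the corresponding null model, as a fraction of the total number of edges $M$:

$$Q = \frac{Q_{\text{real}} - Q_{\text{null}}}{M}$$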
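A minimal Python sketch of the modularity formula above. The notes leave the null model unspecified; this sketch assumes the common configuration-model choice $P_{ij} = k_i k_j / (2M)$, where $k_i$ is the degree of node $i$. The function name `modularity` and the edge-list input format are illustrative choices, not part of the original notes.

```python
from collections import defaultdict

def modularity(edges, community):
    """Q = (Q_real - Q_null) / M for an undirected, unweighted graph.

    edges: list of (u, v) pairs without self-loops or duplicates
    community: dict mapping node -> community label
    Assumption: null model P_ij = k_i * k_j / (2M) (configuration model).
    """
    m = len(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Q_real = 1/2 * sum_ij A_ij * delta(c_i, c_j): edges inside communities.
    q_real = sum(1 for u, v in edges if community[u] == community[v])

    # Q_null = 1/2 * sum_ij k_i k_j / (2M) * delta(c_i, c_j)
    #        = sum over communities of (summed degree)^2 / (4M).
    degree_sum = defaultdict(int)
    for node, k in degree.items():
        degree_sum[community[node]] += k
    q_null = sum(d * d for d in degree_sum.values()) / (4 * m)

    return (q_real - q_null) / m

# Two triangles joined by one bridge edge, split into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, community))  # (6 - 3.5) / 7, about 0.357
```

Putting every node into a single community gives Q = 0 under this null model, which is a quick sanity check on the implementation.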
For directed and weighted networks, the number of edges within a community is replaced by the sum of edge weights, and node degree is replaced by node strength.
Modularity-based community detection algorithm: the CNM algorithm, with complexity O(n log² n), [**algorithm code**](http://www.cs.unm.edu/~aaron/research/fastmodularity.htm "CNM algorithm")
Modularity has limitations: it cannot identify communities that are sufficiently small, and modularity values cannot be used directly to compare the quality of community partitions across two different networks.
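The greedy idea behind CNM-style algorithms can be sketched as follows: start with every node in its own community and repeatedly merge the pair of communities whose merge increases modularity the most. This is a naive illustration only. It does not use the sparse data structures that give CNM its O(n log² n) complexity, it assumes the configuration null model $P_{ij} = k_i k_j / (2M)$, and the function name `greedy_modularity_merge` is hypothetical. Under that null model, merging communities $a$ and $b$ changes modularity by $e_{ab}/M - D_a D_b / (2M^2)$, where $e_{ab}$ is the number of edges between them and $D_a$, $D_b$ are their summed degrees.

```python
from collections import defaultdict

def greedy_modularity_merge(edges):
    """Naive greedy agglomeration on an undirected, unweighted edge list."""
    m = len(edges)
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    # Start with every node in its own community, labelled by the node itself.
    communities = {node: {node} for node in degree}
    label = {node: node for node in degree}

    while len(communities) > 1:
        # Count the edges running between each pair of distinct communities.
        between = defaultdict(int)
        for u, v in edges:
            a, b = label[u], label[v]
            if a != b:
                between[frozenset((a, b))] += 1

        # Pick the merge with the largest strictly positive modularity gain.
        best_gain, best_pair = 0.0, None
        for pair, e_ab in between.items():
            a, b = tuple(pair)
            d_a = sum(degree[n] for n in communities[a])
            d_b = sum(degree[n] for n in communities[b])
            gain = e_ab / m - d_a * d_b / (2 * m * m)
            if gain > best_gain:
                best_gain, best_pair = gain, (a, b)

        if best_pair is None:  # no merge improves modularity any further
            break
        a, b = best_pair
        for n in communities[b]:
            label[n] = a
        communities[a] |= communities.pop(b)

    return sorted(sorted(c) for c in communities.values())

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(greedy_modularity_merge(edges))  # [[0, 1, 2], [3, 4, 5]]
```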
### 2. Clique percolation algorithm
**Basic concepts:** *k-clique, adjacent, connected, k-clique community*
1. **k-clique:** a complete subgraph with k nodes, that is, every two of the k nodes are joined by an edge.
2. **Adjacent:** two k-cliques are adjacent if they share k-1 nodes.
3. **Connected:** two k-cliques are connected if one can be reached from the other through a sequence of **adjacent** k-cliques.
Algorithm for finding the cliques of size s that contain a node v:
```
1. Initialize set A = {v} and set B = {neighbors of v};
2. Move one node from B to A, and delete from B every node that is not connected to all nodes in A;
3. If B becomes empty before A reaches size s, or if A and B are both subsets of an existing larger clique, stop and return to the previous step. Otherwise, once A reaches size s, a new clique has been found: record it, then return to the previous step and continue looking for new cliques that contain node v.
```
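The backtracking steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: it searches for cliques of exactly size `s`, it omits the "subset of an existing larger clique" pruning, and it relies on a set to discard cliques reached in more than one order. The name `cliques_of_size` and the adjacency-dict input are illustrative choices.

```python
def cliques_of_size(adj, s):
    """Enumerate all cliques with exactly s nodes by backtracking.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph)
    Returns a set of frozensets, one per clique.
    """
    found = set()

    def extend(a, b):
        # Invariant: every node in b is adjacent to every node in a.
        if len(a) == s:
            found.add(frozenset(a))   # A reached size s: record the clique
            return
        if len(a) + len(b) < s:
            return                    # B is too small to reach size s: backtrack
        for node in b:
            # Move `node` from B to A and keep only candidates adjacent to it.
            extend(a | {node}, (b - {node}) & adj[node])

    for v in adj:
        extend({v}, set(adj[v]))      # A = {v}, B = neighbours of v
    return found

# Two triangles joined by the bridge edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(cliques_of_size(adj, 3))  # the two triangles {0, 1, 2} and {3, 4, 5}
```

The search is exponential in the worst case, so this form is only suitable for small graphs or small values of `s`.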
First find the cliques of size k, then use these cliques together with the definitions of adjacency and connectedness to find the k-clique communities: each k-clique community is the union of all k-cliques that are connected to one another.
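Given the k-cliques, the percolation step can be sketched with a union-find structure: cliques that share k-1 nodes are merged into one group, and each group's nodes form one community. The function name `percolate_cliques` is hypothetical, and the clique list is assumed to have been computed beforehand.

```python
from itertools import combinations

def percolate_cliques(cliques, k):
    """Group adjacent k-cliques (sharing k-1 nodes) into k-clique communities."""
    cliques = [frozenset(c) for c in cliques]
    parent = list(range(len(cliques)))

    def find(i):
        # Follow parent pointers to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Two k-cliques are adjacent when they have at least k-1 nodes in common.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    # A community is the union of the nodes of all cliques in one group.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())

triangles = [{0, 1, 2}, {1, 2, 3}, {3, 4, 5}]
print(percolate_cliques(triangles, 3))  # [[0, 1, 2, 3], [3, 4, 5]]
```

In this example node 3 belongs to both communities, which shows how the method produces overlapping communities.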
### 3. Link community algorithm
In 2010, Ahn, Bagrow, and Lehmann proposed a new idea for detecting overlapping and hierarchical community structure: a community is a set of closely connected edges, rather than a set of closely connected nodes as it is usually defined.
A community partition obtained this way is therefore a collection of edges, rather than a collection of nodes as in traditional algorithms.
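A rough sketch of the edge-based view in Python. The notes do not say how the closeness of two edges is measured. This sketch assumes the similarity used in the Ahn, Bagrow, and Lehmann paper: for two edges that share a node, the Jaccard index of the inclusive neighborhoods (neighbors plus the node itself) of their two non-shared endpoints. Edges are then grouped by single linkage at a fixed `threshold`. That threshold is a hypothetical simplification, since the original method selects the cut by maximizing partition density. The function name `link_communities` is also illustrative.

```python
from itertools import combinations

def link_communities(edges, threshold):
    """Group edges by neighbourhood similarity; return the node set of each group.

    edges: list of (u, v) pairs without self-loops or duplicates
    threshold: hypothetical similarity cut-off in [0, 1]
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    inclusive = {n: nbrs | {n} for n, nbrs in adj.items()}

    parent = list(range(len(edges)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(range(len(edges)), 2):
        shared = set(edges[a]) & set(edges[b])
        if len(shared) != 1:
            continue                  # only edges sharing one node are compared
        (k,) = shared
        i = next(n for n in edges[a] if n != k)
        j = next(n for n in edges[b] if n != k)
        sim = len(inclusive[i] & inclusive[j]) / len(inclusive[i] | inclusive[j])
        if sim >= threshold:
            parent[find(a)] = find(b)  # single linkage: merge the two edge groups

    # A node belongs to every community that contains one of its edges.
    groups = {}
    for idx, (u, v) in enumerate(edges):
        groups.setdefault(find(idx), set()).update((u, v))
    return sorted(sorted(g) for g in groups.values())

# Two triangles that share node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
print(link_communities(edges, 0.5))  # [[0, 1, 2], [2, 3, 4]]
```

Because communities are sets of edges, node 2 ends up in both communities without any special handling of overlap.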
Testing and evaluation criteria for community detection:
1. Benchmark graphs (karate club network, planted l-partition model)
A proposed community detection algorithm is run on these benchmark graphs to assess its strengths and weaknesses on real and typical data samples.
2. Metadata
The community detection algorithm is measured against other indicators: (1) community quality, (2) overlap quality, (3) community coverage, (4) overlap coverage.
### 4. Other methods
Block models.
Matrix factorization methods.
##### Co-clustering [1]
1. Co-clustering, also known as biclustering, is a clustering approach used in fields such as gene expression and text analysis, where it clusters genes together with their expression conditions, or documents together with words, at the same time.
2. For training data expressed as a matrix, consider co-clustering when the rows and columns are correlated, because clustering either dimension on its own ignores the information in the other dimension.
3. Basic principle of co-clustering: alternate between a row-clustering step and a column-clustering step until convergence.
### 5. Trying out related software:
1. Gephi
Good visualization.
**Community detection algorithm:** the **Louvain algorithm**. The method is based on modularity. Initially each node is treated as its own community. Each node is then tentatively assigned to the community of each of its neighbors, the modularity after each assignment is computed, and the assignment with the largest modularity is kept as the partition for this step. Each resulting community is then collapsed into a single "node", and the steps above are repeated until the modularity of the whole network no longer changes.
2. [GraphChi](https://github.com/GraphChi/graphchi-cpp "GraphChi-GitHub")
A project developed by Dr. Aapo Kyrola of Carnegie Mellon University, an offshoot of GraphLab. It can carry out computation on large graphs on a single machine.
3. [CFinder](http://www.cfinder.org/ "CFinder official website")
Open-source software for finding and visualizing overlapping communities.
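The first phase of the Louvain procedure described under Gephi above (moving nodes into neighboring communities) can be sketched in Python as follows. This is not Gephi's implementation. It covers only the node-moving phase and leaves out the step that collapses communities into "nodes". It also assumes an unweighted, undirected graph and the configuration null model $P_{ij} = k_i k_j / (2M)$, and the name `louvain_local_moving` is hypothetical. Under that null model, placing an isolated node of degree $k$ into community $c$ changes modularity by $\ell_c / M - D_c k / (2M^2)$, where $\ell_c$ is the number of edges from the node into $c$ and $D_c$ is the summed degree of $c$.

```python
from collections import defaultdict

def louvain_local_moving(edges):
    """Node-moving phase of a Louvain-style method.

    edges: list of (u, v) pairs without self-loops or duplicates;
           nodes must be sortable so that the visiting order is deterministic.
    Returns a dict mapping node -> community label.
    """
    m = len(edges)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {n: len(nbrs) for n, nbrs in adj.items()}

    label = {n: n for n in adj}   # every node starts as its own community
    total = dict(degree)          # summed degree of each community

    moved = True
    while moved:
        moved = False
        for node in sorted(adj):
            old = label[node]
            total[old] -= degree[node]   # take the node out of its community

            # Count the edges from `node` into each neighbouring community.
            links = {}
            for nbr in adj[node]:
                links[label[nbr]] = links.get(label[nbr], 0) + 1

            def gain(c):
                # Modularity change from placing the isolated node into c.
                return links.get(c, 0) / m - total[c] * degree[node] / (2 * m * m)

            # Stay in the old community unless a neighbour's is strictly better.
            best, best_gain = old, gain(old)
            for c in sorted(links):
                if gain(c) > best_gain:
                    best, best_gain = c, gain(c)

            label[node] = best
            total[best] += degree[node]
            moved = moved or best != old
    return label

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(louvain_local_moving(edges))  # nodes 0-2 share one label, nodes 3-5 another
```

Every accepted move strictly increases modularity, so the loop terminates once no node can be improved.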
[1] Co-clustering by Block Value Decomposition. SIGKDD '05.