PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs (OSDI ')
2013-02-28 18:30:21 | Category: Iterative graph computation
This paper first presents the challenges facing existing parallel graph processing systems, then introduces the PowerGraph solution and proposes an effective partitioning scheme for power-law graphs.
Parallel graph processing systems such as Pregel and GraphLab are limited by the number of neighbors a vertex has, and the quality of the graph partitioning directly affects the system's communication overhead. However, real-world graphs such as social networks and web graphs typically follow power-law degree distributions, in which a small number of vertices are connected to a large fraction of the vertices in the graph, and partitioning power-law graphs is a hard problem in itself.
PowerGraph abstracts vertex-centric graph computation into a general computation model, the GAS model, which is divided into three phases: gather, apply, and scatter.
1. Gather phase: a user-defined sum operation is applied at each vertex to collect information from its adjacent vertices and the corresponding edges;
2. Apply phase: each vertex uses the sum computed in the previous phase to update its own value;
3. Scatter phase: the result of the second phase is used to update the data on the edges connected to the vertex.
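The three phases can be sketched as a small vertex program. This is a minimal single-machine illustration and not PowerGraph's actual API: the names `PageRankProgram`, `combine`, and `run_gas` are hypothetical, the damping factor 0.85 is an assumed value, and scatter is simplified to re-activating out-neighbors instead of writing edge data.

```python
from collections import defaultdict


class PageRankProgram:
    """Hypothetical vertex program; the method names mirror the GAS phases."""

    damping = 0.85      # assumed value, not taken from the source
    tolerance = 1e-6    # assumed convergence threshold

    def gather(self, rank, out_degree, u):
        # Contribution of one in-neighbor u.
        return rank[u] / out_degree[u]

    def combine(self, a, b):
        # The user-defined "sum" that merges gathered values.
        return a + b

    def apply(self, acc):
        # New vertex value computed from the accumulated sum.
        return (1.0 - self.damping) + self.damping * acc

    def scatter(self, old, new):
        # Simplification: instead of writing edge data, report whether the
        # out-neighbors should be re-activated.
        return abs(new - old) > self.tolerance


def run_gas(edges, program, max_supersteps=100):
    """Run the program synchronously until no vertex is active."""
    in_nbrs, out_nbrs = defaultdict(list), defaultdict(list)
    out_degree = defaultdict(int)
    vertices = set()
    for src, dst in edges:
        in_nbrs[dst].append(src)
        out_nbrs[src].append(dst)
        out_degree[src] += 1
        vertices.update((src, dst))

    rank = {v: 1.0 for v in vertices}
    active = set(vertices)
    for _ in range(max_supersteps):
        if not active:
            break
        new_rank, next_active = dict(rank), set()
        for v in active:
            acc = 0.0
            for u in in_nbrs[v]:                       # gather
                acc = program.combine(acc, program.gather(rank, out_degree, u))
            new_rank[v] = program.apply(acc)           # apply
            if program.scatter(rank[v], new_rank[v]):  # scatter
                next_active.update(out_nbrs[v])
        rank, active = new_rank, next_active
    return rank
```

Calling `run_gas([(1, 0), (2, 0)], PageRankProgram())` runs until the active set is empty; vertices whose value stops changing no longer wake their neighbors.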
Figure 1: PageRank
Because vertex computation invokes the gather operation frequently while the values of most adjacent vertices do not change, PowerGraph provides a caching mechanism to reduce the amount of computation. Figure 1 shows pseudocode for PageRank under the PowerGraph model.
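The text above does not describe how the cache works, so the following is only a sketch under an assumption: the cache stores each vertex's gathered sum, and a neighbor whose value changes posts the difference so that the full gather is not repeated. The class `GatherCache` and its methods are hypothetical names.

```python
class GatherCache:
    """Hypothetical cache of per-vertex gather results."""

    def __init__(self):
        self._acc = {}

    def get(self, v, full_gather):
        # Run the full gather only when nothing is cached for v.
        if v not in self._acc:
            self._acc[v] = full_gather(v)
        return self._acc[v]

    def post_delta(self, v, delta):
        # A neighbor changed by `delta`; patch the cached sum in place.
        if v in self._acc:
            self._acc[v] += delta

    def invalidate(self, v):
        # Force the next get() to recompute from scratch.
        self._acc.pop(v, None)


# Usage: vertex 0 gathers from neighbors 1 and 2.
values = {1: 1.0, 2: 2.0}
in_nbrs = {0: [1, 2]}
gather_calls = []


def full_gather(v):
    gather_calls.append(v)
    return sum(values[u] for u in in_nbrs[v])


cache = GatherCache()
first = cache.get(0, full_gather)    # computed by a full gather
values[1] = 1.5
cache.post_delta(0, 0.5)             # neighbor 1 grew by 0.5
second = cache.get(0, full_gather)   # served from the cache
```

The trade-off in this design is that a delta is only valid when the sum operation can be updated incrementally; otherwise `invalidate` falls back to a full gather.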
PowerGraph proposes a balanced graph partitioning scheme that reduces communication during computation while maintaining load balance. Unlike the random hash-based placement used by both Pregel and GraphLab, it uses a balanced p-way vertex-cut partitioning scheme. The expected value of the vertex-cut is computed from the probability density function of the graph's overall degree distribution.
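As an illustration of how such an expectation can be computed, the sketch below assumes the simplest placement rule, in which every edge is assigned uniformly at random to one of p machines. A vertex of degree d then has a replica on a given machine with probability 1 - (1 - 1/p)^d. This rule and the function name `expected_replication` are assumptions made for the example; they may not match the exact formula the original post displayed.

```python
def expected_replication(degrees, p):
    """Expected average number of machines each vertex spans.

    Assumes each edge is placed independently and uniformly at random on
    one of p machines, so a vertex of degree d is expected to appear on
    p * (1 - (1 - 1/p) ** d) machines. Isolated vertices (d == 0)
    contribute zero under this formula.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if not degrees:
        raise ValueError("degrees must not be empty")
    total = sum(p * (1.0 - (1.0 - 1.0 / p) ** d) for d in degrees)
    return total / len(degrees)
```

Under this assumption the expectation grows with the vertex degree, which is why the few very high-degree vertices of a power-law graph account for most of the replication.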
The vertices are cut according to this expectation, and the traditional communication process is modified accordingly, as shown in Figure 2 below.
Figure 2: Communication process based on vertex-cut
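The communication pattern of a vertex-cut can be sketched as follows. This is a toy single-process simulation under stated assumptions: edges are placed on machines at random, the replica on the lowest-numbered machine is arbitrarily treated as the master, and each other machine sends one partial sum per vertex to that master. The names `vertex_cut` and `gather_over_cut` are hypothetical.

```python
import random
from collections import defaultdict


def vertex_cut(edges, p, seed=0):
    """Assign every edge to exactly one of p machines (random placement)."""
    rng = random.Random(seed)
    machine_edges = defaultdict(list)   # machine -> edges stored there
    replicas = defaultdict(set)         # vertex -> machines holding a copy
    for src, dst in edges:
        m = rng.randrange(p)
        machine_edges[m].append((src, dst))
        replicas[src].add(m)
        replicas[dst].add(m)
    # Assumption: the lowest-numbered replica acts as the master copy.
    master = {v: min(ms) for v, ms in replicas.items()}
    return machine_edges, replicas, master


def gather_over_cut(machine_edges, master, value):
    """Sum in-neighbor values per vertex and count cross-machine messages."""
    partial = defaultdict(float)        # (machine, vertex) -> local partial sum
    for m, local_edges in machine_edges.items():
        for src, dst in local_edges:
            partial[(m, dst)] += value[src]
    total, messages = defaultdict(float), 0
    for (m, v), acc in partial.items():
        total[v] += acc                 # master merges the partial sums
        if m != master[v]:
            messages += 1               # one message per non-master replica
    return dict(total), messages
```

In this sketch a high-degree vertex costs at most one message per machine that holds some of its edges, instead of one message per cut edge.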
In the experiments, PowerGraph is implemented in three versions that differ in execution mode: globally synchronous, globally asynchronous, and asynchronous serializable.
1. Globally synchronous: similar to Pregel, each superstep sets up a global synchronization point at which all edge and vertex changes are synchronized;
2. Globally asynchronous: similar to GraphLab, every change made to an edge or vertex in the apply and scatter phases is applied to the graph immediately;
3. Asynchronous serializable: global asynchrony makes algorithm design and debugging very complex, and some algorithms may run less efficiently than under global synchronization, so a mode that combines global asynchrony with serializability is also provided.
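The difference between the synchronous and asynchronous modes can be illustrated on a single machine with PageRank. In the sketch below, which is an illustration and not the system's engine, `step_sync` reads only the values from the previous superstep, while `step_async` makes each update visible immediately. Because the asynchronous version runs vertices one at a time in a fixed order, it is trivially serializable. The function names and the damping factor 0.85 are assumptions.

```python
def step_sync(rank, in_nbrs, out_deg, d=0.85):
    # Every vertex reads the ranks from the previous superstep.
    return {v: (1 - d) + d * sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            for v in rank}


def step_async(rank, in_nbrs, out_deg, d=0.85):
    # Each update is written back at once, so later vertices see it.
    rank = dict(rank)
    for v in sorted(rank):
        rank[v] = (1 - d) + d * sum(rank[u] / out_deg[u] for u in in_nbrs[v])
    return rank


def iterate(step, rank, in_nbrs, out_deg, tol=1e-10, max_sweeps=1000):
    """Repeat `step` until no rank changes by more than `tol`."""
    for sweep in range(1, max_sweeps + 1):
        new = step(rank, in_nbrs, out_deg)
        if max(abs(new[v] - rank[v]) for v in rank) < tol:
            return new, sweep
        rank = new
    return rank, max_sweeps


# Small example graph with edges 0->1, 0->2, 1->2, 2->0.
in_nbrs = {0: [2], 1: [0], 2: [0, 1]}
out_deg = {0: 2, 1: 1, 2: 1}
start = {v: 1.0 for v in in_nbrs}
sync_rank, sync_sweeps = iterate(step_sync, start, in_nbrs, out_deg)
async_rank, async_sweeps = iterate(step_async, start, in_nbrs, out_deg)
```

Both modes reach the same fixed point on this graph; they differ in how many sweeps they need and in how hard the intermediate states are to reason about.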
For fault tolerance, PowerGraph relies on checkpointing and adopts the Chandy-Lamport snapshot algorithm used in GraphLab.
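The idea behind a Chandy-Lamport snapshot can be shown with a toy simulation. This is a single-threaded sketch of the general algorithm over FIFO channels, not the GraphLab or PowerGraph implementation: processes hold a numeric balance, send transfers to each other, and use marker messages to record a consistent global state. All names here (`Process`, `transfer`, `start_snapshot`, `drain`, `snapshot_total`) are hypothetical.

```python
from collections import deque
from itertools import permutations


class Process:
    def __init__(self, pid, balance, peers):
        self.pid, self.balance, self.peers = pid, balance, peers
        self.recorded_balance = None  # local state captured by the snapshot
        self.recording = {}           # source -> transfers seen while recording
        self.channel_state = {}       # source -> recorded channel contents


def make_system(balances):
    ids = list(range(len(balances)))
    procs = [Process(i, b, [j for j in ids if j != i])
             for i, b in enumerate(balances)]
    channels = {pair: deque() for pair in permutations(ids, 2)}  # FIFO
    return procs, channels


def transfer(procs, channels, src, dst, amount):
    procs[src].balance -= amount
    channels[(src, dst)].append(("transfer", amount))


def start_snapshot(proc, channels):
    # Record local state, start recording all incoming channels, send markers.
    proc.recorded_balance = proc.balance
    proc.recording = {q: [] for q in proc.peers}
    for q in proc.peers:
        channels[(proc.pid, q)].append(("marker", None))


def deliver(procs, channels, src, dst):
    kind, amount = channels[(src, dst)].popleft()
    proc = procs[dst]
    if kind == "transfer":
        proc.balance += amount
        if src in proc.recording:          # message was in flight at snapshot
            proc.recording[src].append(amount)
    else:
        if proc.recorded_balance is None:  # first marker: snapshot locally
            start_snapshot(proc, channels)
        proc.channel_state[src] = proc.recording.pop(src)


def drain(procs, channels):
    # Deliver one message per non-empty channel until all channels are empty.
    while any(channels.values()):
        for src, dst in sorted(channels):
            if channels[(src, dst)]:
                deliver(procs, channels, src, dst)


def snapshot_total(procs):
    in_flight = sum(sum(msgs) for p in procs
                    for msgs in p.channel_state.values())
    return sum(p.recorded_balance for p in procs) + in_flight
```

A snapshot is consistent when the recorded balances plus the recorded in-flight transfers add up to the total that existed in the system, even though no process was paused while the snapshot was taken.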
In summary, the paper abstracts a graph computation model, designs a balanced graph partitioning scheme, implements and compares the system under three different execution modes, and provides fault tolerance.