Author: Zhang Junlin
This article is excerpted from Chapter 14 of the book "Big Data Daily Report: Architecture and Algorithms".
Pregel is a large-scale distributed graph computing platform proposed by Google. It is designed to solve the large-scale distributed graph computing problems that arise in web page link analysis, social data mining, and other practical applications.
1. Computing Model
Conceptually, Pregel follows the BSP (Bulk Synchronous Parallel) model. The entire computation consists of a sequence of supersteps executed in order. The system moves from one superstep to the next until the algorithm's termination condition is met (see Figure 4-13).
In its programming model, Pregel follows a node-centric pattern. In superstep S, each graph node gathers the messages sent to it by other nodes in superstep S-1, changes its own state accordingly, and sends messages to other nodes. After synchronization, these messages are received and processed by the other nodes in superstep S+1. The user only needs to define a compute function F(Vertex) for graph nodes to implement this per-node logic. Other tasks, such as task allocation, task management, and system fault tolerance, are handled by the Pregel system.
A typical Pregel computation consists of graph input, graph initialization, and a sequence of supersteps separated by global synchronization points, after which the computation results are output.
Each node has two states: active and inactive. At the beginning of the computation, every node is active. As the computation proceeds, some nodes finish their work and become inactive. If an inactive node receives a new message, it becomes active again. When all nodes in the graph are inactive, the computing task is complete and Pregel outputs the result.
Next, we use a concrete computing task to illustrate the Pregel graph computing model. The task is to propagate the maximum value among the nodes of the graph to all other nodes; Figure 4-14 illustrates it. Solid arrows in the figure show the links of the graph, and the number in each node is its current value. Dotted lines show the messages passed between supersteps, and nodes shaded with diagonal hatching are inactive.
As the figure shows, 6 is the maximum value in the graph. In superstep 0, all nodes are active and the system executes the user function F(Vertex): each node sends its own value along its links, and each node that receives messages selects the largest received value and compares it with its own. If the received value is larger, the node updates itself to the new value; otherwise, it becomes inactive.
In superstep 0, each node sends its own value along its links. The system then enters superstep 1 and executes F(Vertex). The nodes in the first and fourth rows are updated to the new value 6 because they receive a value greater than their own. The nodes in the second and third rows do not receive a value greater than their own, so they become inactive. After the function has run, the nodes that are still active send messages again, and the system enters superstep 2. The node in the second row, which was inactive, becomes active again because it receives a new message, and its value is updated to 6; the other nodes become inactive. Pregel then enters superstep 3, in which all nodes are inactive, so the computing task ends. The maximum value has been propagated to all other nodes in the graph in four supersteps. Algorithm 14.1 is the Pregel C++ code that reflects this process.
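The superstep-by-superstep behaviour described above can be tried out in a small single-machine simulation. The following is a minimal Python sketch, not the book's Algorithm 14.1 and not the Pregel API: the function name `propagate_max`, the inbox/outbox dictionaries, and the four-node example graph are all hypothetical, and the graph is not claimed to be the one in Figure 4-14.

```python
def propagate_max(values, out_edges):
    """Simulate Pregel-style max-value propagation on one machine.

    values:    dict mapping vertex -> initial value
    out_edges: dict mapping vertex -> list of target vertices
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    active = set(values)                 # every vertex starts active
    inbox = {v: [] for v in values}      # messages delivered at the last barrier
    superstep = 0
    while active or any(inbox.values()):
        outbox = {v: [] for v in values}
        for v in values:
            msgs = inbox[v]
            if v not in active and not msgs:
                continue                 # inactive and no new message: skip
            active.add(v)                # an incoming message reactivates a vertex
            if superstep == 0:
                changed = True           # superstep 0: every vertex announces its value
            else:
                best = max(msgs, default=values[v])
                changed = best > values[v]
                if changed:
                    values[v] = best
            if changed:
                for target in out_edges.get(v, []):
                    outbox[target].append(values[v])
            else:
                active.discard(v)        # nothing new learned: become inactive
        inbox = outbox                   # global synchronization point
        superstep += 1
    return values, superstep


# Hypothetical example graph (assumed for illustration only).
example_values = {"A": 3, "B": 6, "C": 2, "D": 1}
example_edges = {"A": ["B"], "B": ["A", "D"], "C": ["D"], "D": ["C"]}
final_values, supersteps = propagate_max(example_values, example_edges)
```

On this example graph every vertex ends with the value 6 after four supersteps (supersteps 0 to 3). Messages written to `outbox` become visible only after the barrier, which is the property that makes the model synchronous.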
2. System Architecture
Pregel uses a master-slave architecture to implement its overall functionality; Figure 4-15 shows the architecture. One server acts as the "master server". It is responsible for splitting the whole graph into subgraphs (Hash(ID) = ID mod N, where N is the number of worker servers) and assigning them to a large number of "worker servers". The master server also instructs the worker servers to run each superstep, coordinates the synchronization barrier, and collects the computation results. The master server performs only system management and takes no part in the actual graph computation.
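The partitioning rule Hash(ID) = ID mod N can be illustrated in a few lines. This is a minimal Python sketch that assumes integer vertex IDs; the function names `assign_worker` and `partition` are hypothetical and are not part of Pregel.

```python
def assign_worker(vertex_id: int, num_workers: int) -> int:
    """Return the index of the worker server that owns this vertex (ID mod N)."""
    return vertex_id % num_workers


def partition(vertex_ids, num_workers):
    """Group vertex IDs into one subgraph (here just a list of IDs) per worker."""
    parts = {w: [] for w in range(num_workers)}
    for vid in vertex_ids:
        parts[assign_worker(vid, num_workers)].append(vid)
    return parts


subgraphs = partition(range(10), 3)
# subgraphs == {0: [0, 3, 6, 9], 1: [1, 4, 7], 2: [2, 5, 8]}
```

Because the owner of a vertex is a pure function of its ID, any worker can work out where to send a message without asking the master server.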
Each "worker server" maintains the state of the subgraphs assigned to it, including their nodes and edges. In the initial phase of the computation, all graph nodes are active, and the worker server calls the user-defined function F(Vertex) on each currently active node in turn. Note that all data is loaded into memory for computing. In addition, the worker server manages the communication between its local subgraphs and the subgraphs maintained by other worker servers.
In the subsequent computation, the "master server" sends a command to the "worker servers" to start a superstep, and each worker server calls F(Vertex) on its active nodes in turn. When all active nodes have been processed, the worker server reports to the master server the number of nodes that remain active after the current round. The computation ends when all graph nodes are inactive.
Pregel uses checkpoints as its fault tolerance mechanism. Before a superstep starts, the "master server" can instruct the "worker servers" to write the data partitions they are responsible for to persistent storage. The saved content includes node values, edge values, and the messages pending for each node.
The "master server" monitors the status of the "worker servers" through heartbeats. When a worker server fails, the master server reassigns the data partitions it was responsible for to other worker servers. A worker server that takes over this work reads the latest checkpoint of the corresponding partitions from persistent storage and resumes from there. The superstep of that checkpoint may be earlier than the current superstep of the system. In that case, all worker servers roll back to the superstep of the checkpoint and start computing again from that point.
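The checkpoint-and-rollback behaviour can be sketched with a toy simulation. The Python below is a sketch under simplifying assumptions, not Pregel's implementation: the whole graph state is a single Python object, `step_fn` stands in for one full superstep, at most one failure is injected, and the names `run_with_checkpoints`, `checkpoint_every`, and `fail_at` are hypothetical.

```python
import copy


def run_with_checkpoints(initial_state, step_fn, total_supersteps,
                         checkpoint_every, fail_at=None):
    """Run supersteps with periodic checkpoints and one optional failure.

    Returns (final state, number of supersteps actually executed,
    counting supersteps that had to be redone after the rollback).
    """
    state = copy.deepcopy(initial_state)
    checkpoint = (0, copy.deepcopy(state))     # (superstep, saved state)
    superstep = 0
    executed = 0
    failed = False
    while superstep < total_supersteps:
        if superstep % checkpoint_every == 0:
            # Checkpoint is written before the superstep starts.
            checkpoint = (superstep, copy.deepcopy(state))
        if fail_at is not None and superstep == fail_at and not failed:
            failed = True
            # All workers roll back to the superstep of the latest checkpoint.
            superstep, saved = checkpoint
            state = copy.deepcopy(saved)
            continue
        state = step_fn(state)
        executed += 1
        superstep += 1
    return state, executed


def increment(s):
    """Toy superstep: add 1 to every value."""
    return [x + 1 for x in s]


clean, clean_steps = run_with_checkpoints([0, 10], increment, 10, 4)
recovered, recovered_steps = run_with_checkpoints([0, 10], increment, 10, 4, fail_at=6)
```

Both runs end in the same state, `[10, 20]`. The failure-free run executes 10 supersteps, while the run that fails at superstep 6 executes 12, because supersteps 4 and 5 are redone from the checkpoint taken at superstep 4.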
From the description above, we can see that Pregel is a message-driven, synchronous graph computing framework that follows a node-centric programming model. Because there is only one "master server", both logically and physically, Pregel clearly has a potential single point of failure.
Think about this: when choosing the checkpoint interval, a checkpoint can be written at every superstep or once every several supersteps. What are the advantages and disadvantages of these two approaches?
A: A short checkpoint interval adds overhead to the task, but if a machine fails, the system rolls back to a more recent point, which helps the task finish sooner. A long interval is the opposite: checkpoints are written less often, so the routine overhead is low, but if a machine fails, more supersteps must be rolled back and redone, which lengthens the task. The choice is therefore an overall trade-off.
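The trade-off can be made concrete with simple arithmetic. The sketch below rests on a simplifying model that is an assumption of this example, not something stated in the book: checkpoints are written at supersteps 0, k, 2k, and so on, and a single failure at superstep f rolls the system back to the most recent checkpoint. The function name `checkpoint_tradeoff` is hypothetical.

```python
def checkpoint_tradeoff(total_supersteps, interval, fail_at):
    """Return (checkpoints written in a failure-free run, supersteps redone).

    Assumes checkpoints at supersteps 0, interval, 2*interval, ... and a
    single failure detected at superstep fail_at.
    """
    checkpoints_written = -(-total_supersteps // interval)  # ceiling division
    redone_supersteps = fail_at % interval                   # distance to latest checkpoint
    return checkpoints_written, redone_supersteps


every_step = checkpoint_tradeoff(total_supersteps=10, interval=1, fail_at=6)
every_four = checkpoint_tradeoff(total_supersteps=10, interval=4, fail_at=6)
# every_step == (10, 0): many checkpoints, nothing to redo
# every_four == (3, 2):  few checkpoints, two supersteps to redo
```

Which interval is better depends on how expensive a checkpoint is compared with a superstep and on how often failures are expected, neither of which this model tries to estimate.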
3. Pregel Application
This section describes how to construct a specific application under the Pregel framework through several common graph computing applications.
(1) PageRank Calculation
PageRank is an important ranking factor in search engines. Its basic ideas and computing principles are described earlier in this chapter. The following is the C++ sample code for PageRank calculation using Pregel.
The Compute() function is the compute function F(Vertex) of a node in superstep S. Users can develop an application quickly by inheriting from the interface class Vertex and overriding the interface function Compute(MessageIterator* msgs), where MessageIterator* msgs is the queue of messages sent to the current node in superstep S-1. The function first sums the partial PageRank scores passed to the current node in the message queue, and then computes the node's current PageRank score from the PageRank formula. If the current superstep has not yet reached the termination condition of 30 iterations, the new PageRank value is propagated along the edges to the adjacent nodes. Otherwise, the node signals that it is finished and becomes inactive.
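The logic described above can be reproduced in a single-machine simulation. The following Python is a minimal sketch, not the C++ listing referred to in the text and not the Pregel API. It assumes the common damping factor 0.85, an initial rank of 1/n, and a graph in which every node has at least one out-edge; the function name `pagerank` and the example graph are hypothetical.

```python
def pagerank(out_edges, num_supersteps=30, damping=0.85):
    """Simulate vertex-centric PageRank.

    out_edges: dict mapping vertex -> list of target vertices.
    Assumes every vertex has at least one out-edge (no dangling nodes).
    """
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    inbox = {v: [] for v in out_edges}
    for superstep in range(num_supersteps + 1):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            if superstep >= 1:
                # Sum the partial scores received from the previous superstep.
                rank[v] = (1 - damping) / n + damping * sum(inbox[v])
            if superstep < num_supersteps and targets:
                # Split the current score evenly over the out-edges.
                share = rank[v] / len(targets)
                for t in targets:
                    outbox[t].append(share)
            # Otherwise the vertex sends nothing, i.e. it becomes inactive.
        inbox = outbox                      # synchronization point
    return rank


ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this example graph, node C receives score from both A and B, so it ends with a higher rank than B, and the ranks sum to 1 because no node is dangling.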
(2) Single-source shortest path
Finding the shortest path between nodes in a graph is a very common graph algorithm. The "single-source shortest path" problem asks for the shortest distance from the source node StartV to every other node in the graph. The following example shows how to calculate single-source shortest paths on the Pregel platform.
The code shows that a graph node V finds the shortest distance seen so far from the message queue it received in the previous superstep. If this distance is smaller than V's current value, a shorter path has been found: the node value is updated to the new shortest distance, and the new value is propagated along the edges to the adjacent nodes. Otherwise, the current node becomes inactive. After the computation, if the shortest distance of a node is still marked INF, there is no reachable path between that node and the source node.
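The same logic can be sketched as a single-machine simulation. The Python below is a sketch under stated assumptions, not the C++ listing the text refers to: edge weights are assumed to be non-negative, each message carries the sender's distance plus the weight of the edge it travels along, and the names `shortest_paths` and `edges` are hypothetical.

```python
INF = float("inf")


def shortest_paths(edges, source):
    """Simulate vertex-centric single-source shortest paths.

    edges: dict mapping vertex -> list of (target, weight) pairs;
           every target must also appear as a key.
    Returns a dict mapping vertex -> shortest distance from source (INF if unreachable).
    """
    dist = {v: INF for v in edges}
    inbox = {v: [] for v in edges}
    superstep = 0
    while superstep == 0 or any(inbox.values()):
        outbox = {v: [] for v in edges}
        for v in edges:
            if superstep > 0 and not inbox[v]:
                continue                       # inactive: no new message
            candidate = min(inbox[v], default=INF)
            if v == source:
                candidate = min(candidate, 0)  # the source is at distance 0
            if candidate < dist[v]:
                dist[v] = candidate            # shorter path found
                for target, weight in edges[v]:
                    outbox[target].append(candidate + weight)
            # In either case the vertex now becomes inactive until a message arrives.
        inbox = outbox                         # synchronization point
        superstep += 1
    return dist


graph = {
    "S": [("A", 1), ("B", 4)],
    "A": [("B", 2)],
    "B": [("C", 1)],
    "C": [],
    "D": [],            # not reachable from S
}
distances = shortest_paths(graph, "S")
```

Here `distances` is `{"S": 0, "A": 1, "B": 3, "C": 4, "D": inf}`: node B first receives distance 4 directly from S and later improves it to 3 via A, and node D keeps the value INF because it is unreachable.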
(3) Maximum matching of a bipartite graph
Maximum matching of a bipartite graph is also a classic graph computing problem. The following describes how Pregel uses a randomized matching method to address this problem.
The Pregel program uses a randomized matching method to address the bipartite matching problem. Each graph node maintains a pair: ('L/R', matched node ID). 'L/R' indicates whether the node is on the left side or the right side of the bipartite graph and serves as its identification mark. The other element of the pair records the ID of the node it has been matched with.
The algorithm runs in the following four phases.
Phase 1: each left node of the bipartite graph that has not yet been matched sends a match request to its adjacent nodes and then becomes inactive.
Phase 2: each right node that has not yet been matched randomly selects one of the match requests it has received, sends a grant message to the left node that issued it, and then becomes inactive.
Phase 3: each left node that has not yet been matched and that receives grant messages selects one of them to accept, records the matched node ID to indicate that it is now matched, and sends an acceptance message to the corresponding right node. Left nodes that are already matched take no action in this phase, because they sent no messages in phase 1.
Phase 4: each right node that has not yet been matched picks the acceptance message sent to it in phase 3 and records the matched node ID to indicate that it is now matched.
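By iterating these four phases, which resemble a two-way handshake, until no further matches can be made, the matching of the bipartite graph is obtained. Strictly speaking, the randomized handshake yields a maximal matching, that is, one to which no further edge can be added.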
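The four phases can be sketched as a loop on a single machine. The Python below is a minimal sketch, not the Pregel program the text refers to. Phases 3 and 4 are folded into one step because the left node's acceptance fully determines what the right node records. The names `bipartite_matching`, `left_edges`, and `seed` are hypothetical, and a fixed seed is used only to make runs repeatable.

```python
import random


def bipartite_matching(left_edges, seed=0):
    """Randomized handshake matching on a bipartite graph.

    left_edges: dict mapping each left vertex -> list of adjacent right vertices.
    Returns a dict mapping matched left vertex -> matched right vertex.
    """
    rng = random.Random(seed)
    match_left = {}     # left vertex  -> matched right vertex
    match_right = {}    # right vertex -> matched left vertex
    while True:
        # Phase 1: every unmatched left vertex sends a request to its neighbours.
        requests = {}
        for left, rights in left_edges.items():
            if left not in match_left:
                for right in rights:
                    requests.setdefault(right, []).append(left)
        # Phase 2: every unmatched right vertex grants one request at random.
        grants = {}
        for right, askers in requests.items():
            if right not in match_right:
                grants.setdefault(rng.choice(askers), []).append(right)
        if not grants:
            break       # no unmatched pair is still connected by an edge
        # Phase 3: each left vertex accepts one grant at random and records it.
        # Phase 4: the accepted right vertex records the match as well.
        for left, granters in grants.items():
            right = rng.choice(granters)
            match_left[left] = right
            match_right[right] = left
    return match_left


example = {"L1": ["R1", "R2"], "L2": ["R1"], "L3": ["R2", "R3"]}
matching = bipartite_matching(example)
```

The loop stops when no unmatched right node receives a request, so no edge of the graph then has both endpoints unmatched. The size of the matching may differ between seeds, since a maximal matching is not guaranteed to have maximum cardinality.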