Pregel and Spark GraphX's Pregel API

Introduction
After the rise of Hadoop, Google released three research papers on Caffeine, Pregel, and Dremel. These three technologies became known as Google's new "troika". Of the three, Pregel is the framework Google proposed for large-scale distributed graph computing. It is mainly used for graph traversal (BFS), single-source shortest paths (SSSP), PageRank, and similar computations.
In the Pregel computation model, the input is a directed graph in which every vertex has a unique vertex ID (vertex identifier). Each vertex has modifiable properties whose initial values are defined by the user. Each directed edge is associated with its source vertex, carries user-defined attributes and values, and records the ID of its destination vertex.
A typical Pregel computation proceeds as follows: read the input and initialize the graph; once the graph is initialized, run a series of supersteps, each of which runs independently from a global point of view; when the whole computation ends, output the results.

A vertex in Pregel is in one of two states: active or inactive (halted). If a vertex receives a message and needs to perform a computation, it sets itself to active. If it receives no message, or receives one but finds that it has nothing to compute, it sets itself to inactive. This mechanism is described in the following diagram:

Calculation process
The computation in Pregel is divided into supersteps, which are executed as follows:
1. The graph data is read in and initialized.
2. Every node is set to the active state. Each node sends messages to its neighboring nodes according to the predefined SendMessage function and the direction of the edge (forward, reverse, or bidirectional).
3. Each node receives its messages. If it finds that it needs to compute, it processes the received messages with the predefined computation function, which may update the node's own value. If it receives a message but does not need to compute, it sets its state to inactive.
4. Each active node sends messages to its neighboring nodes according to the SendMessage function.
5. The next superstep starts and continues from step 3. When all nodes have become inactive, the whole computation is over.
Here is a concrete example of the process. Suppose a graph has 4 nodes, numbered 1, 2, 3, and 4 from left to right. The number in each circle is the attribute value of the node, solid lines are the edges between nodes, dashed lines are the messages sent between steps, and shaded circles are inactive nodes. Our goal is to make the attribute value of every node equal to the largest value in the graph.

Superstep 0: All nodes are set to active and send their attribute values to neighboring nodes along their outgoing edges.
Superstep 1: All nodes receive messages. Nodes 1 and 4 find that they have received a value larger than their own, so they update themselves (this can be regarded as the computation) and remain active. Nodes 2 and 3 do not receive a value larger than their own, so they neither compute nor update. The active nodes then send their attribute values to their neighbors.
Superstep 2: Node 3 receives a message and computes. The other nodes either receive no message or receive one but do not compute, so only node 3 is active and sends messages.
Superstep 3: Nodes 2 and 4 receive messages but do not compute, so they become inactive. All nodes are now inactive, so the computation ends.
The Pregel computing framework has two core functions: the SendMessage function and the node computation function F(Vertex).
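The superstep loop of this example can be sketched in plain Scala, without Spark. This is only a minimal sketch and not how Pregel itself is implemented. The names MaxValueSketch, run, and send are made up for illustration, and because the figure is not reproduced here, the starting values (3, 6, 2, 1) and the edge list in main are assumptions chosen so that the run matches the superstep descriptions above.

```scala
object MaxValueSketch {
  type VertexId = Int

  // Runs supersteps until no messages are in flight.
  // Returns the final vertex values and the number of supersteps executed.
  def run(values: Map[VertexId, Int],
          edges: Seq[(VertexId, VertexId)]): (Map[VertexId, Int], Int) = {
    // Outgoing neighbours of each vertex.
    val out: Map[VertexId, Seq[VertexId]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }

    // "SendMessage": every active vertex sends its value along its outgoing edges.
    def send(active: Set[VertexId],
             current: Map[VertexId, Int]): Map[VertexId, Seq[Int]] =
      active.toSeq
        .flatMap(v => out.getOrElse(v, Seq.empty[VertexId]).map(dst => dst -> current(v)))
        .groupBy(_._1)
        .map { case (dst, msgs) => dst -> msgs.map(_._2) }

    var current = values
    // Superstep 0: every vertex is active and sends its value.
    var inbox = send(values.keySet, current)
    var supersteps = 1
    while (inbox.nonEmpty) {
      // Vertex computation: a vertex updates itself (and stays active)
      // only if it received a value larger than its own.
      val updated = inbox.collect {
        case (v, msgs) if msgs.max > current(v) => v -> msgs.max
      }
      current = current ++ updated
      // Only the vertices that were updated send messages in this superstep.
      inbox = send(updated.keySet, current)
      supersteps += 1
    }
    (current, supersteps)
  }

  def main(args: Array[String]): Unit = {
    // Assumed graph: the values and edges below are NOT taken from the article.
    val values = Map(1 -> 3, 2 -> 6, 3 -> 2, 4 -> 1)
    val edges = Seq((1, 2), (2, 1), (2, 4), (3, 2), (3, 4), (4, 3))
    val (finalValues, steps) = run(values, edges)
    println(s"values after $steps supersteps: ${finalValues.toSeq.sorted}")
  }
}
```

With the assumed graph, nodes 1 and 4 are updated in superstep 1, node 3 in superstep 2, and superstep 3 delivers messages that change nothing, which is the sequence described above.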

Spark GraphX's Pregel API
Spark provides a Pregel API in its GraphX component, which lets us process graph data on Spark with the Pregel computing model. The following operations are performed in spark-shell: we create a graph and then explain how Pregel works, using single-source shortest paths as the example.
Before starting, we import the packages we may need:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

Then we build the graph from the web-Google.txt file stored in HDFS. The file can be downloaded from https://snap.stanford.edu/data/web-Google.html.
val graph = GraphLoader.edgeListFile(sc, "/spark/web-google.txt")

When a graph is first built with edgeListFile, the attribute values of all vertices and edges (and therefore of all triplets) are the integer 1, because we did not specify them.

Calculation
First set the source vertex; here the source vertex is 0:
val sourceId: VertexId = 0


The graph is then initialized:

val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

This code sets the attribute value of every vertex other than the source to infinity, because we use the vertex attribute to hold the length of the shortest path from the source to that vertex. Before the computation starts, the path length from the source to itself is set to 0 and the path length to every other vertex is set to infinity; whenever a shorter path is found, it replaces the current length. If a vertex cannot be reached from the source, its path length naturally stays infinite.
The shortest paths are now calculated:
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist), // Vertex program
  triplet => { // Send message
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b) // Merge message
)


Let's print some of the values in sssp to take a look:

We can see that the shortest path from vertex 0 to vertex 354796 has length 11, and that vertex 291526 is unreachable.
The process is explained in detail below.
When the pregel method is called, initialGraph is implicitly converted to the GraphOps class. The source code of the pregel method in that class is as follows:

def pregel[A: ClassTag](
    initialMsg: A,
    maxIterations: Int = Int.MaxValue,
    activeDirection: EdgeDirection = EdgeDirection.Either)(
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A)
  : Graph[VD, ED] = {
  Pregel(graph, initialMsg, maxIterations, activeDirection)(vprog, sendMsg, mergeMsg)
}

This method uses a typical curried definition with two parameter lists. The first parameter list contains initialMsg, maxIterations, and activeDirection. The first parameter, initialMsg, is the message that every vertex receives in the first iteration, that is, in superstep 0. maxIterations is the maximum number of iterations. activeDirection determines along which edges messages are sent: it selects the edges, relative to a vertex that received a message in the previous round, on which sendMsg is run. Its type is EdgeDirection, whose possible values are EdgeDirection.In, EdgeDirection.Out, EdgeDirection.Either, and EdgeDirection.Both. As you can see, the second and third parameters have default values.
The second parameter list consists of three functions: vprog, sendMsg, and mergeMsg.
vprog is the user-defined vertex program, which runs on a single vertex. In superstep 0 it runs on every vertex with initialMsg and produces a new vertex value. In later supersteps it runs only on vertices that have received a message.
sendMsg is run for the vertices that received a message in the current step and sends messages to neighboring vertices; these messages are used in the next step of the computation.
mergeMsg aggregates the messages sent to the same vertex. It takes two messages of type A and returns a single message of type A.
Finally, the apply method of the Pregel object is called, which returns a Graph object.
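The effect of the two parameter lists and the default values can be seen in a small plain-Scala function. This is a sketch for illustration only: foldMessages and CurrySketch are hypothetical names and are not part of GraphX. The function only mimics the shape of the signature, with values (one of them defaulted) in the first list and a function in the second.

```scala
object CurrySketch {
  // Hypothetical helper with the same "values first, functions second" shape.
  // maxMessages has a default value, like maxIterations in pregel.
  def foldMessages[A](
      initialMsg: A,
      messages: Seq[A],
      maxMessages: Int = Int.MaxValue)(
      mergeMsg: (A, A) => A): A =
    messages.take(maxMessages).foldLeft(initialMsg)(mergeMsg)

  def main(args: Array[String]): Unit = {
    // The default for maxMessages is used; A is inferred as Double from the
    // first parameter list, so the lambda needs no type annotations.
    val smallest =
      foldMessages(Double.PositiveInfinity, Seq(3.0, 1.0, 2.0))((a, b) => math.min(a, b))
    // The default is overridden by name.
    val firstOnly =
      foldMessages(Double.PositiveInfinity, Seq(3.0, 1.0, 2.0), maxMessages = 1)(
        (a, b) => math.min(a, b))
    println(s"smallest = $smallest, firstOnly = $firstOnly")
  }
}
```

Because the type parameter is fixed by the first parameter list, the functions in the second list can be written as short lambdas, which is why the earlier call initialGraph.pregel(Double.PositiveInfinity)(...) needs no explicit types.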
The source code of the apply method is as follows; we can see that the graph and the computation parameters are passed in:
def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
   (graph: Graph[VD, ED],
    initialMsg: A,
    maxIterations: Int = Int.MaxValue,
    activeDirection: EdgeDirection = EdgeDirection.Either)
   (vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A)
  : Graph[VD, ED] = {
  // The maximum number of iterations must be greater than 0, otherwise an error is raised.
  require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," +
    s" but got ${maxIterations}")

  // First iteration: run vprog on every vertex with the initial message.
  var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
  // Compute the messages for the next iteration using the send and merge functions.
  var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
  // Count how many vertices are active.
  var activeMessages = messages.count()
  // Enter the iteration loop.
  var prevG: Graph[VD, ED] = null
  var i = 0
  while (activeMessages > 0 && i < maxIterations) {
    // Receive the messages and update the vertices.
    prevG = g
    g = g.joinVertices(messages)(vprog).cache()

    val oldMessages = messages
    // Send new messages, skipping edges where neither side received a message. We must cache
    // messages so it can be materialized on the next line, allowing us to uncache the previous
    // iteration.
    // mapReduceTriplets implements the sending and aggregation of messages: it takes a map
    // function and a reduce function; here sendMsg is the map function and mergeMsg is the
    // reduce function.
    messages = GraphXUtils.mapReduceTriplets(
      g, sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
    // The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
    // (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
    // and the vertices of g).
    activeMessages = messages.count()

    logInfo("Pregel finished iteration " + i)

    // Unpersist the RDDs hidden by newly-materialized RDDs
    oldMessages.unpersist(blocking = false)
    prevG.unpersistVertices(blocking = false)
    prevG.edges.unpersist(blocking = false)
    // count the iteration
    i += 1
  }
  messages.unpersist(blocking = false)
  g
} // end of apply


Let's go back to the single-source shortest path algorithm we started with:
First, the attribute value of the source vertex is set to 0 and the attribute values of all other vertices are set to infinity.
Superstep 0: All vertices are initialized with initialMsg. In fact this initialization does not change anything, because the initial message is infinity and the vertex program takes the minimum.
Superstep 1: For each triplet, triplet.srcAttr + triplet.attr is compared with triplet.dstAttr. Take a first example: suppose there is an edge from 0 to a. It satisfies triplet.srcAttr + triplet.attr < triplet.dstAttr: the value of triplet.attr is 1 (we did not specify it, so it has the default value 1) and the attribute of vertex 0 was initialized to 0, so 0 + 1 < infinity and the message (a, 1) is sent. In every triplet, messages travel from src to dst. If an edge goes from 3 to 5, then triplet.srcAttr + triplet.attr < triplet.dstAttr does not hold, because infinity plus 1 is still infinity, and no message is sent. After superstep 1, all vertices directly connected to 0 have the attribute 1 and become active vertices; the attributes of the other vertices are unchanged and those vertices become inactive. The active vertices continue to send messages according to triplet.srcAttr + triplet.attr < triplet.dstAttr, and the mergeMsg function aggregates multiple messages sent to the same vertex; the result of the aggregation is the smallest value.
Superstep 2: Every vertex that receives a message compares its own attribute with the value it was sent and keeps the smaller one as its attribute. It then becomes an active vertex and continues to send the message attr + 1 to its neighbors, where the messages are again aggregated.
This continues until no vertex attribute is updated any more, at which point the loop condition activeMessages > 0 && i < maxIterations (there are still active vertices and the maximum number of iterations has not been reached) no longer holds. At this point the shortest paths from vertex 0 to the other vertices have been found; each path length is stored in the attribute of the corresponding vertex.
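The walk-through above can be reproduced on a tiny in-memory graph with plain Scala collections. This is a minimal sketch under stated assumptions, not GraphX code: SsspSketch, SimpleEdge, and collectMessages are hypothetical names, the example graph in main is invented, and for simplicity only edges whose source received a message are re-examined (which corresponds to EdgeDirection.Out rather than the default EdgeDirection.Either). The three functions vprog, sendMsg, and mergeMsg have the same bodies as in the sssp call earlier.

```scala
object SsspSketch {
  type VertexId = Long

  // Hypothetical edge type; attr defaults to 1, like edges loaded by edgeListFile.
  final case class SimpleEdge(src: VertexId, dst: VertexId, attr: Int = 1)

  def shortestPaths(
      vertexIds: Set[VertexId],
      edges: Seq[SimpleEdge],
      sourceId: VertexId,
      maxIterations: Int = Int.MaxValue): Map[VertexId, Double] = {
    val initialMsg = Double.PositiveInfinity
    val vprog: (VertexId, Double, Double) => Double =
      (_, dist, newDist) => math.min(dist, newDist)
    val mergeMsg: (Double, Double) => Double = (a, b) => math.min(a, b)

    def sendMsg(e: SimpleEdge, dist: Map[VertexId, Double]): Iterator[(VertexId, Double)] =
      if (dist(e.src) + e.attr < dist(e.dst)) Iterator((e.dst, dist(e.src) + e.attr))
      else Iterator.empty

    // Runs sendMsg on the edges leaving an active vertex and merges the
    // messages addressed to the same destination with mergeMsg.
    def collectMessages(dist: Map[VertexId, Double],
                        active: Set[VertexId]): Map[VertexId, Double] =
      edges.iterator
        .filter(e => active(e.src))
        .flatMap(e => sendMsg(e, dist))
        .toSeq
        .groupBy(_._1)
        .map { case (dst, ms) => dst -> ms.map(_._2).reduce(mergeMsg) }

    val initial: Map[VertexId, Double] =
      vertexIds.map(id => id -> (if (id == sourceId) 0.0 else Double.PositiveInfinity)).toMap

    // Superstep 0: vprog runs on every vertex with initialMsg (changes nothing here).
    var dist = initial.map { case (id, d) => id -> vprog(id, d, initialMsg) }
    var messages = collectMessages(dist, vertexIds)
    var i = 0
    while (messages.nonEmpty && i < maxIterations) {
      // Vertices that received a message run vprog and become the active set.
      dist = dist ++ messages.map { case (id, m) => id -> vprog(id, dist(id), m) }
      messages = collectMessages(dist, messages.keySet)
      i += 1
    }
    dist
  }

  def main(args: Array[String]): Unit = {
    // Invented example graph: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3, 4 -> 3.
    val edges = Seq(SimpleEdge(0L, 1L), SimpleEdge(0L, 2L), SimpleEdge(1L, 3L),
      SimpleEdge(2L, 3L), SimpleEdge(4L, 3L))
    val dist = shortestPaths(Set(0L, 1L, 2L, 3L, 4L), edges, sourceId = 0L)
    dist.toSeq.sortBy(_._1).foreach { case (id, d) => println(s"$id -> $d") }
  }
}
```

In the invented graph, vertex 3 receives two messages with the same value in the second round, which mergeMsg reduces to one, and vertex 4 never receives a message, so its distance stays infinite.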
