Spark GraphX graph operations: Pregel in detail

Since code expresses this better than words, I'll show the test code first and then explain it:

package com.txq.spark.test

import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext, SparkException, graphx}

import scala.reflect.ClassTag

/**
 * Spark GraphX test
 * @author TongXueQiang
 */
object Test {

  System.setProperty("hadoop.home.dir", "D://hadoop-2.6.2")
  val conf = new SparkConf().setMaster("local").setAppName("TestRDDMethod")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    /*
    val rdd = sc.textFile("hdfs://spark:9000/user/spark/data/SogouQ.sample") // parse a Sogou search log
    val rdd1 = rdd.map(_.split("\t")).map(line => line(3)).map(_.split(" "))
    println("Total " + rdd1.count + " lines")
    val rdd2 = rdd1.filter(_(0).toInt == 1).filter(_(1).toInt == 1)
    println("Queries where the result rank and the click rank are both 1: " + rdd2.count + " lines")
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] = sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    val defaultUser = ("Jone", "Missing")
    val graph = Graph(users, relationships, defaultUser)
    val result = graph.vertices.filter { case (id, (name, pos)) => pos == "prof" }.count()
    println("Number of vertices whose title is prof: " + result)
    println(graph.edges.filter(e => e.srcId > e.dstId).count())
    graph.triplets.collect().foreach(println)
    graph.edges.collect().foreach(println)
    */
    /*
    val graph: Graph[Double, Int] = GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices((id, _) => id.toDouble)
    println(graph)
    println(graph.vertices)
    */

    /*
    val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
      triplet => {
        if (triplet.srcAttr > triplet.dstAttr) {
          Iterator((triplet.dstId, (1, triplet.srcAttr)))
        } else {
          Iterator.empty
        }
      },
      (a, b) => (a._1 + b._1, a._2 + b._2)
    )
    val avgAgeOfOlderFollowers: VertexRDD[Double] = olderFollowers.mapValues((id, value) => {
      value match {
        case (count, totalAge) => totalAge / count
      }
    })

    avgAgeOfOlderFollowers.collect().foreach(println)
    */
    // Take the Google web-link file (see the link at the end of this post) as an example and
    // demonstrate the Pregel method: starting from site V0, find the minimum number of link hops
    // needed to reach every other site, much like the shortest-path search in a map application.
    val graph: Graph[Double, Double] = GraphLoader.edgeListFile(sc, "hdfs://spark/user/spark/data/web-Google.txt", numEdgePartitions = 4).mapVertices((id, _) => id.toDouble).mapEdges(edge => edge.attr.toDouble)
    // Collect neighbor vertex IDs, using the custom method defined below
    collectNeighborIds(EdgeDirection.In, graph).foreach(line => {
      print(line._1 + ":")
      for (elem <- line._2) print(elem + " ")
      println()
    })
    val sourceId: VertexId = 0 // define the ID of the source page
    val g: Graph[Double, Double] = graph.mapVertices((id, attr) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
    // Under the hood Pregel calls the GraphOps.mapReduceTriplets method; its source code is walked through below
    val result = Pregel[Double, Double, Double](g, Double.PositiveInfinity)(
      (id, vd, newVd) => math.min(vd, newVd), // vprog: updates the vertex attribute to the smaller value, which feeds the innerJoin operation
      triplet => { // the map (sendMsg) function
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      (a, b) => math.min(a, b) // the reduce (mergeMsg) function
    )
    // Output. Note that Pregel returns a graph with updated vertex attributes, not a VertexRDD[(VertexId, VD)]
    print("Vertex closest to the source: " + result.vertices.filter(_._1 != 0).reduce(min)) // remember to filter out the source vertex itself
  }
  // Pick the vertex with the shorter path
  def min(a: (VertexId, Double), b: (VertexId, Double)): (VertexId, Double) = {
    if (a._2 < b._2) a else b
  }
  /**
   * Custom collection of each vertex's neighbor IDs
   * @author TongXueQiang
   */
  def collectNeighborIds[T, U](edgeDirection: EdgeDirection, graph: Graph[T, U])(implicit m: scala.reflect.ClassTag[T], n: scala.reflect.ClassTag[U]): VertexRDD[Array[VertexId]] = {
    val nbrs = graph.mapReduceTriplets[Array[VertexId]](
      // map function
      edgeTriplet => {
        val msgToSrc = (edgeTriplet.srcId, Array(edgeTriplet.dstId))
        val msgToDst = (edgeTriplet.dstId, Array(edgeTriplet.srcId))
        edgeDirection match {
          case EdgeDirection.Either => Iterator(msgToSrc, msgToDst)
          case EdgeDirection.Out => Iterator(msgToSrc)
          case EdgeDirection.In => Iterator(msgToDst)
          case EdgeDirection.Both => throw new SparkException("It doesn't make sense to collect neighbors without a direction. (EdgeDirection.Both is not supported; use EdgeDirection.Either instead.)")
        }
      },
      _ ++ _) // reduce function
    nbrs
  }

  /**
   * Custom Pregel function
   * @param graph the input graph
   * @param initialMsg the message every vertex receives before the first iteration
   * @param maxIterations the maximum number of iterations
   * @param activeDirection the edge direction along which messages are sent
   * @param vprog the function that updates a vertex attribute, used by the innerJoin operation
   * @param sendMsg the map function; returns an Iterator[(VertexId, A)], where the VertexId is the recipient of the message
   * @param mergeMsg the reduce function; usually a merge, or a min/max-style operation
   * @tparam A the message type
   * @tparam VD the vertex attribute type of the graph
   * @tparam ED the edge attribute type of the graph
   * @return the updated graph
   */
  def Pregel[A: ClassTag, VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], initialMsg: A, maxIterations: Int = Int.MaxValue, activeDirection: EdgeDirection = EdgeDirection.Either)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] = {
    Pregel0(graph, initialMsg, maxIterations, activeDirection)(vprog, sendMsg, mergeMsg) // calls Pregel0's apply method
  }
 

  // A sketch of the per-vertex join logic behind VertexRDD.innerJoin, which returns a VertexRDD
  def innerJoin[U: ClassTag, VD: ClassTag](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD) = {
    val uf = (id: VertexId, data: VD, o: Option[U]) => {
      o match {
        case Some(u) => mapFunc(id, data, u)
        case None => data
      }
    }
    uf
  }
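
  // A tiny, hypothetical demonstration of innerJoin semantics (the vertex IDs and values are
  // made up): vertices that appear on both sides get mapFunc applied; vertices that received
  // no message are dropped from the result. Step ③ of Pregel0 below relies on exactly this.
  def innerJoinDemo(): Unit = {
    val verts: VertexRDD[Double] = VertexRDD(sc.parallelize(Seq((1L, 10.0), (2L, 20.0), (3L, 30.0))))
    val msgs: RDD[(VertexId, Double)] = sc.parallelize(Seq((1L, 5.0), (3L, 1.0)))
    val joined = verts.innerJoin(msgs)((id, attr, msg) => math.min(attr, msg))
    joined.collect().foreach(println) // prints (1,5.0) and (3,1.0); vertex 2 got no message and is dropped
  }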
  // Test Option[T]
  def test(): Unit = {
    val map = Map("a" -> "1", "b" -> "2", "c" -> "3")
    def show(value: Option[String]): String = {
      value match {
        case Some(x) => x
        case None => "No value found!"
      }
    }
    println(show(map.get("a")) == "1")
  }
}

The rest of this post focuses on Pregel. For convenience, I redefined my own Pregel0 so I can instrument it:

package com.txq.spark.test

import org.apache.spark.Logging
import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}

import scala.reflect.ClassTag

/**
 * Custom Pregel object; the processing steps ① to ⑥ are annotated inline below.
 */
object Pregel0 extends Logging {
  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
      (graph: Graph[VD, ED],
       initialMsg: A,
       maxIterations: Int = Int.MaxValue,
       activeDirection: EdgeDirection = EdgeDirection.Either)
      (vprog: (VertexId, VD, A) => VD,
       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
       mergeMsg: (A, A) => A)
    : Graph[VD, ED] =
  {
    // ① Run the vertex program once on every vertex with the initial message
    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()

    // ② Compute the messages. Note that the mapReduceTriplets method is called; its signature
    // in the source is:
    //
    //   def mapReduceTriplets[A](
    //       map: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    //       reduce: (A, A) => A,
    //       activeSetOpt: Option[(VertexRDD[_], EdgeDirection)] = None)
    //     : VertexRDD[A]

    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    print("messages: " + messages.take(10).mkString("\n"))
    var activeMessages = messages.count()
    // Loop
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // ③ Receive the messages. Vertices that didn't get any message do not appear in newVerts.
      // This is an inner join; the result is a VertexRDD, as the debug output below shows.
      val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
      print("newVerts: " + newVerts.take(10).mkString("\n"))
      // ④ Update the graph with the new vertices.
      prevG = g // back up the old graph first, so it can be unpersisted after the update
      // outerJoinVertices returns the whole updated graph
      g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) } // getOrElse: if newOpt is defined, return its value, otherwise return old
      print(g.vertices.take(10).mkString("\n"))
      g.cache() // cache the new graph for use in the next iteration

      val oldMessages = messages // back it up, same idea as prevG = g above
      // Send new messages. Vertices that didn't get any message do not appear in newVerts, so they
      // don't send messages. We must cache messages so they can be materialized on the next line,
      // allowing us to uncache the previous iteration.
      // ⑤ Compute the new messages to be sent in the next iteration, and cache them first
      messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()
      print("Messages to be sent next iteration: " + messages.take(10).mkString("\n"))
      activeMessages = messages.count() // ⑥
      print("Number of messages to send next iteration: " + activeMessages) // if activeMessages == 0, the iteration ends
      logInfo("Pregel finished iteration " + i)
      // The old messages and the old graph are no longer needed; unpersist them
      oldMessages.unpersist(blocking = false)
      newVerts.unpersist(blocking = false) // once unpersisted, they can no longer be used
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
      i += 1
    }
    g // return the final graph
  }

}
Debug output (finding the vertices closest to vertex V0):

First iteration:

messages: (11342,1.0)
(824020,1.0)
(867923,1.0)
(891835,1.0)
newVerts: (11342,1.0)
(824020,1.0)
(867923,1.0)
(891835,1.0)
Messages to be sent next iteration: (302284,2.0)
(760842,2.0)
(417728,2.0)
(322178,2.0)
(387543,2.0)
(846213,2.0)
(857527,2.0)
(856657,2.0)
(695578,2.0)
(27469,2.0)
Number of messages to send next iteration: 29

Second iteration:

Messages to be sent next iteration: (754862,3.0)
(672426,3.0)
(320258,3.0)
(143557,3.0)
(789355,3.0)
(596104,3.0)
(118398,3.0)
(30115,3.0)
Number of messages to send next iteration: 141
And so on, until activeMessages == 0 and the iteration ends.

The caches above are needed for graph, messages, and newVerts. In Spark, RDD creation and transformation operations are lazy: they only record the lineage (essentially a plan kept in memory), no objects are actually materialized, and every action re-runs the lineage from the beginning. Once an RDD is cached, later actions reuse the materialized data, so they run much faster. After unpersist, an RDD can no longer be used, which is why the old graph and the old messages have to be backed up before being replaced.
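
To make this concrete, here is a minimal standalone sketch of that behavior (it assumes the sc defined in the test code above; expensiveTransform is a made-up stand-in for real work):

def expensiveTransform(x: Int): Double = math.sqrt(x.toDouble) // stand-in for real work

val plain = sc.parallelize(1 to 1000000).map(expensiveTransform) // lazy: only the lineage is recorded
plain.count() // action: the whole lineage is executed
plain.count() // no cache, so the whole lineage is executed again

val cached = sc.parallelize(1 to 1000000).map(expensiveTransform).cache()
cached.count() // the first action materializes the partitions and stores them in memory
cached.count() // served from memory, nothing is recomputed
cached.unpersist(blocking = false) // from here on, the next action recomputes from the lineage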

Since mapReduceTriplets alone can already solve most problems, why does Spark GraphX provide the Pregel API at all? Mainly to make iterative computation easier. In GraphX a graph is not cached automatically; it must be cached by hand. For each iteration to run fast, you have to cache the data you still need and unpersist the data you no longer need on every round, and that bookkeeping is hard to get right, because the vertices and the edges of a graph are cached separately. Pregel handles it for us. PageRank, for example, is an ideal fit for Pregel.
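
As an illustration, here is a minimal fixed-iteration PageRank written against the same Pregel signature as above. It is only a sketch, not GraphX's built-in PageRank implementation; the 0.15 reset probability and the 10-iteration cutoff are assumptions for illustration:

// Assume graph: Graph[Double, Double], as loaded from web-Google.txt above.
// Re-weight each edge to 1 / out-degree of its source vertex, and start every rank at 1.0.
val linkGraph: Graph[Double, Double] = graph
  .outerJoinVertices(graph.outDegrees) { (id, _, degOpt) => degOpt.getOrElse(0) }
  .mapTriplets(t => 1.0 / t.srcAttr)
  .mapVertices((id, _) => 1.0)

val ranks: Graph[Double, Double] = Pregel(linkGraph, 0.0, 10, EdgeDirection.Out)(
  (id, rank, msgSum) => 0.15 + 0.85 * msgSum,                           // vprog
  triplet => Iterator((triplet.dstId, triplet.srcAttr * triplet.attr)), // sendMsg
  (a, b) => a + b)                                                      // mergeMsg

ranks.vertices.top(5)(Ordering.by(_._2)).foreach(println) // the five highest-ranked pages

Note how all the per-iteration cache/unpersist bookkeeping discussed above is handled inside Pregel; the caller only supplies the three functions.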

The web-Google.txt.gz file: http://snap.stanford.edu/data/web-Google.html
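
For reference, the file is a plain edge list: lines starting with # are comments (GraphLoader.edgeListFile skips them), and every other line is a source vertex ID and a destination vertex ID separated by whitespace. A sketch of its shape (these four out-links of vertex 0 are inferred from the first-iteration messages above, not copied from the file):

# Directed graph: web-Google.txt
# FromNodeId    ToNodeId
0       11342
0       824020
0       867923
0       891835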

Produced by Tong, guaranteed quality! Focused on Spark GraphX, data mining, and machine learning source code and algorithms; stay grounded and write every line of code well!

