Two-Degree Relationship Calculation: Problem Description
A two-degree relationship is a relationship between a user and another user reached through an intermediate "bridge" user. Weibo currently uses two-degree relationships for potential-user recommendation. One-degree relationships come in two types, follows and friends; two-degree relationships come in four bridge types: the follow of a follow, the friend of a follow, the follow of a friend, and the friend of a friend.
Suppose we want to recommend, for the hundreds of millions of users across the whole site, the TopN users with the highest bridge weights according to the two-degree relationship and the four bridge types. A rough estimate of the total number of relationships shows the scale of the problem: computing the full two-degree relationship with the original MapReduce model would mean using the bridge user as the key and joining the follow table with the fan table. With active users numbering in the hundreds of millions and an average follow count in the hundreds, the Join would have to transfer hundreds of TB of data. In addition, MapReduce's shuffle repeatedly sorts intermediate results and spills them to HDFS, so this implementation cannot be satisfied in terms of memory and bandwidth, nor can it meet the business's timeliness requirements.
The two-degree relationship recommendation can be abstracted as finding, in a directed graph, all vertices whose minimum distance from a specified vertex is 2; a vertex satisfying this condition is called a two-hop neighbor of that vertex. This is a classic graph problem, and a distributed graph computing model has great advantages here in how the algorithm is expressed and in scalability.
Original article: https://kknews.cc/tech/jv2mk4l.html
Let's take a two-degree relationship as an example and describe it below.
As shown in the figure, a one-way arrow indicates a follow and a two-way arrow indicates a friend relationship; the number on an arrow is the edge weight. For example, the bridge weight from A to C1 = b1 (0.5 + 0.6) + b2 (0.7 + 0.1) = 1.9, and the recommendation reason is "friend of a friend". We need to compute, over the whole site's effective follow relationships and according to the above model, the two-hop neighbors C of user A, then remove those C that A already follows directly, and finally take the TopN of C ordered by bridge weight from high to low.
Framework Selection
At present, the mainstream distributed graph computing frameworks in the industry are Giraph and GraphX. Giraph is an iterative graph computing system. The input to a Giraph computation is a graph made up of vertices and the edges that directly connect them. For example, a vertex can represent a person, and an edge can represent a friend request. Each vertex holds a value, and each edge holds a value. The input depends not only on the topology of the graph, but also on the initial values of the vertices and edges.
Giraph was open-sourced by Yahoo, with Google's Pregel as its prototype. It became an Apache Software Foundation open-source project in 2012 and has since been adopted by Facebook, which has made a variety of improvements.
GraphX is an important part of the Apache open-source project Spark. It began as a distributed graph computing framework project at Berkeley AMPLab and was later integrated into Spark as a core component. GraphX is Spark's API for graphs and graph-parallel computation; in effect it is a rewrite and optimization of GraphLab and Pregel on top of Spark (Scala). GraphX's greatest advantage over other distributed graph computing frameworks is that it provides a one-stack data solution on top of Spark, so a complete pipeline of graph computation can be carried out conveniently and efficiently.
(Figure: end-to-end PageRank performance, iterations over 3.7B edges)
In GraphX on Spark, a graph is represented as RDDs, distributed datasets that can be loaded into memory. Because memory naturally supports random access, and most Spark operations run in memory rather than sequentially on disk as in MapReduce, Spark is better suited to graph problems. GraphX is faster than Giraph in end-to-end running time for iterative graph processing (see the figure), so we decided to use GraphX for two-degree relationship mining and recommendation.
Solving the Two-Degree Relationship with GraphX
Basic Concepts
Property graph: a property graph is a directed multigraph with user-defined objects attached to each vertex and edge. A multigraph can contain multiple parallel edges that share the same source and destination vertices. Support for parallel edges simplifies modeling scenarios in which the same pair of vertices has multiple relationships (such as likes and blocks). Each vertex is keyed by a unique VertexId of type Long.
A property graph consists of two RDDs: VertexRDD[VD] and EdgeRDD[ED], representing the vertices and the edges respectively. VD and ED are the attribute types of the vertices and edges. Like RDDs, they are immutable, distributed, and fault-tolerant, and the most important of these properties is immutability. Logically, every transformation or operation on a graph produces a new graph; physically, GraphX reuses unchanged vertex and edge data to a certain degree, transparently to the user.
The figure shows an example of a property graph.
Vertices and edges: the most basic elements of a graph are vertices and edges. GraphX describes a directed graph that has vertex attributes and edge attributes, and it provides three views: vertex (Vertex), edge (Edge), and edge triplet (EdgeTriplet). The various graph operations in GraphX are carried out through these three views. As shown, a vertex contains the vertex ID and the vertex data (VD); an edge contains the source vertex ID (srcId), the destination vertex ID (dstId), and the edge data (ED). An edge triplet is an extension of an edge that adds the source vertex data and destination vertex data on top of the edge. In many graph operations, the edge data and the data of the vertices it connects are assembled into edge triplets, and the operation is then performed on those triplets.
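As a minimal, hedged sketch (the vertex names, edge weights, and the variable name followGraph are illustrative assumptions, and an existing SparkContext sc is assumed), building a small property graph and inspecting its triplets looks like this:
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
// Hypothetical vertices: (vertexId, screen name)
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "A"), (2L, "B1"), (3L, "C1")))
// Hypothetical edges: follow relationships carrying a weight attribute
val follows: RDD[Edge[Double]] =
  sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 3L, 0.6)))
// Build the property graph; the third argument is the default attribute for missing vertices
val followGraph: Graph[String, Double] = Graph(users, follows, "unknown")
// Each EdgeTriplet combines the edge attribute with srcAttr and dstAttr
followGraph.triplets.collect().foreach { t =>
  println(s"${t.srcAttr} follows ${t.dstAttr} with weight ${t.attr}")
}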
Distributed Storage of Graphs
GraphX stores graph data in RDDs distributed across the cluster, using a vertex RDD (VertexRDD) and an edge RDD (EdgeRDD) to hold the vertex set and the edge set. The vertex RDD hashes vertex data by vertex ID and spreads it over multiple partitions in the cluster. The edge RDD partitions edges according to a specified partitioning strategy (PartitionStrategy), by default hashing on the srcId of each edge, and spreads the edge data over multiple partitions in the cluster. In addition, the vertex RDD also holds routing information from vertices to edge RDD partitions, the routing table. The routing table lives inside the vertex RDD partitions and records, for the vertices in a partition, which edge RDD partitions they relate to. When the edge RDD needs vertex data, for example to build edge triplets, the vertex RDD sends vertex data to the edge RDD partitions according to the routing table. As shown, the vertex RDD, the edge RDD, and the routing table are obtained by splitting the graph with the vertex-cut method.
In graph computation, some per-edge computations need vertex data, that is, they need to form the edge triplet view; for example, PageRank needs the vertex rank values to be sent to the edge RDD partitions where the edges live. Based on the routing table, GraphX generates from the vertex RDD a replicated vertex view (ReplicatedVertexView) that matches the edge RDD partitioning; it acts as an intermediate RDD that transfers vertex data to the edge RDD partitions. The replicated vertex view is partitioned the same way as the edge RDD and carries the vertex data; as shown, replicated vertex partition A holds all the vertices needed by edge RDD partition A and is co-partitioned with the edge RDD (same number of partitions and same partitioner). During graph computation, GraphX zips (zipPartitions) the replicated vertex view with the edge RDD, combining the partitions of the two one by one so that the edges are joined with their vertex data and each edge partition obtains the vertex data it needs. In the whole process of forming edge triplets, data moves between partitions only when the replicated vertex view is built from the vertex RDD; the zip operation itself moves neither vertex data nor edge data. Because vertex data is generally far smaller than edge data, and because the number of vertices that need updating shrinks as iterations proceed (so the vertex data carried in the replicated vertex view shrinks correspondingly), this greatly reduces data movement and speeds up execution.
GraphX stores vertex and edge data as arrays inside the partitions of the vertex RDD and edge RDD so as not to lose element-access performance, and it builds a number of index structures inside the partitions to provide fast access to vertex or edge data. The structure of the graph does not change during iteration, so the index structures in the vertex RDD, the edge RDD, and the replicated vertex view can all be reused; when one graph is derived from another, only the data arrays of the vertex RDD and the edge RDD are updated. Index reuse is the key to GraphX maintaining high performance, and it is also the main reason its performance improves so much over implementing the graph model with native RDDs.
Solution Process
First, a property graph is constructed in which the attribute of each vertex is a Map(dstId -> distance), initialized to Map(own vertex id -> 0). The two-degree relationships are then solved in two iterations.
First iteration: traverse each edge, send the dst vertex's attribute dstAttr, marked with hop count 1, to the src vertex; the src vertex receives the messages and merges them into its vertex attribute srcAttr.
Second iteration: traverse the edges again, filter out the key-value pairs in dstAttr whose hop count is 1, attach dstId as the bridge vertex, and send them to the corresponding src vertex; finally aggregate these messages to obtain all two-hop neighbors.
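A minimal sketch of these two iterations with aggregateMessages might look like the following; the variable names and the merge logic are illustrative assumptions (continuing the hypothetical followGraph above), not the production code:
import org.apache.spark.graphx._
// Vertex attribute: Map(reachable vertex id -> hop count), initialized to Map(own id -> 0)
val initial: Graph[Map[VertexId, Int], Double] =
  followGraph.mapVertices((id, _) => Map(id -> 0))
// First iteration: every edge sends dstAttr, re-marked as 1 hop, back to its src (the fan)
val oneHop: VertexRDD[Map[VertexId, Int]] = initial.aggregateMessages[Map[VertexId, Int]](
  ctx => ctx.sendToSrc(ctx.dstAttr.map { case (v, _) => v -> 1 }),
  (a, b) => a ++ b,
  TripletFields.Dst)
val graph1 = initial.joinVertices(oneHop)((_, attr, msg) => attr ++ msg)
// Second iteration: forward only the 1-hop entries of dstAttr to the src vertex,
// recording ctx.dstId as the bridge; aggregating yields all two-hop neighbours
// (the real job also carries edge weights so each bridge can be scored)
val twoHop: VertexRDD[Map[VertexId, VertexId]] = graph1.aggregateMessages[Map[VertexId, VertexId]](
  ctx => {
    val viaThisBridge = ctx.dstAttr.filter { case (_, hops) => hops == 1 }
                                   .map { case (v, _) => v -> ctx.dstId }
    if (viaThisBridge.nonEmpty) ctx.sendToSrc(viaThisBridge)
  },
  (a, b) => a ++ b,
  TripletFields.Dst)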
Best Practices
Graph Partitioning
As mentioned above, GraphX uses the vertex-cut method and stores the graph data in three RDDs:
VertexTable(id, data): id is the vertex ID, data is the vertex attribute
EdgeTable(pid, src, dst, data): pid is the partition ID, src is the source vertex ID, dst is the destination vertex ID, data is the edge attribute
RoutingTable(id, pid): id is the vertex ID, pid is the partition ID
The vertex-cut storage is illustrated in the figure.
Users can specify different partitioning strategies. The partitioning strategy assigns edges to the individual edge partitions, vertex masters are assigned to the vertex partitions, and the replicated vertex view caches a local copy of the vertices adjacent to the edges. Different strategies affect how many copies need to be cached and how evenly edges are spread over the edge partitions; the best strategy has to be chosen according to the structural characteristics of the graph.
The four edge partitioning strategies that ship with GraphX are shown in the figure.
Considering our specific scenario, after the first iteration the bridge vertex B in the figure will have received messages from the vertices it follows, and its attribute will grow by roughly a factor of 100. In the second iteration, if the edges of this same vertex B are assigned to different edge partitions, B will be copied into the replicated vertex view of each of those partitions whenever its attribute is updated; at the scale of our graph, memory and bandwidth cannot carry that.
Partition edges by dstId
Our partitioning idea takes into account that messages are sent toward the fan (src) side, so we place edges with the same dstId into the same partition as far as possible to reduce the number of copies of dstAttr. As shown, this avoids a super vertex acting as dst (a heavily followed user with a huge number of fans) being massively replicated when its attribute changes and blowing up memory.
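As a hedged sketch, a custom strategy that hashes only on dstId could be written against GraphX's PartitionStrategy trait as follows (the object name and the mixing prime are illustrative; the prime mirrors the one GraphX's built-in 1D strategy uses for srcId):
import org.apache.spark.graphx.{PartitionID, PartitionStrategy, VertexId}
// Hypothetical strategy: co-locate all edges that share the same dstId,
// so a hot dst vertex's attribute is replicated to as few edge partitions as possible
object DstIdPartition extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID = {
    val mixingPrime: VertexId = 1125899906842597L
    (math.abs(dst * mixingPrime) % numParts).toInt
  }
}
The strategy is applied with graph.partitionBy before the two iterations; see the snippet in the next subsection.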
Set the number of partitions reasonably
Each task processes only one partition of data. If the number of partitions is too small, each partition holds a large amount of data, creating memory pressure, and much of the executors' compute capacity cannot be fully used; if it is too large, there are too many shards and execution efficiency drops. The recommended number of partitions is 3 to 4 times the total number of CPU cores in the cluster.
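For example, with the hypothetical cluster size below (the numbers are illustrative, not our production configuration), applying the DstIdPartition strategy from the previous subsection might look like this:
// Hypothetical cluster: 100 executors x 4 cores = 400 cores in total
val totalCores = 100 * 4
// 3-4x the total core count, per the rule of thumb above
val numPartitions = totalCores * 3
val partitionedGraph = followGraph.partitionBy(DstIdPartition, numPartitions)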
Use aggregateMessages for neighbor message aggregation
aggregateMessages is the most important API in GraphX. It was added in version 1.2 to replace mapReduceTriplets; currently mapReduceTriplets is implemented on top of aggregateMessages for compatibility, and switching to aggregateMessages brings about a 30% performance improvement. GraphX's Pregel can also perform neighbor message aggregation, but we did not adopt it for the two-degree relationship solution. One of Pregel's termination conditions is that activeMessages, the number of active vertices that received messages, reaches 0, and activeMessages is computed in every iteration by calling messages.count. count is a reduce that launches a new job, which is very time consuming. In our scenario the number of iterations is fixed at two, so we chose aggregateMessages, whose return value is the VertexRDD of active vertices that received messages, avoiding the use of count during the iterations.
aggregateMessages is logically divided into three steps:
Generate messages from the edge triplets;
Send the messages to the vertices of the edge triplets;
The vertices aggregate the messages they receive.
In implementation it is divided into a map phase and a reduce phase.
Phase 1 of aggregateMessages: Map
GraphX uses the vertex RDD to update the replicated vertex view, then zips (zipPartitions) the replicated vertex view with the edge RDD to transfer vertex data to the edge RDD partitions and materialize the edge triplet view. A map operation is then performed on the edge RDD, generating a message (Msg) for each edge triplet according to the user-supplied function, which produces an RDD of (vertex ID, message) elements of type RDD[(VertexId, Msg)].
Phase 2 of aggregateMessages: Reduce
The reduce phase first partitions the message RDD from phase 1 (using the vertex RDD's partitioner), so that the partitioned message RDD has exactly the same distribution as the elements of the graph's vertex RDD. While partitioning, GraphX uses the user-supplied aggregation function to merge messages destined for the same vertex, producing a message RDD analogous to the vertex RDD.
Also, when using aggregateMessages, pay attention to the tripletFields parameter, which specifies which fields of the triplet are sent; the default is all of them (the src attribute, the dst attribute, and the edge attribute). In our model and algorithm, messages are sent in the direction opposite to the follow, and they only need dstAttr, so tripletFields can be set to TripletFields.Dst. Then only the dst attributes of vertices are replicated, reducing network transfer overhead.
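A tiny, self-contained illustration of this parameter (the toy vertices and edges are made up, and an existing SparkContext sc is assumed): each fan collects the screen names of the users it follows, and since the message only reads dstAttr, TripletFields.Dst is enough:
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexRDD}
// Toy graph: vertex attribute is a screen name, edges point from fan (src) to followed user (dst)
val g = Graph(
  sc.parallelize(Seq((1L, "A"), (2L, "B"), (3L, "C"))),
  sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(1L, 3L, 1.0))))
// Only dstAttr is shipped to the edge partitions; srcAttr and the edge attribute are not sent
val followedNames: VertexRDD[Set[String]] = g.aggregateMessages[Set[String]](
  ctx => ctx.sendToSrc(Set(ctx.dstAttr)),
  _ ++ _,
  TripletFields.Dst)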
Use Kryo serialization
By default Spark uses Java serialization, whose performance and space efficiency are both relatively poor; the official recommendation is Kryo serialization, which serializes faster and compresses better. In the Spark UI you can see what proportion of total time is spent on serialization, and with Kryo serialization the stored RDDs take roughly 1/9 of the space they take with Java serialization; see the figure.
Image source: https://github.com/EsotericSoftware/kryo/wiki/Benchmarks-for-Kryo-version-1.x
Once the Kryo serialization mechanism is enabled, it brings the following performance improvements:
External variables used in operator functions: with Kryo they are cheaper to ship, which improves network transfer performance and reduces memory consumption in the cluster.
Persisted RDDs: specify StorageLevel.MEMORY_ONLY_SER when calling persist. This optimizes memory usage; the less memory persisted RDDs occupy, the more is left for the objects created while tasks execute, so GC occurs less frequently.
Shuffle: in the shuffle between stages, tasks on different nodes pull files from one another over the network, and the data transferred over the network is also serialized, so Kryo applies here as well and improves network transfer performance.
One thing to note when using it: data types that we define ourselves need to be registered with Kryo. The code is as follows:
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
Memory and Shuffle Tuning
The following diagram shows the Spark on YARN memory structure
In the actual two-degree relationship computation, each stage is bounded by a shuffle. The upstream stage runs map tasks; each map task splits its computed result data into multiple pieces, one for each partition of the downstream stage, and writes them temporarily to disk. This process is called shuffle write. The downstream stage runs reduce tasks; each reduce task pulls over the network its designated partition of the result data from all the map tasks of the upstream stage. This is called shuffle read, after which the reduce task completes its business logic, as shown in the figure.
A map task writes shuffle files for the R reduce tasks (with SortShuffleManager), where R is the number of reduce tasks. Before being written to disk the data goes into an in-memory buffer, which spills to a disk file when it fills up; the reduce task then pulls the data it needs, and if the map and reduce run on different machines this incurs network transfer overhead. In practice, if memory is tight during shuffle, the spark.shuffle.memoryFraction parameter should be increased appropriately. It controls the proportion of executor memory allocated to shuffle read tasks for aggregation and defaults to 0.2; increasing it avoids frequent disk reads and writes during aggregation caused by insufficient memory.
Second, set the shuffle file compression codec to Snappy, replacing the original LZF, which reduces the memory used by map-phase I/O file buffers (from 400 KB per file to 32 KB per file):
conf.set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
Also pay attention to spark.storage.memoryFraction, which controls the fraction of memory used by the RDD cache and defaults to 0.6. After we switched to Kryo serialization the RDD memory footprint dropped to about 1/9 of what it was, so we turned this parameter down to 0.3, freeing up more memory for the executors to use.
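Putting the shuffle and storage settings together, a hedged example of the configuration discussed above (these fractions apply to Spark versions of that period, before unified memory management; the shuffle value is illustrative and should be tuned per workload):
conf.set("spark.shuffle.memoryFraction", "0.3")   // raised from the 0.2 default for aggregation-heavy shuffles
conf.set("spark.storage.memoryFraction", "0.3")   // lowered from 0.6 since Kryo shrinks the RDD cache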
Network Parameter Tuning
The following error occurred in actual operation:
java.util.concurrent.TimeoutException:
Futures timed out after [... seconds]
Solution
It is caused by the network or by GC: the worker or executor does not receive heartbeat feedback from the executor or task in time. Increase the value of spark.network.timeout, for example to 300s (5 min) or higher as appropriate; the default is 120s.
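For example (300s is the starting point we used; adjust to your environment):
// Raise the network timeout from the 120s default to tolerate GC pauses and slow networks
conf.set("spark.network.timeout", "300s")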
Summary
This article introduced some basic principles of Spark GraphX, together with our thinking and practical experience in Weibo's two-degree relationship recommendation. After running in the real scenario, the GraphX-based friend-of-friend recommendation delivers good results in both timeliness and recommendation conversion rate.
Reference Documents
http://spark.apache.org/docs/latest/graphx-programming-guide.html
GraphX: A Resilient Distributed Graph System on Spark, https://amplab.cs.berkeley.edu/wp-content/uploads/2013/05/grades-graphx_with_fonts.pdf
Spark GraphX in Action
https://endymecy.gitbooks.io/spark-graphx-source-analysis/content/vertex-cut.html
https://spark.apache.org/docs/latest/tuning.html
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Optimizing Shuffle Performance in Spark, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/Reports/project16_report.pdf
https://www.iteblog.com/archives/1672
http://sharkdtu.com/posts/spark-shuffle.html