Architecture of Apache Spark GraphX

Source: Internet
Author: User

1. Overall architecture
The overall architecture of GraphX (Figure 1) can be divided into three layers.

  

Figure 1 GraphX Architecture

• Storage and primitive layer: the Graph class is the core class of graph computation. Internally it holds references to a VertexRDD, an EdgeRDD, and an RDD[EdgeTriplet]. GraphImpl is a subclass of Graph that implements the graph operations.
• Interface layer: the Pregel model is implemented on top of the underlying RDDs, providing a computation interface in the BSP (Bulk Synchronous Parallel) style.
• Algorithm layer: common graph algorithms are implemented on top of the Pregel interface, including PageRank, SVDPlusPlus, TriangleCount, ConnectedComponents, StronglyConnectedComponents, and others.
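
As a minimal sketch of how these layers surface in user code, the following assumes a local Spark installation with the GraphX module on the classpath. The toy data, the object name LayersSketch, and its helper functions are illustrative assumptions and do not come from the original article. It builds a Graph (storage layer), runs a small Pregel program (interface layer), and calls a built-in algorithm (algorithm layer).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object LayersSketch {
  // Storage and primitive layer: a Graph built from a vertex RDD and an edge RDD.
  // Toy data (assumption): a directed cycle 1 -> 2 -> 3 -> 1 plus an isolated vertex 4.
  def toyGraph(sc: SparkContext): Graph[String, Int] = {
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    Graph(vertices, edges)
  }

  // Interface layer: a small Pregel program that propagates the largest
  // vertex id along edge direction until no vertex changes.
  def maxReachingId(graph: Graph[String, Int]): Map[VertexId, VertexId] = {
    val init: Graph[VertexId, Int] = graph.mapVertices((id, _) => id)
    init.pregel(Long.MinValue)(
      (id, attr, msg) => math.max(attr, msg),                  // vertex program
      t => if (t.srcAttr > t.dstAttr) Iterator((t.dstId, t.srcAttr))
           else Iterator.empty,                                // send message
      (a, b) => math.max(a, b)                                 // merge messages
    ).vertices.collect().toMap
  }

  // Algorithm layer: a built-in algorithm; each vertex is labeled with the
  // smallest vertex id in its connected component.
  def components(graph: Graph[String, Int]): Map[VertexId, VertexId] =
    graph.connectedComponents().vertices.collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graphx-layers-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val g = toyGraph(sc)
      // The three views held by the Graph class.
      println(s"vertices=${g.vertices.count()} edges=${g.edges.count()} " +
        s"triplets=${g.triplets.count()}")
      println(maxReachingId(g)) // for this toy data: 1, 2, 3 -> 3 and 4 -> 4
      println(components(g))    // for this toy data: 1, 2, 3 -> 1 and 4 -> 4
    } finally sc.stop()
  }
}
```

The Pregel call takes the initial message first, then the vertex program, the message sender, and the message merger; the merger should be commutative and associative because messages for the same vertex may be combined in any order.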
2. Storage structure
In real industrial applications, graphs are enormous and often contain millions of vertices or more. To increase processing speed and the volume of data that can be handled, graph data needs to be stored and processed in a distributed way. There are roughly two ways to distribute graph storage: edge cut and vertex cut (see Figure 2). The earliest graph computing frameworks used edge-cut storage. GraphX's designers took into account that large real-world graphs mostly have far more edges than vertices, so GraphX stores graphs using vertex cut, which can reduce network transfer and storage overhead. In the underlying implementation, edges are placed on the individual nodes for storage, and vertices are broadcast between machines when data is exchanged.

The algorithms for partitioning and storing edges rely mainly on the partitioning methods encapsulated in PartitionStrategy. Several methods are provided (for example RandomVertexCut, EdgePartition1D, and EdgePartition2D), each with different trade-offs for different application scenarios, and the user can choose one according to the specific requirements by specifying in the program how edges are partitioned. For example:
val g = Graph(vertices, edges).partitionBy(PartitionStrategy.EdgePartition2D)
  

Figure 2 GraphX Storage model

Once the edges have been partitioned and stored across the cluster, the key challenge of massively parallel graph computation becomes how to join vertex attributes with the edges. GraphX handles this by moving vertex attribute data across the cluster. Since not every partition needs all vertex attributes (each partition holds only a subset of the edges), GraphX maintains a routing table internally. When a vertex has to be broadcast, the routing table maps it to the partitions that hold its edges, and the required vertex attributes are sent only to those edge partitions.
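
The routing idea can be illustrated with a small plain-Scala model that needs no Spark. This is a simplified sketch under stated assumptions, not GraphX's internal implementation: the Edge case class, the hash placement `(src + dst) % numParts`, and all function names are hypothetical and chosen only to make the example deterministic.

```scala
object RoutingTableSketch {
  final case class Edge(src: Long, dst: Long)

  // Vertex cut: every edge lives in exactly one partition.
  // The placement rule is a hypothetical stand-in for a PartitionStrategy.
  def partitionEdges(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(e => (math.abs(e.src + e.dst) % numParts).toInt)

  // Routing table: vertex id -> the edge partitions that hold at least one
  // edge touching that vertex.
  def routingTable(parts: Map[Int, Seq[Edge]]): Map[Long, Set[Int]] =
    parts.toSeq
      .flatMap { case (pid, es) => es.flatMap(e => Seq(e.src -> pid, e.dst -> pid)) }
      .groupBy(_._1)
      .map { case (vid, pairs) => vid -> pairs.map(_._2).toSet }

  // Ship each vertex attribute only to the partitions named in the routing table.
  def shipAttributes[A](attrs: Map[Long, A],
                        table: Map[Long, Set[Int]]): Map[Int, Map[Long, A]] =
    attrs.toSeq
      .flatMap { case (vid, a) =>
        table.getOrElse(vid, Set.empty[Int]).toSeq.map(pid => pid -> (vid -> a))
      }
      .groupBy(_._1)
      .map { case (pid, xs) => pid -> xs.map(_._2).toMap }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge(1, 2), Edge(1, 3), Edge(2, 3), Edge(3, 4))
    val parts = partitionEdges(edges, numParts = 2)
    val table = routingTable(parts)
    val attrs = Map(1L -> "a", 2L -> "b", 3L -> "c", 4L -> "d")
    println(table)                        // which partitions need each vertex
    println(shipAttributes(attrs, table)) // what each partition receives
  }
}
```

With this toy input, partition 0 holds only the edge (1, 3), so it receives the attributes of vertices 1 and 3 and nothing else, while vertices 1 and 3 are replicated to both partitions.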

The advantage of vertex cut is that edges are stored without redundancy, and an interaction between a vertex and its neighbors can be computed in parallel as long as the operation is commutative and associative. For example, the sum of the weights of a vertex's adjacent vertices can be computed in parallel on different nodes, after which the partial results from each node are combined, so the network overhead is small. The price is that each vertex attribute may be stored redundantly in multiple copies, with data synchronization overhead whenever vertex data is updated.
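
The neighbor-weight sum mentioned above can be sketched with GraphX's aggregateMessages operator, assuming Spark with GraphX on the classpath. The toy graph, the object name NeighborSumSketch, and the choice to treat the vertex attribute as the "weight" are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object NeighborSumSketch {
  // For every vertex, sum the weights (vertex attributes) of its adjacent vertices.
  def neighborWeightSums(sc: SparkContext): Map[VertexId, Double] = {
    // Toy data (assumption): a triangle whose vertices carry weights 1.0, 2.0, 4.0.
    val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 2.0), (3L, 4.0)))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)

    graph.aggregateMessages[Double](
      // Runs once per edge, inside the partition that stores the edge:
      // each endpoint sends its weight to the other endpoint.
      ctx => {
        ctx.sendToDst(ctx.srcAttr)
        ctx.sendToSrc(ctx.dstAttr)
      },
      // Addition is commutative and associative, so partial sums can be
      // merged in any order and on any node.
      _ + _
    ).collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("neighbor-sum-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(neighborWeightSums(sc)) finally sc.stop()
  }
}
```

For this triangle, vertex 1 ends up with 6.0, vertex 2 with 5.0, and vertex 3 with 3.0. A merge function that is not commutative and associative (for example subtraction) would give results that depend on how the edges happen to be partitioned.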
3. Tips for use
Sampling and observation: first run the computation on a small sample of the data, observe the results, and adjust the parameters; then gradually increase the amount of data by using larger sampling fractions until the job runs at full scale. Sampling can be done with the RDD sample method, and the resource consumption of the cluster can be observed through the Web UI.
1) Memory release: keep references to old graph objects, but release the vertex attributes of graphs that are no longer in use as soon as possible to save space. Vertices are released with the unpersistVertices method.
2) GC tuning: please refer to the section on performance tuning.
3) Debugging: at any point, graph.vertices.count() can be called to observe the current state of the graph for problem diagnosis and tuning.
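
The tips above can be combined into one workflow. The following is a rough sketch, assuming Spark with GraphX on the classpath; the synthetic edge list, the sampling fraction and seed, and the names TipsSketch, sampledGraph, and run are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object TipsSketch {
  // Build a graph from a sample of the edges using the RDD sample method.
  def sampledGraph(edges: RDD[Edge[Int]], fraction: Double, seed: Long): Graph[Int, Int] = {
    val sampled = edges.sample(withReplacement = false, fraction = fraction, seed = seed)
    Graph.fromEdges(sampled, defaultValue = 0)
  }

  // Returns (vertex count of the sampled graph, vertex count of the full graph).
  def run(sc: SparkContext): (Long, Long) = {
    // Synthetic data (assumption): 1000 edges i -> (i % 100 + 1).
    val edges: RDD[Edge[Int]] =
      sc.parallelize((1L to 1000L).map(i => Edge(i, i % 100 + 1, 1)))

    // Start with a 10% sample and inspect it before scaling up.
    val small = sampledGraph(edges, fraction = 0.1, seed = 42L).cache()
    val smallCount = small.vertices.count() // debugging: force evaluation and observe

    // Scale up to the full data set once the small run looks reasonable.
    val full = sampledGraph(edges, fraction = 1.0, seed = 42L).cache()
    val fullCount = full.vertices.count()

    // Memory release: the reference `small` is kept, but its cached
    // vertex attributes are freed because they are no longer needed.
    small.unpersistVertices(blocking = false)

    (smallCount, fullCount)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graphx-tips-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Calling count() forces the lazily built graph to be computed, which is why it doubles as a debugging checkpoint; while the job runs, the Spark Web UI shows the corresponding stages and cached RDDs.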
GraphX reduces the complexity of developing distributed graph algorithms by providing concise APIs and optimized graph data management. Beyond graph computation, many application scenarios in Big Data analytics involve machine learning.

MLlib, built on top of Spark, supports complex machine learning. See http://www.cnblogs.com/zlslch/p/5726346.html for details.
