GraphX Graph Data Modeling and Storage

Tags: bitset

Background

A brief analysis of how GraphX models and stores graph data.

Entry Point

Start with the GraphLoader.edgeListFile function:

def edgeListFile(
    sc: SparkContext,
    path: String,
    canonicalOrientation: Boolean = false,
    numEdgePartitions: Int = -1,
    edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
    vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
  : Graph[Int, Int]
    1. path can be a local path (a file or a directory) or an HDFS path; internally it uses sc.textFile to build a HadoopRDD. numEdgePartitions is the number of partitions.
    2. Graph storage is split into two parts, EdgeRDD and VertexRDD, and a StorageLevel can be set for each. The default is memory-only.
    3. The function accepts an edge-list file, i.e. a file made up of vertex-pair records such as "1 2" and "4 1", and turns it into a graph that can be operated on, according to the given partition count and storage levels.
Process
    1. sc.textFile reads the file and produces the raw RDD.
    2. On each partition (compute node), every record is appended to a PrimitiveVector, Spark's storage structure optimized for primitive types.
    3. The data in the PrimitiveVector is taken out and turned into an EdgePartition, the partition implementation behind EdgeRDD. In this process a column-oriented structure is built: an array of source vertex IDs, an array of destination vertex IDs, an array of edge attributes, and a forward and a backward map between each vertex's local ID and global ID.
    4. A count is run on the EdgeRDD to trigger the edge-modeling job and actually persist the data.
    5. The EdgePartitions are used to generate a RoutingTablePartition, which records the correspondence between VertexId and partition ID; the RoutingTablePartitions are then used to generate the VertexRDD.
    6. The Graph is generated from the EdgeRDD and the VertexRDD. The former maintains the edge attributes, the attributes of the vertices at each edge's endpoints, the global VertexIds of both endpoints, their local IDs (array indices within an edge partition), and the forward/backward maps used to address those arrays. The latter maintains, for each vertex, a map of which edge partitions the vertex appears in.
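The column-oriented layout in step 3 can be sketched in Python. This is an illustrative model only, not GraphX's actual Scala implementation; the dictionary keys mirror EdgePartition's field names:

```python
# Illustrative sketch of EdgePartition's column-oriented layout.
# Field names mirror the GraphX Scala class; this is not the real implementation.

def build_edge_partition(edges):
    """edges: list of (src_vertex_id, dst_vertex_id, attr) tuples."""
    global2local = {}   # global VertexId -> local index (forward map)
    local2global = []   # local index -> global VertexId (backward map)
    local_src_ids, local_dst_ids, data = [], [], []

    def local_id(vid):
        # Assign a fresh local ID the first time a global vertex ID is seen.
        if vid not in global2local:
            global2local[vid] = len(local2global)
            local2global.append(vid)
        return global2local[vid]

    for src, dst, attr in edges:
        local_src_ids.append(local_id(src))
        local_dst_ids.append(local_id(dst))
        data.append(attr)

    return {
        "localSrcIds": local_src_ids,
        "localDstIds": local_dst_ids,
        "data": data,
        "global2local": global2local,
        "local2global": local2global,
    }

part = build_edge_partition([(1, 2, 1), (4, 1, 1)])
# Vertex 1 gets local ID 0, vertex 2 gets 1, vertex 4 gets 2.
```

Note how each edge stores only small local indices; the global VertexIds live once in the backward map, which is what makes the columnar layout compact.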

The following code shows the internal storage structure more clearly.

private[graphx] class EdgePartition[
    @specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED: ClassTag,
    VD: ClassTag](
    localSrcIds: Array[Int],
    localDstIds: Array[Int],
    data: Array[ED],
    index: GraphXPrimitiveKeyOpenHashMap[VertexId, Int],
    global2local: GraphXPrimitiveKeyOpenHashMap[VertexId, Int],
    local2global: Array[VertexId],
    vertexAttrs: Array[VD],
    activeSet: Option[VertexSet])
  extends Serializable {
/**
 * Stores the locations of edge-partition join sites for each vertex attribute in a particular
 * vertex partition. This provides routing information for shipping vertex attributes to edge
 * partitions.
 */
private[graphx] class RoutingTablePartition(
    private val routingTable: Array[(Array[VertexId], BitSet, BitSet)]) extends Serializable {
Partition Placement in Detail

How are the partitions of the EdgeRDD sliced? The data comes from a HadoopRDD that scans the file by offset, so no special processing is applied when splitting the edge data; since the file has no particular ordering, which partition an edge ends up in is effectively arbitrary.

How are the partitions of the VertexRDD sliced? The EdgeRDD generates a vertexId-to-partitionId RDD of type RDD[(VertexId, Int)], which is hash-partitioned into the same number of partitions as the EdgeRDD. So the VertexRDD has the same partition count as the EdgeRDD, and its partitioning rule is a hash of the Long vertex ID.
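This hash-partitioning rule can be sketched as follows (a simplification of Spark's HashPartitioner; the modulo over the key's hash is the essential part):

```python
def hash_partition(vertex_id, num_partitions):
    """Map a (Long) vertex ID to a partition, HashPartitioner-style:
    non-negative modulo of the key's hash."""
    # Python's % already yields a non-negative result for a positive divisor.
    return hash(vertex_id) % num_partitions

# Vertex IDs that hash to the same value land in the same VertexRDD partition;
# the partition count matches the EdgeRDD's.
num_partitions = 4
placement = {vid: hash_partition(vid, num_partitions) for vid in [1, 2, 4, 5]}
```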

So the computation process can be imagined as follows:

For an operation on a vertex, first hash the VertexId (a Long) to find the partition it lives on. On that partition, if the VertexRDD is stored in memory, we can quickly look up which edge partitions contain edges incident to that vertex, and the computation is then distributed to those edge partitions.
This first step of locating the edges from the vertex hash is similar to querying a pre-built index.
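The index-like lookup can be sketched as a routing table that maps each vertex to the edge partitions referencing it. The helper below is hypothetical; the real RoutingTablePartition stores this compactly with arrays and bitsets rather than Python sets:

```python
def build_routing_table(edge_partitions):
    """edge_partitions: {partition_id: list of (src, dst) edges}.
    Returns {vertex_id: sorted list of edge-partition IDs containing it}."""
    routing = {}
    for pid, edges in edge_partitions.items():
        for src, dst in edges:
            for vid in (src, dst):
                routing.setdefault(vid, set()).add(pid)
    return {vid: sorted(pids) for vid, pids in routing.items()}

table = build_routing_table({0: [(1, 2)], 1: [(4, 1)]})
# Vertex 1 appears in edge partitions 0 and 1, so an update to vertex 1's
# attribute must be shipped to both.
```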

The official GraphX diagram (not reproduced here) illustrates this layout.

Efficient Data Structures

Native types get better-suited data structures for storage and read/write access. A typical example is the map used in EdgePartition:

/**
 * A fast hash map implementation for primitive, non-null keys. This hash map supports
 * insertions and updates, but not deletions. This map is about an order of magnitude
 * faster than java.util.HashMap, while using much less space overhead.
 *
 * Under the hood, it uses our OpenHashSet implementation.
 */
private[graphx] class GraphXPrimitiveKeyOpenHashMap[
    @specialized(Long, Int) K: ClassTag,
    @specialized(Long, Int, Double) V: ClassTag](
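The idea behind such an open-addressing map can be sketched with a toy linear-probing version (far simpler than the real OpenHashSet-based one, and class and method names here are made up for illustration):

```python
class ToyOpenHashMap:
    """Toy open-addressing (linear-probing) map for non-null integer keys.
    Like the GraphX map, it supports insert/update but no deletion."""
    EMPTY = object()  # sentinel for unused slots

    def __init__(self, capacity=16):
        self._keys = [self.EMPTY] * capacity
        self._values = [None] * capacity
        self._capacity = capacity
        self._size = 0

    def _slot(self, key):
        # Probe linearly from the hash position until the key or an empty slot.
        i = hash(key) % self._capacity
        while self._keys[i] is not self.EMPTY and self._keys[i] != key:
            i = (i + 1) % self._capacity
        return i

    def update(self, key, value):
        i = self._slot(key)
        if self._keys[i] is self.EMPTY:
            self._keys[i] = key
            self._size += 1
            if self._size * 2 > self._capacity:
                self._grow()  # keep load factor below 0.5
        self._values[self._slot(key)] = value

    def _grow(self):
        old = [(k, v) for k, v in zip(self._keys, self._values)
               if k is not self.EMPTY]
        self._capacity *= 2
        self._keys = [self.EMPTY] * self._capacity
        self._values = [None] * self._capacity
        self._size = 0
        for k, v in old:
            self.update(k, v)

    def apply(self, key):
        return self._values[self._slot(key)]
```

Storing keys and values in flat arrays instead of boxed nodes is what gives the real class its speed and space advantage over java.util.HashMap.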

And the PrimitiveVector mentioned earlier:

/**
 * An append-only, non-threadsafe, array-backed vector that is optimized for primitive types.
 */
private[spark] class PrimitiveVector[@specialized(Long, Int, Double) V: ClassTag](
    initialSize: Int = 64) {
  private var _numElements = 0
  private var _array: Array[V] = _

The end :)

Copyright notice: this is an original article by the author; please do not reproduce it without permission.

