GraphChi implementation of an approximate diameter algorithm

Source: Internet
Author: User

1. GraphChi

1.1 Introduction

GraphChi was developed at Carnegie Mellon University as a framework for efficient large-scale graph computation on a single machine. Unlike systems that keep the graph in memory, GraphChi stores it on the large hard disk of a single computer. Because disk access is far slower than memory access, its designers introduced the parallel sliding window technique, which reduces random disk reads and writes.

1.2 Parallel sliding window technology

The whole graph is divided into shards according to vertex order, and each shard is small enough to be loaded into memory and processed in full.

Each shard stores the incoming edges of its vertices, sorted by source vertex. As a result, the outgoing edges of those vertices are spread across all the other shards, but within each shard they occupy a contiguous block. An update therefore modifies the data in memory first and then writes it back to the other shards sequentially, which avoids the high latency caused by random reads and writes.
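The shard layout described above can be illustrated with a small in-memory sketch. It is not GraphChi's on-disk format or API; the names `build_shards` and `window`, the vertex intervals, and the toy edge list are assumptions made for illustration only.

```python
from bisect import bisect_left

def build_shards(edges, intervals):
    """Put each edge (src, dst) into the shard whose vertex interval
    contains dst, then sort every shard by source vertex."""
    shards = [[] for _ in intervals]
    for src, dst in edges:
        for i, (lo, hi) in enumerate(intervals):
            if lo <= dst <= hi:
                shards[i].append((src, dst))
                break
    for shard in shards:
        shard.sort()
    return shards

def window(shard, interval):
    """Index range [start, end) of the edges in `shard` whose source
    lies in `interval`. Because the shard is sorted by source, these
    edges form one contiguous block."""
    lo, hi = interval
    sources = [src for src, _ in shard]
    return bisect_left(sources, lo), bisect_left(sources, hi + 1)

# Hypothetical toy graph: 6 vertices split into 3 intervals.
intervals = [(0, 1), (2, 3), (4, 5)]
edges = [(0, 2), (4, 1), (1, 4), (3, 0), (2, 5), (5, 3), (0, 4), (2, 1)]
shards = build_shards(edges, intervals)

# Out-edges of interval (0, 1) inside the shard for interval (4, 5):
start, end = window(shards[2], intervals[0])
print(shards[2][start:end])
```

When the interval (0, 1) is processed, its own shard supplies the in-edges, and one contiguous window in each other shard supplies the out-edges, so every shard is touched with a single sequential read and write.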

1.3 Programming model

GraphChi keeps the vertex-centric programming model used in GraphLab, in which the vertices and edges of the graph carry user-defined data. In each iteration, the scheduled vertices in the same shard call the update function in parallel. The update function reads the data carried by the vertex's incoming edges, by its outgoing edges, and by the vertex itself, and then applies user-defined logic to update the data carried by the edges and the vertex.
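A minimal sketch of this vertex-centric model follows. It does not use GraphChi's real classes; `Vertex`, `Edge`, `run`, and the example update function `min_label_update` are hypothetical names, and the parallel update calls are replaced by a sequential loop to keep the example short.

```python
class Edge:
    """Directed edge with user-defined data. The same object is reachable
    from the source's out-edge list and the destination's in-edge list."""
    def __init__(self, src, dst, data):
        self.src, self.dst, self.data = src, dst, data

class Vertex:
    def __init__(self, vid, data):
        self.id, self.data = vid, data
        self.in_edges, self.out_edges = [], []

def build_graph(n, edge_list):
    vertices = [Vertex(v, v) for v in range(n)]   # vertex data starts as its id
    for src, dst in edge_list:
        e = Edge(src, dst, src)                   # edge data starts as the source id
        vertices[src].out_edges.append(e)
        vertices[dst].in_edges.append(e)
    return vertices

def min_label_update(v, schedule):
    """Example user-defined update: adopt the smallest label seen on the
    incoming edges, then push it to the outgoing edges."""
    new_label = min([v.data] + [e.data for e in v.in_edges])
    if new_label < v.data:
        v.data = new_label
        for e in v.out_edges:
            e.data = new_label
            schedule.add(e.dst)                   # revisit the neighbour next iteration

def run(vertices, update):
    scheduled = {v.id for v in vertices}
    while scheduled:
        next_scheduled = set()
        for vid in sorted(scheduled):             # sequential stand-in for parallel calls
            update(vertices[vid], next_scheduled)
        scheduled = next_scheduled

vertices = build_graph(4, [(0, 1), (1, 2), (3, 2)])
run(vertices, min_label_update)
labels = [v.data for v in vertices]
print(labels)
```

Only the update function is specific to the computation; the scheduling loop stays the same for any algorithm written in this style.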

2. Approximate diameter algorithm: multiple BFS

To compute the exact diameter of a graph, a BFS must be run from every vertex to find the farthest distance from that vertex; the maximum of these values is the diameter. For a large graph the time cost of this method is prohibitive, so a natural alternative is to select K vertices, find the maximum distance from each of them, and take the largest of these values as an approximation of the diameter.
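The difference between the exact and the sampled computation can be sketched as follows. The function names and the toy path graph are assumptions for illustration; the adjacency-list representation is a plain dictionary, not a GraphChi structure.

```python
from collections import deque

def eccentricity(adj, source):
    """Largest BFS distance from `source` to any vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def exact_diameter(adj):
    """One BFS per vertex."""
    return max(eccentricity(adj, s) for s in adj)

def sampled_diameter(adj, sources):
    """One BFS per selected source only."""
    return max(eccentricity(adj, s) for s in sources)

# Hypothetical toy graph: the undirected path 0 - 1 - 2 - 3 - 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(exact_diameter(adj))             # 4
print(sampled_diameter(adj, [2]))      # 2
print(sampled_diameter(adj, [0, 2]))   # 4
```

The sampled value is a maximum over a subset of the vertices, so it can never exceed the exact diameter; how close it gets depends on which sources are chosen.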

The idea is intuitive: run K BFS passes, one from each of the K vertices. In practice the time overhead is still considerable. Many vertices lie at the same distance from several of the K sources; seen another way, the BFS search trees grown from the K sources may share paths, and a shared path really needs to be traversed only once. The method proposed in the paper therefore marks the state of each vertex with a K-bit binary number that records which sources' paths have already reached it. Each iteration is then a transfer of state between vertices, and a vertex is not updated again for sources whose paths have already passed through it.
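The K-bit marking can be sketched as a single synchronous traversal in which every vertex holds a bitmask of the sources that have reached it. This is an illustration under stated assumptions, not the paper's code: `multi_source_bfs` is a hypothetical name, the graph is a plain dictionary, and each round reads only the states from the start of that round.

```python
def multi_source_bfs(adj, sources):
    """Run BFS from all sources at once. Bit i of state[v] is set once a
    path from sources[i] has reached v. Returns (rounds, state), where
    rounds is the number of rounds in which some vertex gained a bit."""
    state = {v: 0 for v in adj}
    for bit, s in enumerate(sources):
        state[s] |= 1 << bit
    frontier = set(sources)
    rounds = 0
    while frontier:
        # Collect what each neighbour would receive, using old states only.
        incoming = {}
        for u in frontier:
            for v in adj[u]:
                incoming[v] = incoming.get(v, 0) | state[u]
        next_frontier = set()
        for v, mask in incoming.items():
            new_bits = mask & ~state[v]      # sources not seen at v before
            if new_bits:
                state[v] |= new_bits
                next_frontier.add(v)
        if next_frontier:
            rounds += 1
        frontier = next_frontier
    return rounds, state

# Hypothetical toy graph: the undirected path 0 - 1 - 2 - 3 - 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
rounds, state = multi_source_bfs(adj, [0, 2])
print(rounds)   # 4
```

On this path the farthest vertex from source 0 is 4 hops away and the farthest from source 2 is 2 hops away, so the last new bit arrives in round 4, which matches the largest of the per-source BFS depths.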

3. Design of multiple BFS on GraphChi
    • Selecting the K vertices: choose the vertices numbered 0, 1, 2, 4, 8, ..., about log(npoints) vertices in total.
    • Vertex weight: stores the state of the vertex, that is, which source vertices have paths that have already reached it.
    • Edge weight: passes the state of the edge's source vertex along the edge to its destination vertex.
    • Update function

      • Visit all edges of the vertex. Take the bitwise OR of the weights of the incoming edges as new_state and the vertex's previous weight as old_state. Check whether old_state fully contains new_state, that is, whether a path from a new source vertex has reached this vertex. If such a path exists, update the state of the vertex.

      • If the state of the vertex was updated, set the weights of all its outgoing edges to the vertex's state, and add the vertices those edges point to to the queue for the next iteration.

    • Problem caused by incoming and outgoing edges sharing one weight
      After the design above was implemented, the computed diameter differed considerably from the actual value. Analysis of the source code showed that the in-edge and out-edge views of the same edge refer to the same storage location, so the weight seen through the in-edge is always identical to the weight seen through the out-edge. Under asynchronous execution this causes a problem: if, within one iteration, one vertex modifies the weight of an edge and the vertex at the other end then reads it, the value read is the new weight rather than the weight from the previous iteration.

      To solve this problem, each edge carries two weights, a "current weight" and a "next-round weight". Within one iteration, reads through incoming edges use the current weight, and modifications to edges are written to the next-round weight.
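The design above, including the two edge weights, can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's GraphChi code: `select_seeds` and `multi_bfs_diameter` are hypothetical names, vertices are processed sequentially in id order as a stand-in for asynchronous execution, and the `double_buffer` flag exists only to reproduce the shared-weight problem for comparison.

```python
def select_seeds(n):
    """Vertex ids 0, 1, 2, 4, 8, ... below n."""
    seeds, p = [0], 1
    while p < n:
        seeds.append(p)
        p *= 2
    return seeds

def multi_bfs_diameter(n, edges, double_buffer=True):
    """Count the iterations in which some vertex state changes."""
    seeds = select_seeds(n)
    state = [0] * n                              # vertex weight: bitmask of sources
    for bit, s in enumerate(seeds):
        state[s] |= 1 << bit
    in_edges = [[] for _ in range(n)]
    out_edges = [[] for _ in range(n)]
    for eid, (src, dst) in enumerate(edges):
        out_edges[src].append(eid)
        in_edges[dst].append(eid)

    cur = [0] * len(edges)                       # "current weight", read via in-edges
    for s in seeds:
        for e in out_edges[s]:
            cur[e] = state[s]
    # "next-round weight"; without double buffering it aliases cur,
    # so every write is visible immediately within the same iteration.
    nxt = list(cur) if double_buffer else cur

    scheduled = {edges[e][1] for s in seeds for e in out_edges[s]}
    iterations = 0
    while scheduled:
        next_scheduled, changed = set(), False
        for v in sorted(scheduled):
            new_state = 0
            for e in in_edges[v]:
                new_state |= cur[e]
            if new_state & ~state[v]:            # old_state does not contain new_state
                state[v] |= new_state
                changed = True
                for e in out_edges[v]:
                    nxt[e] = state[v]
                    next_scheduled.add(edges[e][1])
        if double_buffer:
            cur = list(nxt)                      # publish next-round weights
        if changed:
            iterations += 1
        scheduled = next_scheduled
    return iterations

# Hypothetical toy graph: the directed path 0 -> 1 -> 2 -> 3 -> 4 -> 5.
path = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(multi_bfs_diameter(6, path))                        # 5
print(multi_bfs_diameter(6, path, double_buffer=False))   # 3
```

With two weights per edge, a bit advances exactly one hop per iteration, so the iteration count equals the longest distance from any seed (5 on this path). With one shared weight, a bit can travel several hops inside a single iteration and the count comes out too small, which is the kind of discrepancy described above.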

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
