Open-Source Graph Computing Framework GraphLab: An Introduction

Introduction to GraphLab

GraphLab, developed in 2010 by the Select Lab at Carnegie Mellon University (CMU), is an open-source graph computing framework built on a graph-processing model and implemented in C++. It is a parallel computing framework for machine learning (ML) that can run on multicore standalone systems, on clusters, or on Amazon EC2. Like MapReduce, the framework is designed around a high-level abstraction: it efficiently executes iterative machine learning algorithms with sparse computational dependencies while guaranteeing data consistency and high parallel performance during computation. Although originally built to handle large-scale machine learning tasks, the framework also suits many computational tasks in data mining. In the field of parallel graph computing, it outperforms many other parallel computing frameworks (for example, MapReduce and Mahout) by orders of magnitude on suitable workloads. Since its inception, GraphLab has been a rapidly developing open-source project with a broad user base: more than 2,000 enterprises and institutions worldwide use it.

Advantages of GraphLab

As a parallel computing framework built on graph processing, GraphLab efficiently executes machine learning algorithms with strong data dependencies and iterative structure. Its design has the following characteristics and advantages.

  • A unified API. The same API covers both multicore processors and distributed environments, so a program written once can run efficiently either in a shared-memory setting or on a distributed cluster.
  • Performance. The optimized C++ execution engine strikes a good balance between heavy multithreaded computation and asynchronous I/O.
  • Strong scalability. GraphLab can intelligently choose nodes for storage and computation because it uses well-designed algorithms for placing and computing over data.
  • Integrated HDFS support. GraphLab has built-in HDFS support: it can read input data directly from HDFS and write result data directly back to HDFS.
  • A powerful machine learning toolset. GraphLab ships with a large set of out-of-the-box tools implemented on top of its own API.
Installing GraphLab under Windows

GraphLab does not currently support Windows natively; for now it can only be run in a Linux virtual machine through VMware Player. The project provides a pre-configured GraphLab Create VM file, which eliminates steps such as compiling from source. To download GraphLab Create and set it up as required:

  • First, download the GraphLab Create VM file.
  • Then install VMware Player and import the GraphLab Create VM file, as detailed in the documentation.
  • Finally, check in IPython whether the graphlab library can be imported (see the documentation).
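The final step can be scripted. Here is a minimal sketch of such an import check; it assumes only that the GraphLab Create package is importable as `graphlab`, and the helper name `check_graphlab` is illustrative:

```python
# Quick sanity check for a GraphLab Create installation.
def check_graphlab():
    """Return the installed version string, or None if the import fails."""
    try:
        import graphlab  # provided by the GraphLab Create package
    except ImportError:
        return None
    return str(getattr(graphlab, "version", "unknown"))

if __name__ == "__main__":
    version = check_graphlab()
    if version is None:
        print("GraphLab Create is not installed")
    else:
        print("GraphLab Create version:", version)
```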

Comparison of GraphLab and MapReduce

Machine learning algorithms generally share the following two features:

  • Strong data dependence. During computation, the machines participating in the calculation often need to exchange large amounts of data.
  • Complex processing flow. The overall computation typically requires iteration and has many data-processing branches, which makes true parallelism difficult to achieve.

Before GraphLab appeared, the common way to program these machine learning algorithms was to build on low-level libraries such as MPI and pthreads. Under that programming model, the developer must implement, alongside the algorithm itself, the low-level machinery for host-to-host communication and data synchronization among the cluster's compute nodes. The advantage of this approach is that the code can be deeply optimized for a specific application to achieve high performance. However, for each new application the developer must rewrite the code handling data distribution, communication, and other details, which leads to very low code reuse, poor scalability, and high demands on the programmer. This programming model is clearly unsuited to today's agile Internet development. The widely used MapReduce framework, for its part, requires that tasks be independent when executed in parallel: tasks must not need to communicate with one another during execution. MapReduce is therefore a poor fit for strongly data-dependent tasks, and its computing model also cannot express iterative algorithms efficiently. The model has clear advantages for data-independent workloads such as log analysis and data statistics, but it does not serve the computing tasks of the machine learning field well.

GraphLab is not a replacement for MapReduce. Rather, it takes the ideas behind MapReduce and generalizes that parallel computing model to problems with overlapping data, data dependencies, and iterative algorithms. In essence, GraphLab fills the gap between the highly abstract MapReduce parallel computing model and low-level message-passing and multithreading models such as MPI and pthreads.

MapReduce, the currently popular parallel computing framework, abstracts parallel computation into two basic operations, map and reduce. In the map phase, a job is divided into independent tasks that are processed in parallel on the cluster; in the reduce phase, the map outputs are combined to obtain the final result. GraphLab mirrors this abstraction. The map operation of MapReduce corresponds to GraphLab's update function, which can read and modify a user-defined graph-structured dataset: the data graph supplied by the user represents the program state associated with each vertex and edge, and an update function can trigger further update operations recursively, so that updates propagate dynamically and iteratively across other graph vertices. GraphLab provides powerful control primitives to govern the order in which update functions execute. The reduce operation of MapReduce corresponds to GraphLab's sync operation. Sync operations can perform reductions in the background while computation is running, and, like update functions, they can process many records concurrently, allowing synchronization to scale to large workloads.
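To make the update-function idea concrete, here is a small, self-contained sketch (not GraphLab's actual C++ API; the graph data and function names are illustrative). An update function reads and writes vertex data and can re-schedule its neighbors, giving the dynamic, data-dependent iteration that a fixed map/reduce pipeline cannot express. The example computes single-source shortest paths:

```python
from collections import deque

# Toy graph: adjacency list with edge weights (illustrative data).
edges = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}
# Vertex data: tentative distance from the source vertex "a".
dist = {v: float("inf") for v in edges}
dist["a"] = 0

def update(v, schedule):
    """Update function: relax v's out-edges and, whenever a neighbor's
    value changes, signal (re-schedule) that neighbor."""
    for nbr, w in edges[v]:
        if dist[v] + w < dist[nbr]:
            dist[nbr] = dist[v] + w
            schedule(nbr)  # dynamic scheduling: only affected vertices rerun

# Scheduler loop: keep running update functions until no vertex is pending.
pending = deque(["a"])
while pending:
    update(pending.popleft(), pending.append)

print(dist)  # shortest distances from "a"
```

Note how the set of pending vertices shrinks as values converge; a MapReduce job, by contrast, would have to reprocess every record in every iteration.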

The GraphLab Parallel Framework

GraphLab abstracts the data into a graph structure and abstracts the execution of an algorithm into three steps: gather, apply, and scatter. The core idea behind its parallelism is splitting vertices.

In the example, the values of v0's adjacent vertices must be summed. In a serial implementation, v0 iterates over all of its adjacent vertices and accumulates the sum. In GraphLab, the vertex v0 is split: its edges and the corresponding adjacent vertices are distributed across two processors, each machine computes a partial sum in parallel, and the final result is obtained through communication between the master vertex and the mirror vertex.
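A minimal numeric sketch of this vertex splitting (the machine assignment and the values are illustrative): the neighbors of v0 are partitioned across two machines, each machine sums its local portion, and the mirror forwards its partial sum to the master, which produces the final total.

```python
# Neighbor values of v0, partitioned across two machines (illustrative).
machine0 = [3, 1, 4]   # edges of v0 stored on machine 0 (holds the master)
machine1 = [1, 5, 9]   # edges of v0 stored on machine 1 (holds a mirror)

# Each machine computes a partial sum over its local edges in parallel.
partial_master = sum(machine0)
partial_mirror = sum(machine1)

# The mirror sends its partial result to the master, which combines them.
total = partial_master + partial_mirror
print(total)  # 23
```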

The graph structure


The vertex is GraphLab's minimum unit of parallelism and communication, and the edge represents the data dependencies in a machine learning algorithm.
A vertex may be deployed on multiple machines: one machine holds the master vertex, and the remaining machines hold mirrors. The master manages all of the vertex's mirrors and is responsible for dispatching computing tasks to them, while each mirror acts as the vertex's agent on its machine and stays synchronized with the master's data.
An edge is deployed on exactly one machine, while the vertices it connects may be replicated on several machines. This arrangement addresses the problem that edge data is typically much larger in volume than vertex data.
All the edges and vertices on one machine form its local graph, and each machine maintains a table mapping local IDs to global IDs. Vertices are shared by all threads in a process, and during parallel computation each thread executes the gather->apply->scatter operations for its allocated share of the vertices in the process.
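The layout described above can be sketched as follows. This is a toy model, not GraphLab's actual placement algorithm; the edge placement and all names are illustrative. Each edge lives on exactly one machine, every vertex touched by a local edge is replicated there with a local ID, and one replica of each vertex is designated the master.

```python
# Global edge list (illustrative). Each edge is placed on exactly one machine.
edges = [("u", "v"), ("v", "w"), ("u", "w"), ("w", "x")]
placement = {("u", "v"): 0, ("v", "w"): 1, ("u", "w"): 0, ("w", "x"): 1}

machines = {0: {"edges": [], "local_id": {}}, 1: {"edges": [], "local_id": {}}}

for edge in edges:
    m = machines[placement[edge]]
    m["edges"].append(edge)
    # Any vertex touched by a local edge is replicated here with a local ID.
    for v in edge:
        if v not in m["local_id"]:
            m["local_id"][v] = len(m["local_id"])

# One replica of each vertex is chosen as master; the rest act as mirrors.
replicas = {}
for mid, m in machines.items():
    for v in m["local_id"]:
        replicas.setdefault(v, []).append(mid)
masters = {v: reps[0] for v, reps in replicas.items()}

print(machines[0]["local_id"])  # local-to-global ID mapping on machine 0
print(masters)                  # master machine for each vertex
```

Vertices v and w appear on both machines, so each has one master and one mirror, while each edge is stored only once.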

The execution model of GraphLab


In every iteration, each vertex passes through the three stages gather->apply->scatter.

    1. Gather stage
      The work vertex's edges (possibly all edges, or only the in-edges or out-edges) collect data from the adjacent vertex and from the edge itself, recorded as gather_data_i. GraphLab then sums this data over all of the edges into sum_data. In this stage both the work vertex and the edges are read-only.
    2. Apply stage
      Each mirror sends its gather result sum_data to the master vertex, and the master aggregates these into total. Using total and the vertex data from the previous step, the master performs the application-specific computation, then updates the master vertex's data and synchronizes it to the mirrors. In the apply stage the work vertex may be modified, but the edges may not.
    3. Scatter stage
      After the work vertex has been updated, the data on its edges is updated, and the neighboring vertices that depend on it are notified of the update. In the scatter stage the work vertex is read-only, while the data on the edges is writable.

In this execution model, GraphLab achieves mutual exclusion by controlling read and write permissions across the three stages: gather is read-only, apply writes only to the vertex, and scatter writes only to the edges. Synchronization in parallel computation is achieved through masters and mirrors; the mirror acts as an interface for its vertex on each machine, abstracting complex data traffic into vertex behavior.
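The three stages above can be sketched with a simplified, single-machine PageRank. This is a hedged illustration, not GraphLab's real distributed engine (which is C++); the graph, damping factor, tolerance, and helper names are all assumptions made for the example. Gather sums contributions over in-edges read-only, apply writes the new vertex value, and scatter signals out-neighbors whose input has changed:

```python
# Simplified single-machine PageRank in gather/apply/scatter form.
out_edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
in_edges = {"a": ["c"], "b": ["a"], "c": ["a", "b"]}
rank = {v: 1.0 for v in out_edges}
DAMPING, TOL = 0.85, 1e-6

def gather(v):
    # Read-only pass over v's in-edges: sum the neighbors' contributions.
    return sum(rank[u] / len(out_edges[u]) for u in in_edges[v])

def apply(v, total):
    # Write only the work vertex's value; edges are untouched.
    old = rank[v]
    rank[v] = (1 - DAMPING) + DAMPING * total
    return abs(rank[v] - old)

def scatter(v, delta, schedule):
    # Signal out-neighbors only if v's value changed noticeably.
    if delta > TOL:
        for nbr in out_edges[v]:
            schedule(nbr)

pending = set(rank)
while pending:
    v = pending.pop()
    delta = apply(v, gather(v))
    scatter(v, delta, pending.add)

print({v: round(r, 4) for v, r in rank.items()})
```

Because scatter only re-schedules vertices whose inputs changed, work dies out as the ranks converge, which is exactly the dynamic iteration the execution model is designed to support.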

Resources

Graphlab: A new parallel framework for machine learning
Graphlab Baidu Encyclopedia
Easily deal with terabytes of data, open source Graphlab breakthrough human Graph Computing "limit value"
Graphlab: Applying Big data analytics from ideas to production

When reprinting, please credit the author, Jason Ding, and the source:
Gitcafe Blog Home page (http://jasonding1354.gitcafe.io/)
GitHub Blog Home page (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search Baidu for jasonding1354 to find my blog homepage

