[Repost] DAG algorithm application in Hadoop

Source: Internet
Author: User

http://jiezhu2007.iteye.com/blog/2041422

Data structures courses at university devote a whole chapter to graph theory; unfortunately I did not study it seriously back then, and now I have to pick it up again. Idle away your youth, and you will regret it in old age! So what is a DAG (Directed Acyclic Graph)? The textbook definition: a directed graph in which it is impossible to start at a vertex, follow a sequence of edges, and return to that same vertex. Let's take a look at which Hadoop engines apply the DAG model today.
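To make the definition concrete, here is a minimal sketch (my own illustration, not from any of the engines below) that checks the DAG property with Kahn's topological-sort algorithm: repeatedly remove vertices with no incoming edges; if some vertices can never be removed, the graph contains a cycle.

```python
from collections import deque

def is_dag(vertices, edges):
    """Return True if the directed graph contains no cycle (Kahn's algorithm)."""
    indegree = {v: 0 for v in vertices}
    adjacent = {v: [] for v in vertices}
    for src, dst in edges:
        adjacent[src].append(dst)
        indegree[dst] += 1
    # Start from vertices that no edge points into.
    queue = deque(v for v in vertices if indegree[v] == 0)
    removed = 0
    while queue:
        v = queue.popleft()
        removed += 1
        for w in adjacent[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    # Vertices stuck with indegree > 0 sit on a cycle.
    return removed == len(vertices)

print(is_dag("abc", [("a", "b"), ("b", "c")]))              # True
print(is_dag("abc", [("a", "b"), ("b", "c"), ("c", "a")]))  # False
```

Adding the edge c → a to the chain a → b → c closes a cycle, so the second graph is no longer a DAG.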

Tez:

Tez is a DAG computing framework developed by Hortonworks, a general-purpose framework that evolved from the MapReduce model. Its core idea is to decompose the map and reduce operations further: map is split into Input, Processor, Sort, Merge, and Output, while reduce is split into Input, Shuffle, Sort, Merge, Processor, and Output. These decomposed meta-operations can then be combined arbitrarily and flexibly to produce new operations which, once assembled by some control logic, form one large DAG job. Tez can therefore serve as the execution engine underneath Hive, Pig, and similar tools.
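The recombination idea can be sketched in a few lines of Python. This is a toy illustration, not the real Tez API: each "meta-operation" is a plain function over key/value records, and a vertex is just a composition of them in whatever order a job needs.

```python
def compose(*ops):
    """Chain meta-operations into a single vertex operation."""
    def vertex(records):
        for op in ops:
            records = op(records)
        return records
    return vertex

# Simulated meta-operations over a list of (key, value) records.
def sort_op(records):
    return sorted(records)

def merge_op(records):
    # Merge values that share a key (a stand-in for Tez's Merge step).
    merged = {}
    for key, value in records:
        merged[key] = merged.get(key, 0) + value
    return sorted(merged.items())

def upper_processor(records):
    # A user-defined Processor step.
    return [(key.upper(), value) for key, value in records]

# Recombine the meta-operations into a new vertex that is neither a
# classic map nor a classic reduce:
vertex = compose(upper_processor, sort_op, merge_op)
print(vertex([("b", 1), ("a", 2), ("b", 3)]))  # [('A', 2), ('B', 4)]
```

Several such vertices, wired together by edges, would form the "large DAG job" the paragraph describes.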



Oozie:

An Oozie workflow is a set of actions (for example, Hadoop map/reduce jobs and Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph) that specifies the order in which the actions execute. The graph is described in hPDL (an XML process definition language).

hPDL is a fairly concise language that uses only a handful of control-flow and action nodes. Control nodes define the execution flow: they include the workflow's starting and ending points (the start, end, and kill nodes) and the mechanisms that control the workflow's execution path (the decision, fork, and join nodes). Action nodes are the mechanism through which a workflow triggers the execution of a computation or processing task. Oozie provides support for the following action types: Hadoop map-reduce, Hadoop file system, Pig, Java, and Oozie sub-workflows.
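A minimal hPDL sketch tying these nodes together might look like the following (a hand-written illustration, not taken from any real deployment; the action bodies are abbreviated and the `${jobTracker}`/`${nameNode}` parameters are the usual placeholders resolved from the job properties). It forks into a map-reduce action and a Pig action that run in parallel, then joins before ending:

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.1">
    <start to="fork-node"/>
    <fork name="fork-node">
        <path start="mr-job"/>
        <path start="pig-job"/>
    </fork>
    <action name="mr-job">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <action name="pig-job">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>etl.pig</script>
        </pig>
        <ok to="join-node"/>
        <error to="fail"/>
    </action>
    <join name="join-node" to="end"/>
    <kill name="fail">
        <message>Workflow failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The control nodes (start, fork, join, kill, end) carry the DAG structure, while the two action nodes do the actual work.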

Spark:

The Resilient Distributed Dataset (RDD) is Spark's most basic abstraction: an abstraction of distributed memory that lets you operate on a distributed dataset the same way you would operate on a local collection. The RDD is the core of Spark. It represents a dataset that is partitioned, immutable, and can be operated on in parallel; different dataset formats correspond to different RDD implementations. An RDD must be serializable. An RDD can be cached in memory, so that the result of each operation on it is kept in memory and the next operation can read its input directly from memory, eliminating the heavy disk I/O that MapReduce incurs.

The structure of this metadata is a DAG (directed acyclic graph), where each "vertex" is an RDD (together with the operator that produced it), and an "edge" from a parent RDD to a child RDD represents the dependency between them. Spark gives this metadata DAG a cool name: lineage.

How a Spark program runs: it is launched by the client and proceeds in two stages. The first stage records the sequence of transformation operators, incrementally building the DAG; the second stage is triggered by an action operator, at which point the DAGScheduler converts the DAG into a job and its set of tasks. Spark supports running on a single local node (useful for development and debugging) as well as on a cluster.
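The two-stage behavior can be sketched with a toy model (my own illustration, not the Spark API): transformations only append edges to the lineage DAG and compute nothing; an action walks the lineage back to the source data and replays the operators.

```python
class ToyRDD:
    """A toy stand-in for an RDD that records lineage instead of computing."""

    def __init__(self, parent=None, op=None, data=None):
        self.parent = parent   # edge to the parent RDD in the lineage DAG
        self.op = op           # transformation that produces this RDD
        self.data = data       # only the root RDD holds source data

    def map(self, fn):
        # Stage 1: record the transformation; build the DAG, compute nothing.
        return ToyRDD(parent=self, op=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Stage 2: an action triggers evaluation along the lineage chain.
        chain = []
        node = self
        while node.parent is not None:
            chain.append(node.op)
            node = node.parent
        rows = node.data
        for op in reversed(chain):   # replay transformations root-first
            rows = op(rows)
        return rows

rdd = ToyRDD(data=[1, 2, 3, 4, 5])
result = rdd.map(lambda x: x * 10).filter(lambda x: x > 20).collect()
print(result)  # [30, 40, 50]
```

Nothing runs until `collect()` is called; before that, the chained `map` and `filter` calls have merely grown the lineage DAG, which is also what lets Spark recompute a lost partition from its parents.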

