Learning Apache Tez


You may have heard of Apache Tez, a new distributed execution framework for data-processing applications on Hadoop. But what exactly is it? How does it work? Who should use it, and why? If you have these questions, take a look at the presentation "Apache Tez: Accelerating Hadoop Query Processing" by Bikas Saha and Arun Murthy. In it they discuss Tez's design and some of its highlights, and share initial results obtained by having Hive use Tez instead of MapReduce.

Tez is Apache's latest open-source computing framework for DAG jobs; it merges multiple dependent jobs into a single job to significantly improve DAG-job performance. Tez is not aimed at end users. Rather, it lets developers build applications that are faster and more scalable for end users. Hadoop is traditionally a platform for large-scale batch data processing, but many use cases demand near-real-time query performance, and some workloads, such as machine learning, fit the MapReduce paradigm poorly. The purpose of Tez is to help Hadoop handle these scenarios.


The goal of the Tez project is to support a high degree of customization so that it can serve a wide variety of use cases, letting people get their work done without resorting to external tools. If projects such as Hive and Pig use Tez rather than MapReduce as the backbone of their data processing, they will significantly improve their response times. Tez is built on YARN, Hadoop's new resource-management framework.

Design philosophy

The main motivation for Tez is to work around the limitations imposed by MapReduce. Beyond the requirement to express everything as mappers and reducers, forcing all kinds of computation into this paradigm is inefficient; for example, using HDFS to store temporary data between multiple MR jobs is a substantial overhead. In Hive, it is common for a single query to perform multiple shuffle operations on unrelated keys, such as join - group by - window function - order by.

Key elements within the Tez design philosophy include:

    • Allow developers (including end users) to do what they want in the most efficient way
    • Better execution performance

Tez is able to achieve these goals thanks to the following:

  • Expressive dataflow definition API: the Tez team wants users to describe the directed acyclic graph (DAG) of the computation they want to run through a set of expressive dataflow definition APIs. To support this, Tez provides a structured, typed API in which you add all the processors and edges and can visualize the graph that is actually being built.
  • Flexible input-processor-output runtime model: runtime executors can be assembled dynamically by connecting different inputs, processors, and outputs.
  • Data-type agnosticism: Tez cares only about the movement of data, not its format (key-value pairs, tuple-oriented formats, etc.).
  • Dynamic graph reconfiguration.
  • Simple deployment: Tez is entirely a client-side application that leverages YARN's local resources and distributed cache. To use Tez, you do not need to deploy anything on the cluster itself; just upload the Tez libraries to HDFS and use the Tez client to submit jobs against them.

    You can even keep two copies of the libraries on your cluster: one stable version for all production jobs, and another with the latest version for user experimentation. The two sets of libraries are independent and do not affect each other.

  • Tez can run arbitrary MR tasks without any changes, which enables a smooth migration for tools that currently rely on MR.

Let's explore this expressive dataflow API in more detail and see what we can do with it. For example, you can use the MRR pattern instead of multiple MapReduce jobs, so that a single map phase feeds multiple reduce stages; data flows between the different processors without anything being written to HDFS (it is written to local disk, but only for checkpointing), which is a significant performance improvement over the previous approach. The following diagrams illustrate this:

The first diagram shows a process consisting of multiple MR jobs, each storing its intermediate results on HDFS: the reducers of one step feed the mappers of the next. The second diagram shows the same process on Tez, completed in a single job with no HDFS access between steps.
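The difference between the two diagrams can be sketched as a toy model (this is illustrative Python, not the Tez API): a chain of MR jobs must persist every intermediate result to HDFS, while an MRR-style Tez DAG streams data between its stages.

```python
# Conceptual sketch (not the real Tez API): count the intermediate HDFS
# round-trips needed by a chain of MR jobs versus one MRR-style Tez DAG.

def mr_chain(jobs):
    """Classic Hadoop: each job persists its output to HDFS for the next job."""
    data = "input"
    for job in jobs:
        for phase in job:
            data = f"{phase}({data})"
    # every job except the last writes an intermediate result to HDFS
    return data, len(jobs) - 1

def tez_dag(phases):
    """Tez: one job; data streams between processors (local-disk checkpoints only)."""
    data = "input"
    for phase in phases:
        data = f"{phase}({data})"
    return data, 0  # no intermediate HDFS writes between stages

# The same logical pipeline, expressed both ways. A second reduce over the
# output of the first forces a whole extra MR job in the classic model.
out_mr, hdfs_mr = mr_chain([["map", "reduce1"], ["identity_map", "reduce2"]])
out_tez, hdfs_tez = tez_dag(["map", "reduce1", "reduce2"])
print(hdfs_mr, hdfs_tez)  # 1 intermediate HDFS write vs 0
```

The point of the model is only the bookkeeping: chaining n MR jobs costs n-1 HDFS round-trips, and the Tez DAG costs none.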

Tez's flexibility means it costs more effort to adopt than MapReduce: you need to learn more APIs and implement more processing logic. That is acceptable, because unlike MapReduce it is not an end-user-facing application; it is designed to let developers build applications on top of it for end users.

That covers an overview of Tez and its objectives; now let's look at the actual API.

Tez API

The Tez API consists of several components:

  • Directed acyclic graph (DAG): defines the overall job. One DAG object corresponds to one job.
  • Vertex: defines the user logic, plus the resources and environment needed to execute it. One vertex corresponds to one step in the job.
  • Edge: defines the connection between producer and consumer vertices. Edges must be assigned properties, which Tez needs in order to expand the logical graph at runtime into the set of physical tasks that run in parallel on the cluster. Some of these properties are:

    • Data-movement properties, which define how data moves from a producer to a consumer.
    • Scheduling properties (sequential or concurrent), which define when producer and consumer tasks should be scheduled relative to each other.
    • Data-source properties (persisted, persisted-reliable, or ephemeral), which define the lifetime and durability of a task's output, letting Tez decide when it can be discarded.
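The DAG/Vertex/Edge structure above can be modeled in a few lines. This is a standalone sketch whose names mirror the Tez concepts (the real API lives in Java under `org.apache.tez.dag.api`); the property values shown are illustrative, and the `topological_order` method stands in for the runtime's expansion of the logical graph.

```python
from collections import namedtuple

# Conceptual sketch of the Tez API components; not the real Tez library.
Vertex = namedtuple("Vertex", ["name", "parallelism"])
Edge = namedtuple("Edge", ["producer", "consumer",
                           "data_movement",   # e.g. one-to-one, broadcast, scatter-gather
                           "scheduling",      # sequential or concurrent
                           "data_source"])    # persisted or ephemeral

class DAG:
    def __init__(self, name):
        self.name, self.vertices, self.edges = name, [], []
    def add_vertex(self, v):
        self.vertices.append(v); return self
    def add_edge(self, e):
        self.edges.append(e); return self
    def topological_order(self):
        """Stand-in for expanding the logical graph into an executable order."""
        pending = {v.name: 0 for v in self.vertices}
        for e in self.edges:
            pending[e.consumer.name] += 1
        ready = [n for n, deg in pending.items() if deg == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for e in self.edges:
                if e.producer.name == n:
                    pending[e.consumer.name] -= 1
                    if pending[e.consumer.name] == 0:
                        ready.append(e.consumer.name)
        return order

# A small MRR-style plan: one map vertex feeding two chained reduce stages.
m, r1, r2 = Vertex("map", 4), Vertex("reduce1", 2), Vertex("reduce2", 1)
dag = (DAG("mrr-query")
       .add_vertex(m).add_vertex(r1).add_vertex(r2)
       .add_edge(Edge(m, r1, "scatter_gather", "sequential", "persisted"))
       .add_edge(Edge(r1, r2, "scatter_gather", "sequential", "persisted")))
print(dag.topological_order())  # ['map', 'reduce1', 'reduce2']
```

Note how the graph itself carries no execution detail; the edge properties are what let the runtime turn each logical vertex into the right number of parallel physical tasks scheduled at the right time.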

If you want to see an example of how the API is used, a detailed description of these properties, and how the runtime expands the logical graph, take a look at the article provided by Hortonworks.

The runtime API is based on the input-processor-output model, in which all inputs and outputs are pluggable. For convenience, Tez uses an event-based model for communication between tasks and the system, and between components. Events carry information such as task failures to the components that need it, route dataflow information (such as the location of generated data) from outputs to inputs, and enable changes to the DAG execution plan at runtime.
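The event-based model can be illustrated with a minimal publish/subscribe bus. This is a sketch of the idea, not Tez's actual event classes: an upstream task's output announces where its data landed, and the downstream task's input (and a failure handler) react to the events they subscribed to.

```python
# Conceptual sketch of event-based communication between components;
# names and event payloads are illustrative, not Tez's real event types.

class EventBus:
    def __init__(self):
        self.handlers = {}
    def subscribe(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)
    def publish(self, event_type, payload):
        for handler in self.handlers.get(event_type, []):
            handler(payload)

bus = EventBus()
received = []

# The input of a downstream task learns where to fetch its data from.
bus.subscribe("data_movement", lambda loc: received.append(("fetch", loc)))
# A failure event can trigger a re-run or a change to the execution plan.
bus.subscribe("task_failed", lambda info: received.append(("retry", info)))

# The output of an upstream task announces the location of its results.
bus.publish("data_movement", "node3:/local/shuffle/part-0")
bus.publish("task_failed", "reduce1-attempt0")
print(received)
```

Because producers and consumers only share event types, new inputs and outputs can be plugged in without the rest of the system knowing about them, which is the point of the pluggable model.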

Tez also offers a variety of out-of-the-box input and output processors.

These expressive APIs enable the writers of higher-level languages, such as Hive, to gracefully translate their queries into Tez jobs.

Tez Scheduler

When deciding how to assign tasks, the Tez scheduler takes many factors into account, including task locality requirements, container compatibility, the total resources available in the cluster, the priority of pending task requests, automatic parallelization, and releasing resources the application no longer needs (for example, because the data is not local to them). It also maintains a pool of pre-warmed JVMs with shared registry objects. Applications can store precomputed information of various kinds in these shared registry objects so it can be reused later without recomputation, and the combination of shared registries and pooled containers lets tasks start very quickly.
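The effect of container reuse can be shown with a toy pool. This is an illustrative model, not the Tez scheduler's API: warm containers go back to a pool and are handed to later tasks, so far fewer containers (JVMs) are launched than tasks are run, and a shared registry survives across tasks.

```python
# Conceptual sketch of container reuse and the shared registry; names are
# illustrative, not the real Tez scheduler interface.

class ContainerPool:
    def __init__(self):
        self.idle = []
        self.launched = 0
        self.shared_registry = {}   # precomputed objects, reused across tasks
    def acquire(self):
        if self.idle:
            return self.idle.pop()  # reuse a warm container: no JVM startup
        self.launched += 1          # cold start: launch a new container
        return f"container-{self.launched}"
    def release(self, container):
        self.idle.append(container)

pool = ContainerPool()
for task_id in range(10):           # ten tasks running one after another
    c = pool.acquire()
    # First task pays the cost; later tasks find the cached object ready.
    pool.shared_registry.setdefault("hash_table", f"built by task {task_id}")
    pool.release(c)

print(pool.launched)  # 1: a single warm container served all ten tasks
```

In the sequential case a single container serves every task, which is where the low latency of Hive-on-Tez sessions comes from.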

If you want to learn more about container re-use, you can check here.

Scalability

In general, Tez gives developers extensive extensibility for handling complex processing logic. This can be illustrated by the example of how Hive uses Tez.

Consider this classic TPC-DS query pattern, in which several dimension tables are joined to a fact table. Most optimizers and query engines can handle the scenario shown in the upper-right corner of the figure: if the dimension tables are small, they can all be broadcast to the larger fact table, and the same can be done on Tez.

But what if those broadcasts involve user-defined, computationally expensive functions? Then you cannot do it all in one step. You have to split the work into stages, as shown in the topology on the left of the figure: the first dimension table is broadcast-joined to the fact table, and the result is then broadcast-joined to the second dimension table.

The third dimension table is too large to broadcast, so a shuffle join is used instead; Tez can handle this kind of mixed topology very efficiently.
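The strategy choice described above can be summarized as a small decision rule. The threshold, table names, and sizes below are made up for illustration; this is a sketch of the planning logic, not Hive's actual optimizer.

```python
# Conceptual sketch of the per-table join-strategy decision; the cutoff
# and the table sizes are hypothetical values, not Hive defaults.

BROADCAST_LIMIT = 100_000  # rows; illustrative cutoff for broadcasting

def pick_join_strategy(dim_table_rows, has_expensive_udf=False):
    if dim_table_rows > BROADCAST_LIMIT:
        return "shuffle"            # too big to ship: repartition on the join key
    if has_expensive_udf:
        return "staged_broadcast"   # broadcast-join one dimension at a time
    return "broadcast"              # ship the small table to every fact task

plan = {
    "dim_date":     pick_join_strategy(10_000),
    "dim_store":    pick_join_strategy(50_000, has_expensive_udf=True),
    "dim_customer": pick_join_strategy(5_000_000),
}
print(plan)
```

A single Tez DAG can mix all three answers: broadcast edges for the small tables, a staged pipeline where the expensive functions force it, and a scatter-gather (shuffle) edge for the large one.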

The benefits of using Tez to complete this type of hive query include:

    • It provides sessions and reusable containers, so latency is low and shuffling is avoided as much as possible.

Executing this particular Hive query on the new Tez engine improves performance by more than 100%.

Road Map
    • Richer DAG support. For example, could Samza use Tez as its underlying layer and build applications on top of it? For Tez to handle Samza's core scheduling and streaming requirements, the development team has some work to do. The Tez team will explore how these kinds of connection patterns can be used in DAGs. They also want to provide better fault tolerance, more efficient data transfer for further performance gains, and improved session performance.
    • Given how complex these DAGs can become, automated tools are needed to help users understand their performance bottlenecks.
Summary

Tez is a distributed execution framework that supports DAG jobs. It maps easily onto higher-level declarative languages such as Hive, Pig, Cascading, and so on. It has a highly customizable execution architecture that allows dynamic performance optimizations at runtime, based on real-time information about the data and the available resources. The framework handles many tricky issues automatically, so jobs run smoothly and correctly.

With Tez, you get good performance and efficiency out of the box. Tez aims to address some of the problems facing Hadoop data processing, including latency and execution complexity. Tez is an open-source project and is already used by Hive and Pig.

