The Apache Tez framework opens the door to a new generation of high-performance, interactive, distributed data-processing applications.
Data is often called the new currency of the modern economy. Enterprises that can fully exploit the value of their data make better decisions about their own operations and growth, and serve their customers more effectively. As the de facto standard big-data platform, Apache Hadoop lets enterprise users build highly scalable and cost-effective data storage systems. Companies can run large-scale, parallel, high-performance analytic workloads against this data, surfacing insights that were previously buried by technical or economic constraints. Hadoop can deliver data value at unprecedented scale and efficiency, thanks in large part to Apache Tez and Apache YARN.
Analytic applications process data in pursuit of a specific goal, so different business problems and different vendor product designs give each application its own distinct character. Creating purpose-built applications for Hadoop data access requires two major prerequisites. First, there must be an operating system (analogous to Windows or Linux on a PC) that can host, manage, and run these applications in a shared Hadoop environment. Apache YARN is that data operating system for Hadoop. Second, developers need an application-building framework and a common standard for writing data-access applications that run on YARN.
Apache Tez satisfies both requirements. Tez is an embeddable, extensible framework that integrates cleanly with YARN and lets developers write native YARN applications covering a broad spectrum of interactive and batch workloads. Tez leverages Hadoop's unmatched ability to process petabyte-scale data sets, so that projects across the Apache Hadoop ecosystem can achieve purpose-built processing logic, fast response times, and extreme throughput. Tez brings unprecedented speed and scalability to Apache projects such as Hive and Pig, and is increasingly a key building block for third-party software designed for high-speed interaction with data stored in Hadoop.
Hadoop in the post-MapReduce era
Readers familiar with MapReduce will want to know what sets Tez apart. Tez is a broader, more powerful framework that preserves MapReduce's strengths while overcoming several of its inherent limitations. The strengths Tez inherits from MapReduce include:
• Horizontal scalability, in both data size and compute capacity.
• Resource elasticity, functioning properly whether capacity is ample or constrained.
• Fault tolerance and recovery from the many unavoidable failure modes of distributed systems.
• Secure data processing through Hadoop's built-in security mechanisms.
Tez itself, however, is not a processing engine. Rather, Tez helps users build applications and engines through its flexibility and customizability. Developers can write MapReduce jobs against the Tez libraries, and existing MapReduce workloads can run on Tez, combining the efficiency of Tez with jobs that already exist and thereby improving the MapReduce pipeline as a whole.
MapReduce was (and still is) the ideal choice for those just getting started with Hadoop. But now that enterprise-class Hadoop deployments are a reality, the widely adopted platform is helping more and more users mine the data stored in their clusters for maximum business value, and the investment behind it keeps growing. Purpose-built applications are therefore beginning to displace general-purpose engines such as MapReduce, in pursuit of better resource utilization and higher performance.
The design concept of the Tez framework
Apache Tez is optimized for custom data-processing applications running on Hadoop. Tez models data processing as a dataflow graph, so that projects across the Apache Hadoop ecosystem can meet the demands of interactive response times and petabyte-scale throughput. Each vertex in the graph represents a piece of the business logic responsible for transforming or analyzing the data, and each edge between vertices represents the movement of data from one processing stage to the next.
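The dataflow-graph model above can be sketched in plain Java. This is a conceptual illustration only, not the real Tez API (the class and method names here are invented for the sketch): vertices carry processing logic, edges carry data movement, and a topological sort of the graph yields a valid execution order, which is essentially what lets Tez parallelize and schedule the stages.

```java
import java.util.*;

// Conceptual sketch (hypothetical names, NOT the actual org.apache.tez API):
// a data-processing job modeled as a directed acyclic graph whose vertices
// carry processing logic and whose edges describe how data moves between them.
public class DataflowSketch {
    record Edge(String from, String to, String movement) {}

    static final class Dag {
        final Set<String> vertices = new LinkedHashSet<>();
        final List<Edge> edges = new ArrayList<>();

        Dag vertex(String name) { vertices.add(name); return this; }

        Dag edge(String from, String to, String movement) {
            edges.add(new Edge(from, to, movement));
            return this;
        }

        // A topological order of the vertices is a valid execution order:
        // every vertex runs only after all vertices feeding it have run.
        List<String> executionOrder() {
            Map<String, Integer> inDeg = new HashMap<>();
            vertices.forEach(v -> inDeg.put(v, 0));
            edges.forEach(e -> inDeg.merge(e.to(), 1, Integer::sum));
            Deque<String> ready = new ArrayDeque<>();
            for (String v : vertices)
                if (inDeg.get(v) == 0) ready.add(v);   // sources first
            List<String> order = new ArrayList<>();
            while (!ready.isEmpty()) {
                String v = ready.poll();
                order.add(v);
                for (Edge e : edges)
                    if (e.from().equals(v)
                            && inDeg.merge(e.to(), -1, Integer::sum) == 0)
                        ready.add(e.to());
            }
            return order;
        }
    }

    public static void main(String[] args) {
        // A Hive-style query plan: two table scans feeding a join, then an
        // aggregation. Edge labels describe the data movement between stages.
        Dag dag = new Dag()
            .vertex("scan-orders").vertex("scan-users")
            .vertex("join").vertex("aggregate")
            .edge("scan-orders", "join", "scatter-gather")
            .edge("scan-users", "join", "broadcast")
            .edge("join", "aggregate", "scatter-gather");
        System.out.println(dag.executionOrder());
        // prints [scan-orders, scan-users, join, aggregate]
    }
}
```

The join vertex cannot appear in the order until both scans have completed, which is exactly the dependency a scheduler needs to overlap the independent scans while serializing the stages that depend on them.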
Once the application logic is expressed as such a graph, Tez parallelizes it and executes it on Hadoop. Any data-processing application that can be modeled this way can be built with Tez. Extract, transform, and load (ETL) jobs are ubiquitous in Hadoop data processing, and any custom ETL application is an ideal fit for Tez. Other good fits include query-processing engines such as Apache Hive, scripting platforms such as Apache Pig, and language-integrated data-processing APIs such as Cascading for Java and Scalding for Scala.
When combined with other Apache projects, Tez enables far more productive processing. Apache Hive on Tez gives Hadoop high-performance SQL, while Apache Pig on Tez optimizes large, complex ETL jobs. Likewise, running Cascading and Scalding on Tez greatly improves the execution of data pipelines written in Java and Scala.
The Tez framework provides an intuitive Java API that helps developers express their data-processing graphs precisely, maximizing application efficiency. Once a graph is defined, additional Tez APIs let developers plug custom business logic into the tasks that run within it. These APIs compose, in a modular fashion, inputs (reading data), outputs (writing data), and processors (transforming data). Think of them as the Lego bricks of data processing.
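The input/processor/output composition described above can be sketched as three small interfaces wired together. This is a minimal illustration of the modular idea, not the actual Tez classes (the interface names and the `runVertex` helper are assumptions made for the sketch): a vertex's work is assembled from a reader, a piece of business logic, and a writer, each swappable independently.

```java
import java.util.List;
import java.util.stream.Collectors;

// Conceptual sketch (hypothetical interfaces, NOT the real org.apache.tez
// classes): a vertex composes an Input (read data), a Processor (custom
// business logic), and an Output (write data) as interchangeable modules.
public class VertexComposition {
    interface Input<T> { List<T> read(); }
    interface Processor<T, R> { R process(T item); }
    interface Output<R> { void write(List<R> items); }

    // Wire the three pieces together: read everything, apply the business
    // logic to each item, and hand the results to the output.
    static <T, R> void runVertex(Input<T> in, Processor<T, R> logic, Output<R> out) {
        out.write(in.read().stream().map(logic::process).collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        List<String> sink = new java.util.ArrayList<>();
        runVertex(
            () -> List.of("hadoop", "tez", "yarn"),  // Input: read data
            s -> s.toUpperCase(),                    // Processor: business logic
            sink::addAll                             // Output: write data
        );
        System.out.println(sink); // prints [HADOOP, TEZ, YARN]
    }
}
```

The point of the modular split is that swapping the input (say, a different storage service) or the processor (different business logic) never requires touching the other two pieces, which is the "Lego brick" property the article describes.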
Applications built with these APIs run efficiently in Hadoop environments while Tez handles the complex interactions with the rest of the stack. The result is a custom-optimized application, natively integrated with YARN, that offers excellent efficiency, scalability, fault tolerance, and security in multi-tenant Hadoop environments.
Applications of the Tez framework
Enterprise users can thus use the Tez framework to create purpose-driven analytic applications on Hadoop. Tez supports two distinct kinds of application customization: defining the dataflow and customizing the business logic.
The first step is defining the dataflow for the problem at hand. Several different dataflow graphs can produce the same result, but choosing the best one can greatly improve application performance. For example, Apache Hive's performance improves significantly when the optimal join graph is constructed with the Tez API.
Then, with the processing flow fixed, enterprise users can customize the business logic by adjusting the inputs, outputs, and processors that the tasks execute.
Note that beyond enterprise users customizing their own data-processing applications, ISVs and other vendors can also use the Tez framework to deliver their unique value propositions. For example, a storage vendor can supply custom input and output implementations for its storage service. A vendor with more advanced hardware, such as RDMA or InfiniBand connectivity, can more easily fold those optimizations into its existing products.
Big data has a bright, even explosive, future, and the data capture, storage, and processing that Apache Hadoop makes possible is bound to spawn a vast variety of new applications. Because of its strengths in cutting costs, controlling complexity, and buffering the risks of big-data management, Hadoop has become a key player in the modern data architecture, alongside the enterprise data warehouse.
The advent of Apache Tez makes Hadoop still more broadly applicable, carving out new, target-driven application categories while continuing to serve existing needs. The Tez framework opens the door to a new generation of big-data applications, making it possible to build high-performance interactive applications on Hadoop without abandoning existing processes or use cases.
Original link: http://www.infoworld.com/article/2690634/hadoop/hadoop-batch-processing-mapreduce.html