YARN loosens MapReduce's grip on Hadoop

Source: Internet
Author: User
Keywords: execution, control, big data

Hadoop has long been thought of as MapReduce running on HDFS (the Hadoop Distributed File System). With YARN, Hadoop 2.0 increases the number of potential applications.

Hadoop has always been an umbrella term for all kinds of open source innovations that are more or less integrated into a unified big data architecture. Some people believe that the core of Hadoop is its distributed file system (HDFS), but a series of alternative data stores such as HBase and Cassandra are challenging that claim.

Hadoop used to have a dedicated job execution layer, MapReduce, which runs on one or more alternative massively parallel data persistence layers, one of which is HDFS. But YARN (Yet Another Resource Negotiator), the latest generation of Hadoop's execution layer, removes the Hadoop environment's strict dependence on MapReduce.

Crucially, YARN eliminates a bottleneck that has restricted the execution of MapReduce jobs from the outset. Before YARN, every MapReduce job had to be run as a batch program by a single daemon (the JobTracker), which limited scalability and processing speed. These constraints pushed many vendors to build their own speedups to get around the inherent bottleneck of MapReduce; IBM's Adaptive MapReduce is a representative example.

All of this may make people wonder how "Hadoop" differs from the other big data and analytics platforms and tools in the stack. YARN is a fundamental component in the development of big data. It turns traditional Hadoop into a scalable, fit-to-purpose platform for data management, analytics, and transactional computing.

YARN transforms Hadoop into a generic distributed job execution layer, consistent with the view of Hadoop, described earlier, as an umbrella for open source innovation. Although it retains backward compatibility with the MapReduce API and continues to run MapReduce jobs, the YARN engine can also run a wide range of jobs written in other languages.
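
As a rough illustration of what a generic execution layer means, the toy Python model below separates resource allocation from the framework that uses the resources. The class names ToyResourceManager and ToyApplication, the job names, and the container counts are made up for this sketch and are not YARN APIs. Real YARN divides these roles between a ResourceManager and per-application ApplicationMasters, with far more detail than is shown here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in: hands out containers without knowing what runs in them."""
    free_containers: int
    granted: dict = field(default_factory=dict)

    def allocate(self, app_name: str, wanted: int) -> int:
        # Grant as many containers as are still free, up to the request.
        given = min(wanted, self.free_containers)
        self.free_containers -= given
        self.granted[app_name] = self.granted.get(app_name, 0) + given
        return given


@dataclass
class ToyApplication:
    """Hypothetical stand-in for a framework-specific application master."""
    name: str
    framework: str  # only the application cares about this, not the manager
    wanted: int

    def run(self, rm: ToyResourceManager) -> str:
        got = rm.allocate(self.name, self.wanted)
        return f"{self.name} ({self.framework}) runs in {got} container(s)"


rm = ToyResourceManager(free_containers=10)
apps = [
    ToyApplication("wordcount", "mapreduce", wanted=4),
    ToyApplication("pagerank", "graph", wanted=5),
    ToyApplication("stream-join", "streaming", wanted=3),
]
report = [app.run(rm) for app in apps]
for line in report:
    print(line)
```

The manager never inspects the framework field, which is the point of the sketch: MapReduce is one tenant among several, and a request is trimmed only when the cluster runs out of free containers.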

Importantly, YARN can be the unifying thread across Apache's various open source big data projects. As the US website infoworld.com recently pointed out: "The biggest win is that MapReduce itself becomes just one possible way to mine big data using Hadoop."

This is the goal of YARN, but achieving it requires the industry to redesign the Hadoop stack and the tools that work with it. In its official statement, the Apache organization said: "Any distributed application can run on YARN once it has been ported. For this purpose, Apache maintains a list of YARN-compatible applications, such as Apache Giraph, the social graph analysis system that Facebook uses. Others will follow."

This may sound good, but note the qualifier: the application has to be ported. In its statement, the Apache group said that the real test of YARN will be the extent to which vendors port their analytics and development tools to run on it. Porting development languages to YARN is no trivial matter.

Will this porting take place across the industry and across the various Apache and other open source communities? If so, how widely? These factors will determine the acceptance of YARN, the defining feature of Hadoop 2.0. Because Hadoop 2.0 retains MapReduce compatibility, existing MapReduce applications keep running without being ported. This could drastically slow the rate at which developers adopt the new framework.

In addition, given the alternative languages (such as R) and alternative platforms (such as the various NoSQL solutions) now used for big data application development, it is still unclear whether Hadoop, in either its 1.0 or 2.0 version, can sustain its current momentum over the long term.
