At the 2013 Hadoop Summit, YARN was a hot topic: the new operating system of Hadoop, breaking through the performance bottleneck of the MapReduce framework. Murthy said that the combination of Hadoop and YARN is the key to a successful enterprise big data platform.
Yahoo! originally developed Hadoop to search and index web pages, and many search services are still based on this framework, but Hadoop was essentially a single-purpose solution.
MapReduce is the main mechanism for processing data stored in HDFS. It has been a good choice for many years for processing and analyzing massive amounts of semi-structured data such as log files, but it is not well suited to other types of data analysis. Three years ago, Hortonworks founder and architect Arun Murthy began restructuring Hadoop to make it a more versatile big data platform. (Hortonworks has just announced $50 million in a new round of financing, led by Tenaya Capital and Dragoneer Investment Group, with earlier investors Benchmark Capital, Index Ventures, and Yahoo! also participating.)
"When we started building Hadoop 2.0, we wanted to radically redesign the Hadoop architecture so that multiple applications could run on Hadoop and work against shared datasets," Arun Murthy said. "This allows multiple types of applications to run on the same cluster efficiently and in a controlled way. This is the real reason Apache YARN, the foundation of Hadoop 2.0, was born. With YARN managing cluster resource requests, Hadoop is upgraded from a single-application system to a multi-application operating system."
The other types of applications Murthy has in mind include machine learning, image analysis, stream processing, and interactive query. Once YARN is fully operational, developers will be able to use the YARN "operating system" to bring the data stored in HDFS to these applications. Hive, for example, is a SQL-style data warehouse tool on HDFS developed by Facebook, but its back-end data processing goes through MapReduce, so Hive consumes resources and affects other jobs running concurrently. Other Hadoop-related data analysis subprojects behave similarly.
YARN is a true Hadoop resource manager that allows multiple applications to run simultaneously and efficiently on one cluster. With YARN, Hadoop becomes a true multi-application platform that can serve the entire enterprise. Murthy says YARN lets applications interact with data in an unprecedented way; YARN already ships in the Hortonworks Data Platform, and the combination of Hadoop and YARN is key to the success of big data platforms.
The basic architecture of MapReduce 2.0 -- YARN
MapReduce underwent a massive update in Hadoop 0.23; the new version, MapReduce 2.0, is known as YARN or MRv2.
The basic idea of YARN is to separate the two main functions of the JobTracker (resource management and job scheduling/monitoring) by creating a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application here is either a traditional MapReduce job or a DAG (directed acyclic graph) of jobs.
The ResourceManager and the NodeManager (NM) on each slave node constitute the data computing framework. The ResourceManager is responsible for ultimately arbitrating resources among all applications. The NodeManager is the per-machine framework agent, responsible for managing containers, monitoring their resource usage (CPU, memory, disk, network), and reporting to the ResourceManager/Scheduler. The ApplicationMaster of each application is in effect a framework-specific library that combines the resources obtained from the ResourceManager with the NodeManagers to run and monitor tasks. The ApplicationMaster is also responsible for requesting appropriate resource containers from the Scheduler, tracking their status, and monitoring their progress.
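The division of labor described above can be sketched as a toy simulation. This is a minimal sketch, not the real org.apache.hadoop.yarn API: all class names, method names, and the first-fit placement policy are illustrative assumptions.

```python
from dataclasses import dataclass

# Toy model of the YARN control flow described above.
# Names and the first-fit policy are illustrative, NOT the real YARN API.

@dataclass
class Container:
    node: str
    memory_mb: int
    vcores: int

class NodeManager:
    """Per-machine agent: tracks the node's remaining free resources."""
    def __init__(self, host, memory_mb, vcores):
        self.host = host
        self.memory_mb = memory_mb
        self.vcores = vcores

    def can_fit(self, memory_mb, vcores):
        return self.memory_mb >= memory_mb and self.vcores >= vcores

class ResourceManager:
    """Global arbiter: grants containers out of the nodes' free capacity."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        for node in self.nodes:          # first-fit placement, for simplicity
            if node.can_fit(memory_mb, vcores):
                node.memory_mb -= memory_mb
                node.vcores -= vcores
                return Container(node.host, memory_mb, vcores)
        return None                      # in real YARN the request stays pending

class ApplicationMaster:
    """Per-application coordinator: asks the RM for containers, runs tasks."""
    def __init__(self, rm):
        self.rm = rm
        self.containers = []

    def run_tasks(self, n_tasks, memory_mb=1024, vcores=1):
        for _ in range(n_tasks):
            c = self.rm.allocate(memory_mb, vcores)
            if c is None:
                break
            self.containers.append(c)
        return len(self.containers)

rm = ResourceManager([NodeManager("node1", 4096, 4), NodeManager("node2", 2048, 2)])
am = ApplicationMaster(rm)
print(am.run_tasks(8))  # only 6 of the 8 tasks fit: 4 on node1 + 2 on node2
```

The point of the sketch is the separation of concerns: the ResourceManager only hands out capacity, the NodeManagers only account for their own machines, and all per-application logic lives in the ApplicationMaster.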
There are two main components in the ResourceManager: the Scheduler and the ApplicationsManager.
The Scheduler is responsible for allocating resources to running applications. It is, in a sense, a pure scheduler: it does not monitor or track the state of an application, nor does it restart tasks that fail because of application errors or hardware faults. The Scheduler schedules based on each application's resource requirements, which are expressed through an abstract resource notion, the Container, covering memory, CPU, disk, and network. The ApplicationsManager is responsible for accepting job submissions, launching the application-specific ApplicationMaster, and restarting the ApplicationMaster on failure.
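The multi-dimensional check behind the Container abstraction can be illustrated minimally. Again a sketch, not YARN's actual Resource class; the dictionary keys are illustrative:

```python
# Minimal sketch of the multi-dimensional resource check a scheduler performs
# before granting a Container; field names are illustrative, not YARN's API.

def fits(request, available):
    """A request fits only if every requested resource dimension fits."""
    return all(available.get(k, 0) >= v for k, v in request.items())

node = {"memory_mb": 8192, "vcores": 8, "disk_gb": 100, "network_mbps": 1000}
ask  = {"memory_mb": 2048, "vcores": 2}

print(fits(ask, node))                   # True: both dimensions fit
print(fits({"memory_mb": 16384}, node))  # False: not enough memory
```

Because every dimension must fit simultaneously, a request can be rejected even when the cluster has plenty of one resource left, which is exactly why the Scheduler treats the Container, not a single number, as its unit of allocation.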