As the concept of big data has gained momentum, Hadoop has been in the spotlight for some time as its most representative technology. The entire Hadoop ecosystem is developing at a rapid pace, with new features or new tools appearing almost every day. Some are minor improvements, such as better scheduling support in Oozie; some are still under development, such as NFS support; and some are very cool, such as full CPython support in Pig. But in my opinion, none of these is as revolutionary as YARN in Hadoop 2.0.
As we all know, Hadoop has two core components: HDFS (a distributed file system) and MapReduce (a distributed processing framework). The arrival of YARN, in turn, transforms Hadoop from a distributed processing framework into a distributed operating system.
The word "operating system" may be a lot of users spit slot, is not exaggerated point? Here I would like to quote computer Daniel Andrew S.tanenbaum's definition of the OS in the book Modern operating system, which includes:
- As an extended (virtual) machine: the operating system presents users with an extended computer, or virtual machine, that is easier to program than the underlying hardware.
- As a resource manager: the operating system provides an orderly, controlled allocation of processors, memory, I/O devices, and other resources among the various programs that use them.
Hadoop has satisfied the first condition since version 1.0. The arrival of YARN satisfies the second. So it seems to me that Hadoop can now be considered a distributed operating system.
YARN is the resource manager for Hadoop. Earlier versions of Hadoop were built around MapReduce, and although there were many attempts at other computational paradigms, none of them escaped the framework MapReduce provides. It handled workloads through the JobTracker and TaskTrackers and managed server resources by configuring each node with a fixed number of map and reduce slots.
In Hadoop 2.0, MapReduce is no longer the only option. YARN provides a better, more flexible design that separates resource management from the computation itself. For the details of YARN, you can read Hortonworks architect Arun Murthy's series of articles, which cover it in depth.
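To make that separation concrete, here is a minimal sketch of how an application might ask YARN for resources through its Java client API (org.apache.hadoop.yarn.client.api.YarnClient). The application name, queue, memory size, and the shell command used as a stand-in ApplicationMaster are illustrative placeholders of my own, not details from this article.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager using the cluster configuration on the classpath.
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();

        // Ask YARN for a new application id and fill in the submission context.
        YarnClientApplication app = client.createApplication();
        ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
        context.setApplicationName("yarn-sketch");            // illustrative name
        context.setQueue("default");                           // assumed queue
        context.setResource(Resource.newInstance(1024, 1));    // 1 GB, 1 vcore for the AM

        // The ApplicationMaster here is just a placeholder shell command;
        // a real framework (MapReduce, Spark, Tez, ...) ships its own AM.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.<String, LocalResource>emptyMap(), // local resources
                Collections.<String, String>emptyMap(),        // environment
                Collections.singletonList("sleep 60"),         // AM launch command (placeholder)
                null, null, null);
        context.setAMContainerSpec(amContainer);

        // Submit; from here on YARN owns scheduling and container allocation.
        ApplicationId id = client.submitApplication(context);
        System.out.println("Submitted application " + id);

        client.stop();
    }
}
```

The point of the sketch is that any framework, not just MapReduce, can negotiate resources with YARN this way: the framework brings its own ApplicationMaster, and YARN only hands out containers.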
It is important to emphasize that this separation of resource management from computation has already been applied in real environments and has yielded good results, for example:
- Storm on YARN: Twitter's stream-computing framework running on Hadoop (Yahoo!)
- Apache Samza: a YARN-based project that can be used as a substitute for Storm (Apache)
- Hoya: HBase on YARN, a tool for deploying HBase on a cluster (Hortonworks)
- Weave: a wrapper around YARN used to simplify application deployment (Continuuity)
- Giraph: a graph-processing system (Apache)
- Llama: a framework that lets external servers obtain resources from YARN (Cloudera)
- Spark on YARN: a memory-based cluster analytics technology
- Tez: a generic, highly customizable framework for simplifying the creation of data-processing tasks in Hadoop, supporting both small-scale (low-latency) and large-scale (high-throughput) workloads (Hortonworks)
In my opinion, the introduction of YARN is revolutionary for Hadoop and can change the rules of the game. And it has already moved from theory to practice; it is not some future concept. In fact, Hadoop 2.0 is now officially GA, and users can try YARN directly.