At 3 o'clock in the morning, Arun C. Murthy was woken by a phone call: the company needed him to deal with a software bug, urgently. At the time he was an engineer on Yahoo's advertising application, and the app was running slowly because some of the code it used to work with the open source platform Hadoop was badly written.
Although that code was written by someone else, fixing it was Murthy's job. No one would have guessed that this little bug would, a few years later, lead to the official Hadoop 2.0 and set a whole new course for Hadoop, a piece of software that has become almost synonymous with the idea of big data.
Today Hadoop is used at many companies, including Facebook, Twitter, eBay, and Yahoo. But back in 2007, before that phone call, it was not nearly so capable.
Doug Cutting joins Yahoo
A year before that phone call, inspired by a white paper Google had published in 2004, Doug Cutting and Michael Cafarella created the Hadoop platform, and Cutting joined Yahoo. Murthy was the one called in to dig into Yahoo's Hadoop problem because he had more experience with systems software.
When he was first invited to work on Hadoop, he had asked, "Who the hell writes system software in Java?" But he accepted anyway, and that night he kept cursing: "What the hell am I doing debugging someone else's Hadoop code?" Then he found himself cursing even harder, because he discovered that the app he was fixing, the ad-targeting app, wasn't really suited to running on Hadoop as it stood.
Hadoop is really a two-part software platform: a storage system called the Hadoop Distributed File System (HDFS) and a processing system called MapReduce. You dump huge amounts of data into the system, spread it across dozens, hundreds, or thousands of servers, and then use MapReduce to split a big problem into small ones that run across the cluster. That is the charm of Hadoop: you can save money by using lots of cheap commodity servers instead of buying a handful of expensive supercomputers.
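To make that split concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API: each mapper emits a (word, 1) pair for every word in its slice of the input, and each reducer adds up the counts for one word. The input and output paths are simply whatever HDFS directories you point it at.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Each mapper sees one slice of the input and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Each reducer receives every count for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // an HDFS directory of text files
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // results are written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}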
One small problem, though: sometimes developers want to pull data out of the cluster without running the whole MapReduce machinery, and that was exactly the trouble with Yahoo's ad-targeting app. Murthy's first thought was that Hadoop needed another system.
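As a rough illustration of what "pulling data out without MapReduce" looks like at the code level, here is a sketch that reads a file straight from HDFS with Hadoop's FileSystem client; no MapReduce job is launched. The NameNode address and file path below are placeholders.

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Placeholder URI: point this at your own NameNode and file.
    String uri = "hdfs://namenode:8020/ads/targeting/part-00000";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    InputStream in = null;
    try {
      // Open the file and stream it to stdout -- a plain read, no MapReduce involved.
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}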
Murthy's first instinct: Hadoop needs another system
Once the bug had been patched with a temporary workaround, he began plotting how to fix the bigger one. From 2008 to 2010, the Hadoop team concentrated on improving Hadoop's security and stability to make it more enterprise-ready. Many related systems appeared, such as Pig and Hive, which were built into the main distributions and were meant to let people query Hadoop without writing MapReduce jobs. In practice, though, they never escaped MapReduce: every query was simply translated into MapReduce jobs under the hood.
By 2010 the Hadoop team felt it was time for an overhaul. Murthy and developers from across the Hadoop community set out to fix the old problem, and the end result was YARN, the addition at the heart of Hadoop 2.0.
The birth of YARN
YARN is a system that lets developers build applications that work with data in HDFS without having to fire up the entire MapReduce engine. "2.0 is not really an arbitrary number," Murthy said. "It really is a second system for Hadoop."
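As a small taste of what programming against YARN rather than MapReduce looks like, here is a sketch, assuming the Hadoop 2.x client libraries, that uses the YarnClient API to connect to the cluster's ResourceManager and list whatever applications are running there, MapReduce or not. Cluster addresses are assumed to come from the usual yarn-site.xml on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    // Reads yarn-site.xml / core-site.xml from the classpath to find the ResourceManager.
    YarnConfiguration conf = new YarnConfiguration();

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // YARN tracks every application on the cluster, not just MapReduce jobs.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.printf("%s  %s  %s%n",
          app.getApplicationId(), app.getName(), app.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}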
Once YARN was settled on, a lot of new software started appearing to round out Hadoop. Twitter, for example, uses Storm to process data in real time, and Yahoo uses Spark to crunch stored data. Cloudera created Impala to speed up queries against data in Hadoop.
But as long as developers are willing, Murthy says, these systems can run on YARN and query the data sitting in Hadoop, which makes the whole big-data stack that much more efficient.
Nodeable, an IT monitoring company, built StreamReduce, a system that bridges Storm and Hadoop. Its vice president (now a vice president at Appcelerator, which acquired Nodeable) says that YARN is exactly what they will need going forward, whether for batch processing or real-time processing.
Hadoop 2.0
Spark mostly runs on top of HDFS. Although it drops MapReduce and lives outside the official Hadoop project, YARN is enough to tie the two together. If all you want is a simple deployment you don't have to use YARN at all, but plenty of users like it and want it installed.
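To see how loose that coupling is, here is a small sketch of a Spark job written with Spark's Java API that reads a file out of HDFS; whether it runs on YARN or on Spark's own standalone scheduler is decided outside the code, for example with spark-submit --master yarn. The HDFS path is a placeholder.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsLineCount {
  public static void main(String[] args) {
    // The cluster manager (YARN, standalone, ...) is chosen at submit time,
    // e.g. spark-submit --master yarn, and is not hard-coded here.
    SparkConf conf = new SparkConf().setAppName("hdfs-line-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Spark reads its input straight from HDFS, with no MapReduce involved.
    JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/logs/events.log");
    System.out.println("lines: " + lines.count());

    sc.stop();
  }
}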
YARN already ships in many Hadoop distributions, including Cloudera's. The beta of the official open source Hadoop 2.0 is coming soon; it may take a while to work its way through the market, but it will make a big difference once it catches on. Either way, we have that 3 a.m. phone call to thank.