The Apache Software Foundation has finally released Hadoop 2, the latest version of its data analysis platform, and the release has set off another round of public excitement about a "great leap forward" in big data. Before the release, I wrote an article, "Hadoop is big data applications and why not", analyzing the state of the domestic big data market. Now that Hadoop 2 is out, will it stimulate big data applications and development the way the media expects?
I think the first thing to look at is what improvements Hadoop 2 actually makes. According to the reports, the biggest changes are the release of YARN, a new resource management and data processing engine that improves on MapReduce, and the addition of high-availability features to the Hadoop Distributed File System (HDFS).
A few technical details are worth reviewing. To access Hadoop data, you traditionally develop Java applications that implement MapReduce, which involves a real learning curve. Alternatively, you can use HBase, which processes data with an approximately database-like model. The Hive data warehouse lets you write queries in HiveQL, an SQL-like query language, and translates them into MapReduce jobs. However, Hadoop 1 is still effectively limited to running one workload at a time: MapReduce jobs, Hive queries, HBase operations and so on must take turns, and that is the bottleneck.
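The MapReduce model that Hadoop implements can be illustrated with a minimal sketch, a toy word count in plain Python. This is not the Hadoop API (real jobs are written against Hadoop's Java MapReduce interfaces, or generated from HiveQL); it only shows the map, shuffle, and reduce phases the framework runs for you:

```python
from collections import defaultdict

# Toy illustration of the MapReduce model (not the Hadoop API).

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework would
    # before handing each key's values to a reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

lines = ["big data needs Hadoop", "Hadoop 2 adds YARN to Hadoop"]
pairs = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts["hadoop"])  # 3
```

In Hadoop 1 the JobTracker schedules jobs built from exactly this pattern, which is why everything, including Hive queries, ultimately queues up as MapReduce work.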
The Hadoop development community was well aware of this issue and addressed it in Hadoop 2 by replacing the MapReduce resource management layer with YARN (Yet Another Resource Negotiator).
Arun Murthy, the YARN project lead, points out that the difference between Hadoop 1.0 and 2.0 is that everything in the former is batch-oriented, while the latter allows multiple applications to access the data at the same time.
In other words, separating resource management from the rest of the MapReduce system makes Hadoop cluster resources far more flexible. YARN manages the cluster much the way an operating system schedules tasks, so there is no longer a one-job-at-a-time limit.
With YARN, developers can build applications that run directly inside Hadoop, rather than pulling data out of it the way many third-party tools do.
From Hadoop 1.0 to 2.0 there is no essential difference for the user. Technically, it simplifies development: an incremental accumulation rather than a qualitative change. To the end user, whether it is MapReduce or YARN, both are just forms of resource scheduling and use.
So whether it is Hadoop 1.0 or 2.0, its biggest contribution is still that it gives us the chance to process large volumes of structured data with cheap x86 hardware, which is the main reason big data applications are so widely promoted and discussed. From the current point of view, domestic big data applications need big data service providers, and whether those providers use MapReduce or YARN is not important. What matters is not the tool but the service and the result. Neither MapReduce nor YARN is usable by the average non-professional, and neither is anywhere near as simple as using a PC. What is needed now are providers who use MapReduce or YARN to deliver specialized services.
Hadoop 2 will promote the application and development of big data, but there is little reason for optimism as long as the key issues in China remain unsolved.