In the past, Hadoop seemed to be synonymous with big data. But as big data applications have matured, it has become increasingly common to think of Hadoop as nothing more than a storage tool for large volumes of data.
That is not necessarily a bad thing. Treating Hadoop as cheap, efficient storage is a good starting point for the next phase of its evolution. Hadoop 2.0, which is due to be unveiled this summer, is meant to make the information held in data warehouses and in pools of unstructured data more accessible than ever before.
Hadoop barrel
Hadoop has been an excellent data storage system ever since it emerged as a big data tool, but getting at the data means writing MapReduce applications in Java, and those are hard to learn.
Of course, there are other ways to get information out of Hadoop. HBase, the database that is part of Hadoop, lets users work with data in a database paradigm. The Hive data warehouse lets you write queries in HiveQL, its SQL-like query language, and translates them into MapReduce tasks. But Hadoop still handles work one job at a time: MapReduce tasks, Hive queries, HBase operations, and so on, take turns.
This is why many big data vendors use Hadoop only as a data container and, to improve efficiency, develop their own tools to retrieve or analyze the data. Hadoop is portrayed here as a big barrel, but its users have come to see it as a data lake or even a data ocean. Sheer size alone is not enough, however, and those restrictions undermine Hadoop's selling points.
The Hadoop development community is aware of this problem, and the restrictions are about to be largely lifted as Hadoop moves to its new version.
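To make the idea of a query being translated into MapReduce tasks more concrete, here is a minimal sketch in plain Python. It does not use Hadoop or Hive at all; the function names (map_phase, shuffle, reduce_phase, count_by_page) and the sample records are invented for illustration. It shows roughly how a query such as "SELECT page, COUNT(*) ... GROUP BY page" could be broken into a map step, a grouping step, and a reduce step.

```python
from collections import defaultdict


def map_phase(records):
    """Emit a (key, 1) pair per record, like a mapper for GROUP BY page."""
    for rec in records:
        yield rec["page"], 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, like a reducer computing COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(records):
    """Chain the three steps; hypothetical stand-in for one compiled query."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical sample data, not taken from the article.
logs = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
print(count_by_page(logs))  # {'/home': 2, '/docs': 1}
```

A real MapReduce job distributes the map and reduce steps across many machines; the sketch only shows the shape of the computation that a SQL-like query is turned into.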
The YARN solution
According to Arun Murthy, the release manager for Hadoop 2.0, the most important change is that the MapReduce framework is being overhauled into Apache YARN, which expands the types of software and applications that can run in Hadoop. Murthy, who also leads the YARN project, points out that the difference between Hadoop 1.0 and 2.0 is that everything in the former is batch oriented, while the latter allows multiple applications to access the data internally.
By separating resource management from what the current MapReduce system handles, YARN makes the management of Hadoop cluster resources more powerful. It manages the cluster much as an operating system handles tasks, so work is no longer restricted to one operation at a time.
With YARN, developers can build applications that run directly inside Hadoop, rather than pulling the data out and sifting through it externally, as many third-party tools do.
Murthy says there are already vendors interested in developing applications within the YARN framework. He estimates that a beta version of Hadoop 2.0 is likely to be launched in June or July, and that the official version may be released in August.
If YARN does fulfill its promise, developers will be able to reach the contents of the data lake or data ocean easily from the native Hadoop platform, making the search for useful information more fluid and convenient. By then, big data will become more useful and more popular.
(Responsible editor: Fumingli)