Is Hadoop going to be out of date?

Source: Internet
Author: User
Tags: hadoop ecosystem


The word Hadoop is everywhere these days and has become almost synonymous with big data. In just a few years, Hadoop has grown from a fringe technology into a de facto standard. Today, if you want to do big data, business analytics, or business intelligence, it is hard to get by without it. But behind the Hadoop mania lies a quieter technological shift: the core technology of Hadoop is already considered outdated inside Google, because Hadoop is not good at handling "fast data."

Today, Hadoop appears to be the uncontested standard for enterprise big data, so deeply rooted in the enterprise that its position looks unshakable for the next decade. But GigaOM columnist Mike Miller strikes a discordant note: should businesses really keep paying for a technology with such serious shortcomings?

Origin: the Google File System and Google MapReduce

To explore Hadoop's life cycle, we need to trace it back to its source: Google's MapReduce. To meet the challenge of exploding data volumes, Google engineers Jeff Dean and Sanjay Ghemawat architected two far-reaching systems: the Google File System (GFS) and Google MapReduce (GMR). The former is an excellent solution for managing exabyte-scale data on commodity hardware; the latter is an equally elegant model for large-scale parallel processing of data on commodity servers.

The beauty of GMR is that it lets ordinary Google developers perform high-speed, fault-tolerant big data processing. GMR and GFS sit at the heart of the search engine's data processing pipeline, which crawls, analyzes, and ranks web pages and ultimately renders the search results users see every day.

Hadoop ecosystem

Let's look at the two major parts of Apache Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce, which are essentially replicas of GFS and GMR. While Hadoop is evolving into an all-encompassing ecosystem of data management and processing tools, the core of that ecosystem remains the MapReduce model: every dataset and every application is ultimately reduced to map and reduce operations.
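To make that constraint concrete, here is a minimal word-count sketch expressed as plain Python map and reduce functions. The local runner standing in for Hadoop's shuffle-and-sort step is purely illustrative and is not Hadoop's actual Java API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

def run_word_count(lines):
    """Toy local runner standing in for Hadoop's shuffle-and-sort step."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)          # shuffle: group by key
    return dict(reduce_phase(w, c) for w, c in grouped.items())

if __name__ == "__main__":
    docs = ["hadoop is a batch system", "google moved beyond batch"]
    print(run_word_count(docs))
```

However the problem is phrased, the framework only ever sees these two functions and a batch of input splits to run them over.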

Google has evolved, can Hadoop keep up?

The interesting thing is that GMR no longer occupies a prominent place in Google's software stack. Just as enterprises are being locked into MapReduce by Hadoop solutions, Google is already phasing the technology out. Although the Apache project and the commercial Hadoop distributions try to cover Hadoop's weak spots with HBase, Hive, and next-generation MapReduce (YARN), I believe that only new, non-MapReduce architectures that leverage the Hadoop core (HDFS and ZooKeeper) will be able to compete with Google's technology. (A more technical elaboration can be found in the Gluecon-miller-horizon slides.)

Percolator for incremental indexing and analysis of frequently changing datasets. Hadoop is a big "machine": once it is spun up to full speed it churns through data impressively, and your main worry is whether the disks can keep up. But every time you want to analyze the data, you have to reprocess all of it, and as the dataset grows, that full rescan stretches the analysis time out indefinitely.

So how did Google make its search results ever closer to real time? The answer was to replace GMR with Percolator, an incremental processing engine. Percolator processes only new, changed, or deleted documents and uses secondary indexes to build the index efficiently. "Converting the indexing system to an incremental system ... reduced the average document processing latency by a factor of 100," wrote the authors of the Percolator paper. In other words, new content on the web gets indexed 100 times faster than with MapReduce!
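The contrast between the two models can be sketched in a few lines. The in-memory index and the update function below are hypothetical stand-ins for Bigtable and Percolator's observer hooks, meant only to show why incremental updates avoid the full rescan.

```python
def batch_rebuild_index(all_documents):
    """Batch model: every run rescans the entire corpus."""
    index = {}
    for doc_id, text in all_documents.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def incremental_update(index, doc_id, old_text, new_text):
    """Incremental model: touch only the postings affected by one change."""
    for word in set(old_text.split()) - set(new_text.split()):
        index.get(word, set()).discard(doc_id)
    for word in set(new_text.split()):
        index.setdefault(word, set()).add(doc_id)

corpus = {"d1": "hadoop batch processing", "d2": "incremental web index"}
idx = batch_rebuild_index(corpus)          # one expensive full pass
incremental_update(idx, "d1", corpus["d1"], "fast data processing")
print(sorted(idx["processing"]))           # ['d1'] -- updated without a rescan
```

As the corpus grows, the cost of the first function grows with it, while the cost of the second stays proportional to the size of the change.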

Data sources like the Large Hadron Collider keep growing, and so does Twitter. That is also why trigger-style processing is appearing in HBase, and why Twitter Storm is emerging as a hot technology for real-time stream processing.

Dremel for ad hoc analysis. Both Google and the Hadoop ecosystem have tried to turn MapReduce into a usable ad hoc analysis tool. Many interface layers have been built, from Sawzall to Pig and Hive, and although they make Hadoop look more like a SQL system, people forget a basic fact: MapReduce (and Hadoop) was built for organizing batch data processing tasks. Its core is the workflow, not ad hoc analysis.

Today, a large share of BI and analytics queries are ad hoc: interactive, low-latency analyses. Hadoop's map and reduce workflows deter many analysts; the long cycle of submitting jobs and waiting for them to finish makes for a poor user experience in interactive work. In response, Google built Dremel (offered commercially as BigQuery), a dedicated tool that lets analysts scan petabytes of data and complete ad hoc queries in seconds, with support for visualization. In the Dremel paper, Google claims: "Dremel can complete aggregation queries over trillions of rows in seconds, 100 times faster than MapReduce!"
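The point of Dremel is the query pattern: a single declarative statement that returns interactively, instead of a compiled job that must be scheduled, run, and collected. Below is a minimal sketch of that pattern, using Python's built-in sqlite3 purely as a stand-in for a Dremel/BigQuery-style engine; the table and column names are invented for illustration.

```python
import sqlite3

# Stand-in for an interactive, SQL-speaking query engine. A real Dremel/BigQuery
# backend scans a columnar store across thousands of servers, but the shape of
# what the analyst writes is the same single declarative statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (url TEXT, country TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?)",
    [("/home", "US", 120), ("/home", "DE", 80), ("/docs", "US", 45)],
)

# The ad hoc query: no map function, no reduce function, no job submission.
rows = conn.execute(
    "SELECT country, SUM(views) AS total FROM pageviews "
    "GROUP BY country ORDER BY total DESC"
).fetchall()
print(rows)  # [('US', 165), ('DE', 80)]
```

Hive accepts a similar statement, but it compiles into one or more MapReduce jobs, which is exactly the start-up latency Dremel was built to avoid.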

Pregel for graph data analysis. Google MapReduce was designed to analyze the world's largest graph: the web. But it is much less suited to analyzing social networks, telecom networks, document links, and other graph data. For example, MapReduce is inefficient at computing single-source shortest paths (SSSP), and existing parallel graph libraries such as Parallel BGL or CGMgraph are not fault-tolerant.

So Google developed Pregel, a bulk synchronous parallel (BSP) system that can process petabytes of graph data on distributed commodity servers. Whereas Hadoop often produces an explosion of intermediate data when processing graphs, Pregel handles graph algorithms such as SSSP and PageRank naturally and efficiently, taking far less time and requiring far simpler code.
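Here is a minimal single-machine sketch of Pregel's vertex-centric, superstep-driven model, computing single-source shortest paths. The message-passing loop is a toy simulation of BSP supersteps, not Pregel's actual API.

```python
import math

# Toy graph: adjacency list of (neighbor, edge_weight) pairs.
GRAPH = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 5)],
    "c": [("d", 1)],
    "d": [],
}

def pregel_sssp(graph, source):
    """Vertex-centric SSSP: in each superstep, vertices that improved their
    distance send (distance + edge_weight) messages to their neighbors."""
    dist = {v: math.inf for v in graph}
    messages = {source: [0]}                      # superstep 0: seed the source
    while messages:                               # halt when no messages remain
        next_messages = {}
        for vertex, incoming in messages.items():
            best = min(incoming)
            if best < dist[vertex]:               # vertex becomes active
                dist[vertex] = best
                for neighbor, weight in graph[vertex]:
                    next_messages.setdefault(neighbor, []).append(best + weight)
        messages = next_messages                  # barrier: next superstep
    return dist

print(pregel_sssp(GRAPH, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 4}
```

The same computation expressed in MapReduce needs one full job per superstep, rewriting the entire graph to disk each time, which is where the exponential blow-up in intermediate data comes from.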

The only open-source option that comes close to Pregel today is Giraph, an early Apache incubator project that leverages HDFS and ZooKeeper. There is also a project called GoldenOrb available on GitHub.

Summary

In summary, Hadoop is an excellent tool for large-scale data processing on clusters of commodity hardware. But if you need to handle dynamic datasets, ad hoc analysis, or graph data structures, Google has shown us options that are significantly better than the MapReduce paradigm. Percolator, Dremel, and Pregel will no doubt become the new "big three," just as Google's old three giants were GFS, GMR, and BigTable.
