The word Hadoop is now everywhere and has become almost synonymous with big data. In just a few years, Hadoop has grown from a fringe technology into a de facto standard. Today, if you want to work with big data, enterprise analytics, or business intelligence, it is hard to avoid Hadoop. But behind the Hadoop craze a technological shift is brewing: Google has already moved beyond Hadoop's core technology, because Hadoop is not good at processing "fast data."
Today, Hadoop appears to be the undisputed standard for enterprise big data technology, so firmly rooted in the enterprise that its position looks unshakable for the next ten years. But GigaOM columnist Mike Miller strikes a dissenting note: "Will companies really keep paying for a technology that appears to be flourishing but is already in decline?"
To understand Hadoop's life cycle, we need to go back to its source of inspiration: Google's MapReduce. To meet the challenge of exploding data volumes, Google engineers Jeff Dean and Sanjay Ghemawat built two far-reaching systems: the Google File System (GFS) and Google MapReduce (GMR). The former is a practical, proven way to manage exabyte (EB) scale data on commodity hardware. The latter is an equally excellent design and implementation for processing data in parallel across commodity servers.
The beauty of GMR is that it lets ordinary Google developers run high-speed, fault-tolerant big data jobs. Together, GMR and GFS became the core of the search engine's data processing pipeline, which crawls, analyzes, and ranks web pages before presenting the daily search results to users.
Hadoop ecosystem
Now look back at the two major components of Apache Hadoop: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce, which are essentially replicas of GFS and GMR. Although Hadoop is growing into an all-encompassing data management and processing ecosystem, the MapReduce system still sits at its core: all data and applications are ultimately reduced to Map and Reduce jobs.
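To make the Map and Reduce model concrete, here is a minimal, single-machine Python sketch of the classic word-count job. The function names and the in-memory "shuffle" are illustrative simplifications, not Hadoop's actual API; a real job would be written against Hadoop's Java Mapper and Reducer classes and run across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle_phase(pairs):
    """Shuffle: group all values by key (the framework does this in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    counts = dict(reduce_phase(shuffle_phase(map_phase(docs))))
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The point of the model is that each phase is embarrassingly parallel: maps run independently per input split, and reduces run independently per key group, which is what lets Hadoop scale the same three steps across thousands of machines.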
Google has evolved, can Hadoop keep up?
The interesting thing is that GMR no longer occupies a prominent place in Google's software stack. While enterprises are locking themselves into MapReduce through Hadoop solutions, Google is already phasing the technology out. The Apache project and the commercial Hadoop distributions try to patch Hadoop's shortcomings with HBase, Hive, and the next generation of MapReduce (also known as YARN). But in the author's view, only new, non-MapReduce-based architectures built on the Hadoop core (HDFS and ZooKeeper) can truly compete with Google's technology. (For a more technical explanation, see: gluecon-miller-horizon.)
Percolator for incremental indexing and analysis of frequently changing datasets. Hadoop is a big "machine": once it is up and running at full speed, its data-processing throughput is astonishing, and your only worry is whether disk transfer speeds can keep up. But every time you start an analysis, you have to scan the entire dataset. As the dataset keeps growing, this makes analysis times stretch out indefinitely.
So how did Google push its search results ever closer to real time? The answer was to replace GMR with Percolator, an incremental processing engine. Percolator processes only newly added, changed, or deleted documents and uses secondary indexes to build the catalog efficiently and answer queries. The authors of the Percolator paper wrote: "Converting the indexing system to an incremental system... reduces document processing latency by a factor of 100." In other words, indexing new content on the web is 100 times faster than with MapReduce!
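A toy Python sketch of the idea, entirely hypothetical (Percolator itself is built on observers and distributed transactions over Bigtable): instead of rebuilding the whole index on every run, only the documents that changed since the last run are reprocessed.

```python
# Hypothetical incremental-indexing sketch; not the Percolator API.
# repository: dict of doc_id -> text, index: dict of word -> set of doc_ids.

def full_reindex(repository):
    """Batch approach: rebuild the entire index from scratch (MapReduce-style)."""
    index = {}
    for doc_id, text in repository.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

def incremental_update(index, old_doc, new_doc, doc_id):
    """Incremental approach: touch only the words of the one changed document."""
    for word in set(old_doc.split()) - set(new_doc.split()):
        index[word].discard(doc_id)                # word no longer in the document
    for word in new_doc.split():
        index.setdefault(word, set()).add(doc_id)  # word added or kept
    return index

repo = {1: "hadoop batch processing", 2: "google percolator"}
index = full_reindex(repo)                         # expensive: scans every document
index = incremental_update(index, repo[2], "google percolator incremental", 2)
print(sorted(index["incremental"]))                # [2] -- only document 2 was reprocessed
```

The batch version's cost grows with the size of the whole repository, while the incremental version's cost grows only with the size of the change, which is the shift in scaling behavior the Percolator paper describes.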
Datasets like those from the Large Hadron Collider keep growing, and so does Twitter's data. This is why trigger-style processing is being added to HBase, and why Twitter Storm is emerging as a popular technology for real-time processing of streaming data.
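For a flavor of what stream processing means in practice, here is a minimal conceptual Python sketch (not the Storm API, which is Java-based and built around spouts and bolts): results are updated as each event arrives, rather than after a nightly batch run.

```python
from collections import Counter

def event_stream():
    """Stand-in for a live feed (e.g., tweets); in Storm this role is played by a spout."""
    for line in ["hadoop is big", "storm is fast", "storm is real time"]:
        yield line

def process_stream(stream):
    """Update word counts per event as it arrives; in Storm this role is played by a bolt."""
    counts = Counter()
    for event in stream:
        counts.update(event.split())
        yield dict(counts)  # downstream consumers see fresh results continuously

for snapshot in process_stream(event_stream()):
    print(snapshot)  # counts are available after every event, not after the whole dataset
```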
Dremel for ad hoc analysis. Both Google and the Hadoop ecosystem have worked to make MapReduce usable for ad hoc analysis. Many interface layers have been built, from Sawzall to Pig and Hive, and although they make Hadoop look more like a SQL system, they obscure a basic fact: MapReduce (and Hadoop) was developed to organize data processing jobs. It was born as a workflow engine at heart, not an ad hoc analysis tool.
Today, a large share of BI/analytics queries are ad hoc: interactive, low-latency analysis. Hadoop's Map and Reduce workflows discourage many analysts, because the long cycle of starting jobs and waiting for workflows to finish makes for a poor experience in interactive analysis. In response, Google built Dremel (known in the market as the BigQuery product), a purpose-built tool that lets analysts scan petabytes (PB) of data in seconds to answer ad hoc queries, and it also supports visualization. In the Dremel paper, Google claims: "Dremel can complete aggregation queries over trillions of rows of data in a few seconds, which is 100 times faster than MapReduce!"
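To show what this looks like from the analyst's side, here is a short Python sketch using the Google Cloud BigQuery client library, Dremel's external product form; the project, dataset, and table names are placeholders invented for illustration. The same aggregation on Hadoop would mean writing and launching a full MapReduce (or Hive) job.

```python
# Sketch of an ad hoc aggregation through BigQuery; resource names below are made up.
from google.cloud import bigquery

client = bigquery.Client(project="my-demo-project")

sql = """
    SELECT word, COUNT(*) AS occurrences
    FROM `my-demo-project.web_corpus.documents`
    GROUP BY word
    ORDER BY occurrences DESC
    LIMIT 10
"""

# The query runs on the service and results stream back interactively,
# typically within seconds even over very large tables.
for row in client.query(sql).result():
    print(row.word, row.occurrences)
```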
Pregel for analyzing graph data. Google MapReduce was originally designed to analyze the world's largest graph: the web. But it does far less well on other graph data, such as social networks, telecommunications equipment, or document links. For example, MapReduce is very inefficient at computing single-source shortest paths (SSSP), and existing parallel graph libraries such as Parallel BGL or CGMgraph offer no fault tolerance.
So Google developed Pregel, a bulk synchronous graph processing system that can handle petabyte-scale graphs on clusters of distributed commodity servers. Whereas Hadoop often produces exponential data amplification when processing graph data, Pregel expresses graph algorithms such as SSSP or PageRank naturally and efficiently, with far shorter run times and much simpler code.
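To show what the vertex-centric model means in practice, here is a small single-machine Python sketch of Pregel-style SSSP under assumed names; the real Pregel is a distributed C++ system, and only its core ideas (supersteps, message passing between vertices, vertices going inactive when they have nothing to do) are mirrored here.

```python
# Toy Pregel-style SSSP: vertices exchange messages in synchronized supersteps.
# Single-process illustration with made-up structure; not Google's Pregel API.
INF = float("inf")

def pregel_sssp(graph, source):
    """graph: {vertex: [(neighbor, edge_weight), ...]}. Returns shortest distances."""
    distance = {v: INF for v in graph}
    inbox = {v: [] for v in graph}
    inbox[source].append(0)                      # superstep 0: seed the source vertex

    while any(inbox[v] for v in graph):          # run until no messages remain in flight
        outbox = {v: [] for v in graph}
        for v in graph:                          # "compute" phase of the superstep
            if not inbox[v]:
                continue                         # inactive vertex: received no messages
            best = min(inbox[v])
            if best < distance[v]:               # found a shorter path: update and
                distance[v] = best               # notify neighbors for the next superstep
                for neighbor, weight in graph[v]:
                    outbox[neighbor].append(best + weight)
        inbox = outbox                           # synchronization barrier between supersteps
    return distance

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
    "d": [],
}
print(pregel_sssp(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 6}
```

Expressing the same iterative algorithm in Hadoop would require chaining one MapReduce job per superstep and rewriting the entire graph to HDFS between jobs, which is where the data amplification and long run times come from.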
At present, one open source option comparable to Pregel is Giraph, an early-stage Apache incubator project that builds on HDFS and ZooKeeper. There is also a project called GoldenOrb available on GitHub.
To sum up
All in all, Hadoop is an excellent tool for large-scale data processing on clusters of commodity hardware. But if you need to handle dynamic datasets, ad hoc analysis, or graph data structures, Google has shown us technologies that are far better suited than the MapReduce paradigm. There is little doubt that Percolator, Dremel, and Pregel will become the new "Big Three" of big data, just as Google's old "Big Three" of GFS, GMR, and BigTable did before them.