Abstract: Hadoop has become synonymous with big data. In just a few years, Hadoop has gone from an edge technology to a de facto standard. Meanwhile, MapReduce is no longer prominent at Google; while other companies are still focusing on MapReduce, Google seems to have already moved on to the next era.
Hadoop technology is everywhere. For better or worse, Hadoop has become synonymous with big data. In just a few years it has gone from an edge technology to the de facto standard. Hadoop seems to be not only the standard for enterprise big data, but its position also looks unshakable for the foreseeable future.
Google File System and MapReduce
Let's first discuss the soul of Hadoop: MapReduce. Facing explosive growth in data, Google engineers Jeff Dean and Sanjay Ghemawat developed two groundbreaking systems: Google File System (GFS) and Google MapReduce (GMR). The former is an excellent, practical solution for scaling and managing data on commodity hardware; the latter is equally brilliant, a computing framework suited to large-scale parallel processing.
Google MapReduce (GMR) gives ordinary developers and users a simple way to process big data quickly and with fault tolerance. GFS and GMR together also powered the core of Google's search engine: crawling and analyzing web pages.
Now let's look back at Hadoop in the open-source world. The Hadoop Distributed File System (HDFS) and Hadoop MapreReduce of Apache Hadoop are fully open-source implementations of Google File System (GFS) and Google MapReduce (GMR). The Hadoop project has grown into an ecosystem that reaches every corner of the big data field, but fundamentally its core is still MapReduce.
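To make the programming model concrete, here is a minimal, single-process word-count sketch in Python. It only mimics the map, shuffle, and reduce phases; real GMR/Hadoop jobs express the same two functions through the framework's own API and run them in parallel over data stored in GFS/HDFS.

```python
# A toy illustration of the MapReduce model: map, shuffle/group, reduce.
from collections import defaultdict

def map_phase(document):
    """Emit (key, value) pairs: one ("word", 1) per word."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    """Combine all values emitted for one key."""
    return word, sum(counts)

documents = ["Hadoop is everywhere", "MapReduce made Hadoop possible"]

# Shuffle: group every emitted value by its key (the framework does this in real systems).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_phase(doc):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)  # {'hadoop': 2, 'is': 1, 'everywhere': 1, ...}
```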
Can Hadoop catch up with Google?
An interesting phenomenon is that MapReduce is no longer prominent at Google. While other companies are still focusing on MapReduce, Google seems to have already entered the next era. In fact, the technologies discussed here are no longer new, and MapReduce is no exception.
I hope the technologies described below will make the post-Hadoop era more competitive. Although many Apache community projects and commercial Hadoop projects are very active, and the Hadoop ecosystem keeps improving through HBase, Hive, and next-generation MapReduce (YARN), I still believe that the Hadoop core (HDFS and ZooKeeper) needs to break away from MapReduce and gain competitiveness through new architectures before it can truly compete with Google's technology.
Google Percolator: incremental indexing and analysis of frequently changing datasets
Hadoop's strength is that once a job is running, it analyzes your data quickly. However, before each analysis (that is, after any data is added, modified, or deleted), the entire dataset must be streamed through again. This means that as the dataset grows, the analysis time grows with it and becomes unpredictable.
So how does Google make its search results more and more real-time? An incremental processing engine named Percolator replaced Google MapReduce (GMR) for indexing. By processing only new, changed, and deleted documents and using secondary indexes for efficient classification and querying, Google dramatically reduced the time needed to reach its goal.
The authors of Percolator wrote that "converting the indexing system to an incremental system ... reduced the average document processing latency by a factor of 100." In other words, new content on the web is indexed 100 times faster than with the previous MapReduce-based system.
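As a rough contrast (purely illustrative, not Percolator's actual observer/transaction machinery), the difference between full batch reprocessing and incremental maintenance of an index looks roughly like this:

```python
# Toy inverted index contrasting batch reprocessing (the MapReduce-style model)
# with incremental maintenance (the idea behind Percolator).

def build_index_batch(documents):
    """Batch model: any change means re-scanning every document from scratch."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

def update_index_incremental(index, doc_id, new_text):
    """Incremental model: only the changed document is reprocessed."""
    for postings in index.values():          # drop stale postings for this document
        postings.discard(doc_id)
    for word in new_text.lower().split():    # add postings for the new content
        index.setdefault(word, set()).add(doc_id)

docs = {1: "hadoop big data", 2: "google percolator"}
index = build_index_batch(docs)              # cost grows with the whole corpus
update_index_incremental(index, 2, "google percolator incremental indexing")
# cost grows only with the size of the change, so latency stays low as the corpus grows
```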
Google Dremel: a real-time data analysis solution
The Google and Hadoop communities once devoted themselves to making MapReduce-based ad hoc data analysis easier to use, with tools such as Google's parallel processing language Sawzall and Apache Pig and Hive. But these SQL-friendly layers gloss over a basic fact: MapReduce was built to manage data processing jobs. Its core strength is workflow management, not real-time data analysis.
In stark contrast, many BI and data analysis queries need to be instant, interactive, and low-latency. Using Hadoop for them means planning a workflow, and that workflow is excessive for many queries and analyses. Even then, it can take minutes for a job to start and hours for the workflow to complete, which is incompatible with an interactive experience. Therefore Google developed Dremel in response. Dremel is Google's "interactive" data analysis system: it can process petabyte-scale data in seconds and easily handle ad hoc queries.
Google Dremel's design features:
Dremel is a large-scale, scalable system. Reducing query time to seconds on a petabyte-scale dataset requires massive concurrency. A disk's sequential read throughput is on the order of 100 MB/s, so processing 1 TB of data within 1 second means reading from at least 10,000 disks concurrently (see the back-of-the-envelope calculation below). Google has always been good at squeezing work out of cheap machines, but the more machines there are, the higher the probability that something fails. A cluster of that size needs enough fault tolerance that the overall analysis speed is not dragged down by individual nodes.
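A quick back-of-the-envelope check of that figure, assuming roughly 100 MB/s of sequential read throughput per disk:

$$ \frac{1\ \text{TB}}{100\ \text{MB/s}} \approx \frac{10^{6}\ \text{MB}}{100\ \text{MB/s}} = 10^{4}\ \text{s on a single disk} \quad\Rightarrow\quad \text{finishing in } 1\ \text{s needs} \approx 10^{4}\ \text{disks in parallel.} $$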
Dremel is a complement to MapReduce. Like MapReduce, Dremel needs a file system such as GFS as its storage layer. It was never designed as a substitute for MapReduce; it only performs analysis very fast, and in practice it is often used to query MapReduce result sets or to build analysis prototypes.
Dremel's data model is nested. Internet data is often non-relational, so a flexible data model is crucial. Dremel supports a nested data model similar to JSON. The traditional relational model, with its unavoidable mass of join operations, is often powerless at this scale.
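For illustration, a nested record might look like the following (loosely modeled on the document example in the Dremel paper; the field names here are only illustrative). The equivalent relational schema would spread the same information over several tables that must be joined back together at query time.

```python
# A nested, JSON-like record: a document, its names/languages, and its links
# live together in one structure.
document = {
    "doc_id": 10,
    "name": [
        {"url": "http://A", "language": [{"code": "en-us", "country": "us"}]},
        {"url": "http://B"},
    ],
    "links": {"forward": [20, 40, 60], "backward": []},
}

# A relational schema would split this into Documents, Names, Languages, and
# Links tables, and every query over whole documents would pay for the joins.
```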
Data in Dremel is stored in columns. With columnar storage, a query scans only the columns it actually needs, reducing both CPU work and disk traffic. Columnar storage is also compression-friendly, and compression lets CPU and disk be traded off against each other to maximize overall efficiency.
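A toy sketch of the difference (not Dremel's actual storage format): computing the average of one field touches every field under row storage, but only a single contiguous column under columnar storage.

```python
# Query: average "latency" over all records.
rows = [
    {"url": "http://A", "bytes": 3120, "latency": 12.0},
    {"url": "http://B", "bytes": 904,  "latency": 48.5},
    {"url": "http://C", "bytes": 77,   "latency": 7.2},
]

# Row storage: every field of every record is read, even though only one is needed.
row_scan = sum(r["latency"] for r in rows) / len(rows)

# Columnar storage: each column is stored contiguously, so the query touches
# only the "latency" column (which also compresses well on its own).
columns = {
    "url":     [r["url"] for r in rows],
    "bytes":   [r["bytes"] for r in rows],
    "latency": [r["latency"] for r in rows],
}
col_scan = sum(columns["latency"]) / len(columns["latency"])

assert row_scan == col_scan  # same answer, far less data read in the columnar case
```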
Dremel combines web search and parallel DBMS technologies. From web search it borrows the concept of a "query tree": a large, complex query is split into many small, simple queries that can run concurrently on a large number of nodes, with results aggregated back up the tree. Like a parallel DBMS, Dremel also provides an SQL-like interface, much as Hive and Pig do.
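A minimal sketch of the query-tree idea (not Dremel's actual architecture): the root fans an aggregate query out to leaf servers, each leaf scans its own data partition, and partial results are merged on the way back up.

```python
# Toy serving-tree aggregation for the query AVG(value) over all partitions.
partitions = [
    [12.0, 48.5, 7.2],      # data held by leaf server 0
    [3.3, 19.9],            # data held by leaf server 1
    [25.0, 5.5, 61.2, 8.8]  # data held by leaf server 2
]

def leaf_query(values):
    """Each leaf answers with a partial (sum, count) instead of raw rows."""
    return sum(values), len(values)

def root_query(partitions):
    """The root merges partial aggregates into the final average."""
    partials = [leaf_query(p) for p in partitions]  # would run concurrently in practice
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

print(root_query(partitions))  # global average computed without centralizing the rows
```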
Google's graph computing framework Pregel
Google MapReduce was designed to crawl and analyze the world's largest graph, the Internet, but it is a poor fit for large-scale graph algorithms such as graph traversal (BFS), PageRank, and single-source shortest paths (SSSP), which are iterative by nature. Therefore, Google built Pregel.
Pregel is very impressive. Not only does it execute algorithms such as SSSP and PageRank efficiently, but, even more strikingly, published figures show it processing graphs with billions of vertices and trillions of edges in just a few minutes, with execution time growing linearly with graph size.
Pregel is based on the BSP model and follows a "compute", "communicate", "synchronize" cycle:
- Input and output are directed graphs.
- Computation is split into supersteps.
- Computation is vertex-centric: within a superstep, each vertex executes its own task, and the order in which vertices run is not defined.
- Communication happens between consecutive supersteps: messages sent in one superstep are delivered at the start of the next.
Computation in Pregel is vertex-centric. At superstep 0 every vertex is active; a vertex can "vote to halt" and become inactive, and it is reactivated whenever it receives a message. The algorithm terminates when no vertex is active and no messages are in flight. Fault tolerance is handled through checkpoints: at the beginning of each superstep, the master and worker nodes each back up their state.
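Here is a minimal, single-machine sketch of Pregel-style single-source shortest paths. The driver loop and vertex names are illustrative only; this shows the BSP superstep/message/vote-to-halt idea, not Pregel's actual API.

```python
import math

# Directed graph: vertex -> list of (neighbor, edge_weight)
graph = {"a": [("b", 3), ("c", 5)], "b": [("c", 1)], "c": []}
SOURCE = "a"

dist = {v: math.inf for v in graph}    # per-vertex state
messages = {v: [] for v in graph}      # inbox delivered at the next superstep
messages[SOURCE].append(0)             # superstep 0: only the source is woken up

superstep = 0
while any(messages.values()):          # halt when no messages are in flight
    inbox, messages = messages, {v: [] for v in graph}
    for v in graph:                    # vertex-centric: order is irrelevant
        if not inbox[v]:
            continue                   # vertex stays halted (has voted to halt)
        best = min(inbox[v])
        if best < dist[v]:             # improved distance: update and notify neighbors
            dist[v] = best
            for neighbor, weight in graph[v]:
                messages[neighbor].append(best + weight)  # delivered next superstep
    superstep += 1                     # barrier: all vertices finish before the next round

print(dist)  # {'a': 0, 'b': 3, 'c': 4}
```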
Summary
Although Hadoop sits at the core of today's big data technology, Google has shown us many more advanced approaches. Google did not develop these technologies in order to discard MapReduce immediately, but they undoubtedly point to where big data technology is heading. Open-source implementations of the technologies above have already begun to appear, so we cannot help asking: how much longer can Hadoop's glory last? (Compiled by Zhang Zhiping)