How long will the brilliance of Hadoop last?

Hadoop technology is everywhere. For better or worse, Hadoop has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard. Not only is Hadoop now the standard for enterprise data; its position seems hard to shake for the future as well.

Google File System and MapReduce

Let's start by exploring the soul of Hadoop: MapReduce. Faced with explosive data growth, Google engineers Jeff Dean and Sanjay Ghemawat designed and released two groundbreaking systems: the Google File System (GFS) and Google MapReduce (GMR). The former is an elegant, practical solution for storing and managing data on clusters of commodity hardware; the latter, equally glorious, is a computational framework for large-scale parallel processing.

Google MapReduce (GMR) gives ordinary developers and users a simple way to process big data, and makes that processing fast and fault-tolerant. Together, GFS and GMR also provided the core impetus for Google's search engine to crawl and analyze web pages.
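
To make the programming model concrete, here is a minimal single-process sketch of MapReduce-style word counting. It is an illustration only: the map_fn, reduce_fn, and mapreduce names are ours, and real frameworks such as GMR and Hadoop run these phases in parallel across a cluster, handling shuffling and fault tolerance for you.

    # A minimal, single-process sketch of the MapReduce programming model.
    from collections import defaultdict

    def map_fn(document):
        # Emit (word, 1) for every word in the input document.
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Sum the partial counts for one key.
        return word, sum(counts)

    def mapreduce(documents):
        groups = defaultdict(list)
        for doc in documents:                 # "map" phase
            for key, value in map_fn(doc):
                groups[key].append(value)     # "shuffle": group values by key
        return [reduce_fn(k, v) for k, v in groups.items()]  # "reduce" phase

    print(mapreduce(["big data big deal", "data is big"]))
    # [('big', 3), ('data', 2), ('deal', 1), ('is', 1)]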

Looking at the open-source world, Apache Hadoop's Distributed File System (HDFS) and Hadoop MapReduce are open-source implementations of GFS and GMR, respectively. The Hadoop project has evolved into an ecosystem that touches every aspect of the big data field, but fundamentally its core is still MapReduce.

Can Hadoop overtake Google?

An interesting phenomenon is that MapReduce is no longer prominent at Google. While enterprises are still fixated on MapReduce, Google seems to have moved on to the next era. In fact, the technologies we are talking about here are hardly new, and MapReduce is no exception.

I hope these technologies will become more competitive in the post-Hadoop era. Many Apache community projects and commercial Hadoop products are very active, continually refining the Hadoop stack with technologies such as HBase, Hive, and the next generation of MapReduce (YARN). Still, I believe the Hadoop core (HDFS and ZooKeeper) needs to break away from MapReduce and, with a new architecture, truly compete with Google's technology at a higher level.

Consider filtering a growing index and analyzing an ever-changing dataset. The great thing about Hadoop is that once it starts running, it analyzes your data quickly. However, after every data change (every addition, modification, or deletion) the entire dataset must be streamed through again. This means that as the dataset grows, analysis time grows with it, and unpredictably so.

So how does Google keep its search results looking ever fresher? An incremental processing engine named Percolator replaced Google MapReduce (GMR) for indexing. By processing only new, changed, and deleted documents, and by using secondary indexes for efficient classification and querying, Google dramatically reduced the time needed to achieve its goal.
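
Percolator itself is built as observers that fire on changes to rows in BigTable; as a rough, hypothetical illustration of the idea (not Percolator's actual API), incremental indexing touches only the changed documents and updates a secondary index in place, rather than rebuilding the index from the whole dataset:

    # A rough illustration of incremental indexing (not Percolator's API):
    # only changed documents are reprocessed, and a secondary index is
    # updated in place instead of being rebuilt from scratch.
    index = {}  # word -> set of document ids (the "secondary index")

    def on_document_change(doc_id, old_text, new_text):
        # Observer-style hook: fires per changed document, not per dataset.
        old_words = set(old_text.split()) if old_text else set()
        new_words = set(new_text.split()) if new_text else set()
        for word in old_words - new_words:
            index[word].discard(doc_id)
        for word in new_words - old_words:
            index.setdefault(word, set()).add(doc_id)

    on_document_change("doc1", None, "hadoop big data")
    on_document_change("doc1", "hadoop big data", "hadoop mapreduce")
    print(index)
    # {'hadoop': {'doc1'}, 'big': set(), 'data': set(), 'mapreduce': {'doc1'}}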

Percolator's authors write: "Converting the indexing system to an incremental system ... reduced the average document processing latency by a factor of 100." This means that new content on the Web could be indexed 100 times faster than with the previous MapReduce-based system.

Google Dremel: an instant data analysis solution

Google and the Hadoop community have both worked to build easy-to-use real-time data analysis tools on top of MapReduce, such as Google's parallel processing language Sawzall and Apache Pig and Hive. But for those familiar with SQL, these tools ignore a basic fact: MapReduce was built to manage data processing. Its core competency is workflow management, not real-time data analysis.

In stark contrast, many BI and data analysis queries are inherently interactive and low-latency. Using Hadoop means not only planning workflows but also trimming unnecessary ones for many query analyses. Even so, you wait minutes for a job to start and then hours for the workflow to complete, which is very detrimental to the interactive experience. Google therefore developed Dremel to deal with this. Dremel is Google's interactive data analysis system; it handles petabyte-scale data in seconds and responds easily to ad-hoc queries.

Google Dremel's design features:

Dremel is a scalable, very large system. Shortening a task over a petabyte-scale dataset to seconds undoubtedly requires massive concurrency: a disk's sequential read speed is around 100 MB/s, so processing 1 TB of data within one second requires at least 10,000 disks reading concurrently! Google has always been good at using cheap machines to do big things, but the more machines, the higher the probability of failure. A cluster of this size needs enough fault tolerance to ensure that overall analysis speed is not dragged down by individual nodes.
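
The back-of-the-envelope arithmetic behind that figure:

    # Disks needed to scan 1 TB in 1 second at ~100 MB/s per disk.
    data_bytes = 1 * 10**12          # 1 TB
    disk_speed = 100 * 10**6         # 100 MB/s sequential read per disk
    seconds = 1
    disks = data_bytes / (disk_speed * seconds)
    print(disks)  # 10000.0 -- hence "at least 10,000 disks reading concurrently"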

Dremel is a complement to MapReduce. Like MapReduce, Dremel needs a file system such as GFS as its storage layer. Dremel was never designed to replace MapReduce; it only performs very fast analysis, and it is often used to process MapReduce result sets or to build analysis prototypes.

The Dremel data model is nested. Internet data is often non-relational, so Dremel requires a flexible data model, and this is critical. Dremel supports a nested data model, similar to JSON. The traditional relational model, with its inevitable profusion of join operations, is often powerless when processing data at this scale.
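
For example, a web-document record in such a nested model might look like the following (a hypothetical record, written as a Python literal):

    # A hypothetical web-document record in a nested, JSON-like model.
    # Repeated and optional fields nest naturally; a relational schema
    # would need several joined tables to represent the same record.
    document = {
        "doc_id": 10,
        "name": {
            "url": "http://example.com/A",
            "language": [{"code": "en-us", "country": "us"}, {"code": "en"}],
        },
        "links": {"forward": [20, 40, 60], "backward": []},
    }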

Data in Dremel is stored in columns. With column storage, a query scans only the fields it actually needs, reducing both CPU work and disk traffic. Column storage is also compression-friendly: compression trades a little CPU for much less disk I/O, maximizing overall performance.
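
A toy comparison makes the benefit concrete: a row layout must touch every field of every record, while a column layout reads only the one array the query needs. This is a simplified sketch; Dremel's actual columnar format also stores repetition and definition levels to encode nested data.

    # Simplified sketch of row vs. column layout for "sum of bytes_served".
    rows = [
        {"url": "http://a", "status": 200, "bytes_served": 512},
        {"url": "http://b", "status": 404, "bytes_served": 0},
        {"url": "http://c", "status": 200, "bytes_served": 2048},
    ]

    # Column layout: one contiguous array per field.
    columns = {
        "url": [r["url"] for r in rows],
        "status": [r["status"] for r in rows],
        "bytes_served": [r["bytes_served"] for r in rows],
    }

    print(sum(r["bytes_served"] for r in rows))  # row layout: reads whole records
    print(sum(columns["bytes_served"]))          # column layout: reads one array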

Dremel combines web search and parallel DBMS technology. From web search, Dremel borrows the concept of the "query tree": a large, complex query is divided into smaller and simpler queries that can run concurrently on a large number of nodes. And like a parallel DBMS, Dremel provides a SQL-like interface, just as Hive and Pig do.
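
A sketch of the query-tree idea (hypothetical code, not Dremel's protocol): the root rewrites an aggregate query into partial queries, fans them out to leaf servers that each scan one shard, and merges the partial results.

    # Hypothetical sketch of a query tree: leaves answer simple partial
    # queries over their own shards; the root merges the partial results.
    shards = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]  # data partitioned across leaves

    def leaf_query(shard):
        # Each leaf computes the partial aggregate over its shard.
        return sum(shard)

    def root_query(shards):
        # SUM composes as a sum of sums, so the root just merges.
        return sum(leaf_query(s) for s in shards)

    print(root_query(shards))  # 36, as if SELECT SUM(x) ran over all the data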

Google's graph computing framework: Pregel

Google MapReduce was designed specifically to crawl and analyze the world's largest graph, the Internet itself, but for typical large-scale graph algorithms such as graph traversal (BFS), PageRank, and single-source shortest paths (SSSP), its computation proves inefficient. So Google built Pregel.

Pregel is very impressive. Not only can it execute SSSP or PageRank efficiently; even more surprisingly, published figures show that Pregel can process a graph with billions of nodes and tens of billions of edges in minutes, with execution time growing linearly in the size of the graph.

Pregel is based on the BSP (Bulk Synchronous Parallel) model, a "compute - communicate - synchronize" pattern:

The input and output are directed graphs

Computation is divided into supersteps

Computation is vertex-centric: each vertex executes its own task, and the order in which vertices execute is indeterminate

Between two supersteps lies the communication phase

In Pregel, the vertex is the center of computation. At superstep 0 every vertex is active; a vertex can "vote to halt" and enter the inactive state, and it is reactivated when it receives a message. The whole algorithm ends when there are no active vertices and no messages in flight. Fault tolerance is handled through checkpoints: at the beginning of each superstep, the master and worker nodes each back up their state.
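
A compact, illustrative sketch of this vertex-centric loop (our toy code, not Google's): each vertex propagates the largest value it has seen, votes to halt when its value stops changing, and wakes only when a message arrives; the run ends when no vertices are active and no messages remain.

    # Illustrative Pregel-style BSP loop: propagate the maximum value.
    graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # directed edges
    value = {"a": 3, "b": 6, "c": 2}                    # initial vertex values

    messages = {v: [] for v in graph}
    active = set(graph)                                  # superstep 0: all active
    while active or any(messages.values()):
        outbox = {v: [] for v in graph}
        # A vertex runs if it is active or has incoming messages.
        for v in active | {v for v in graph if messages[v]}:
            new_val = max([value[v]] + messages[v])      # compute phase
            if new_val > value[v] or v in active:
                value[v] = new_val
                for neighbor in graph[v]:                # communicate phase
                    outbox[neighbor].append(new_val)
        active = set()        # every vertex votes to halt until a message arrives
        messages = outbox     # synchronize: barrier before the next superstep

    print(value)  # {'a': 6, 'b': 6, 'c': 6}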

Summary

While Hadoop is still at the heart of today's big data technology, Google has already shown us many more advanced approaches. Google did not develop these technologies in order to abandon MapReduce immediately, but they undoubtedly point to where big data technology is heading. Even as these technologies find their way into open source, we can't help but wonder how long the brilliance of Hadoop will last.
