Cloudera CTO: MapReduce will be replaced; investment will shift to Spark and other frameworks

Source: Internet
Author: User

Over the past two years, the Hadoop community has made many improvements to MapReduce, but the key ones have been only at the code level. Spark, a substitute for MapReduce, has developed very quickly: it has more than 100 contributors from 25 countries, its community is very active, and it may replace MapReduce in the future.

MapReduce's high latency has become a bottleneck for Hadoop's development, and finding a higher-performance alternative to the current MapReduce has become a consensus in the Hadoop community.

MapReduce

The MapReduce framework traces back to Google, which combined it with flexible, scalable storage to handle a wide variety of data processing and analysis tasks. Doug Cutting and Mike Cafarella adopted the same architecture when they co-founded Apache Hadoop in 2005.

Related projects such as Apache Pig and Apache Hive translate specialized queries into jobs that run on the general-purpose MapReduce framework. They inherit MapReduce's scalability, fault tolerance, and good throughput, but also its poor latency; Hive in particular suffers, as the delays make it unable to serve interactive applications.

Complaints about MapReduce have dampened enterprise data centers' enthusiasm for Hadoop projects: its latency is too high, and its batch-mode responses struggle to serve the many applications that need to analyze data interactively.

The Hadoop ecosystem needs a system that is more powerful, more flexible, and closer to real time than MapReduce.

Spark

Today the leading replacement for MapReduce is Apache Spark. Like MapReduce, it is a general-purpose engine, but Spark was designed to run more kinds of workloads, and to run them faster.

The original MapReduce executes jobs in a simple but rigid structure: process or transform the data (map), synchronize across the cluster (shuffle), and consolidate the results from all nodes (reduce). You have to recast your problem as a series of MapReduce jobs and then execute them sequentially, at high latency: no job can begin until the previous one has finished, which makes complex, multi-stage applications painful to run.
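To make that rigidity concrete, here is a toy word count written as the three fixed stages in plain Python (the sample data is hypothetical, and this is only a sketch of the model, not Hadoop code, which is written against the Java MapReduce API):

```python
from collections import defaultdict

# Toy illustration: every MapReduce job, however complex,
# must be expressed as map -> shuffle -> reduce.
documents = ["spark replaces mapreduce", "mapreduce is batch oriented"]

# Map: turn each input record into (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values by key, as the cluster-wide synchronization would.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: consolidate each key's values into a final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'spark': 1, 'replaces': 1, 'mapreduce': 2, ...}
```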

An alternative is to let the developer describe a complex, multi-step directed acyclic graph (DAG) of tasks and execute the whole graph at once, rather than one job at a time in strict sequence. This avoids MapReduce's troublesome synchronization and makes applications simpler to build. Microsoft was an early mover in DAG engine research with Dryad, which it has used internally for Bing search and other hosted services.
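For illustration, here is a minimal PySpark sketch of a multi-step DAG (the input path access.log is hypothetical). Each transformation only extends the graph; a single action then triggers the whole DAG as one planned execution:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "dag-sketch")

# Transformations are lazy: each call below only extends the DAG.
lines  = sc.textFile("access.log")            # hypothetical input
errors = lines.filter(lambda l: "ERROR" in l)
pairs  = errors.map(lambda l: (l.split()[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# One action executes the entire multi-stage graph at once,
# with no job-to-job synchronization for the developer to manage.
print(counts.take(10))
sc.stop()
```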

Spark incorporates these ideas and adds some important innovations. For example, it supports in-memory data sharing across DAGs, so that different jobs can process the same data at very high speed. Spark even supports cyclic data flows, which makes it far better at iterative graph algorithms (common in social network analysis), machine learning, and stream processing, workloads that are hard to handle with MapReduce or other DAG engines.
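A minimal sketch of the in-memory advantage, assuming PySpark and a made-up toy dataset: cache() keeps the working set in cluster RAM, so each pass of an iterative algorithm reads it at memory speed instead of re-loading it from disk as a chain of MapReduce jobs would:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-sketch")

# cache() pins the dataset in memory across iterations.
points = sc.parallelize([(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]).cache()

w = 0.0  # one-parameter model fitted by plain gradient descent
for _ in range(10):
    # Bind the current w as a default argument so it ships to executors.
    grad = points.map(lambda p, w=w: (w * p[0] - p[1]) * p[0]).mean()
    w -= 0.1 * grad

print("fitted slope:", w)  # converges toward roughly 2.0
sc.stop()
```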

Spark includes many advanced features, such as stream processing, fast failure recovery, language-integrated APIs, optimized scheduling, and optimized data transfer. Its use of memory is the most striking: where MapReduce must constantly read and write data on disk, Spark exploits the large amount of RAM scattered across all the nodes of a cluster, and it falls back on disk intelligently to handle overflow data and persistence. This gives Spark a huge performance advantage on many workloads.
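As a sketch of that spill behavior in PySpark (big-dataset.txt is a hypothetical input file), the MEMORY_AND_DISK storage level keeps partitions in RAM while they fit and writes the overflow to local disk:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-sketch")

big = sc.textFile("big-dataset.txt")   # hypothetical input file

# Keep partitions in RAM while they fit; spill the rest to local disk.
big.persist(StorageLevel.MEMORY_AND_DISK)

print(big.count())  # first action computes and persists the partitions
print(big.count())  # later actions reuse the persisted copies
sc.stop()
```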

Why not improve MapReduce instead of replacing it?

Over the past two years, the Hadoop community has made many improvements to MapReduce, but most of them are patches at the code level, accumulating what software developers call "technical debt": improvements layered onto the original code base solve only the problem of the moment. In this sense, MapReduce is already deep in debt.

Creating an entirely new code base, free of technical debt and designed for current and foreseeable workloads, is comparatively simple and less risky. The question to consider is: do we really need to create a whole new project?

As a substitute for MapReduce, Spark has already matured: with more than 100 contributors from 25 countries and a very active community, there is virtually no need to create a new project.

In the long run, we expect to reduce investment in MapReduce and correspondingly increase investment in newer frameworks such as Impala and Spark; naturally, the workloads running on the platform will gradually shift to these new frameworks. Google has already started moving workloads from MapReduce to Pregel and Dremel, while Facebook has shifted workloads to Presto.

Original link: http://www.csdn.net/article/2014-05-05/2819604-BigData-MapReduce-Spark
