Different Swiss Army Knives: Spark vs. MapReduce


This article was translated for Jobbole Online by Guyue Yuyan and proofread by Gu Shing Bamboo. Please do not reprint without permission.

Source: http://blog.jobbole.com/97150/

Apache Spark has reignited the big data conversation. Promising speeds up to 100 times faster than Hadoop MapReduce, along with a more flexible and convenient API, some believe it may herald the end of Hadoop MapReduce.

How does Spark, an open-source data processing framework, crunch data so quickly? The secret is that it runs in memory on the cluster and is not tied to MapReduce's two-stage paradigm. That makes repeated access to the same data much faster.
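
As a minimal sketch of that in-memory reuse (assuming a local PySpark installation; the file path is hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

# Load a (hypothetical) log file and keep it in cluster memory.
logs = sc.textFile("hdfs:///data/app.log").cache()

# Every pass after the first is served from memory, not from disk.
errors = logs.filter(lambda line: "ERROR" in line).count()
warnings = logs.filter(lambda line: "WARN" in line).count()
print(errors, warnings)
```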

Spark can run standalone, or on Hadoop YARN (the improved framework in the second generation of Hadoop that separates resource management from the processing engine, freeing YARN-based applications from MapReduce's constraints), and it can read data directly from HDFS (the Hadoop Distributed File System). Companies such as Yahoo, Intel, Baidu, Trend Micro, and Groupon are already using it.
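
For illustration, a sketch of running against YARN and reading from HDFS (cluster address and path are assumptions, and the script must be launched from a node whose HADOOP_CONF_DIR points at the cluster configuration):

```python
from pyspark import SparkConf, SparkContext

# "yarn" is accepted as a master URL on recent Spark versions;
# older releases used "yarn-client" or "yarn-cluster".
conf = SparkConf().setAppName("hdfs-read-demo").setMaster("yarn")
sc = SparkContext(conf=conf)

# Read a (hypothetical) dataset directly from HDFS.
rdd = sc.textFile("hdfs://namenode:8020/data/events.csv")
print(rdd.count())
```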

It sounds as if Spark is destined to replace Hadoop MapReduce. But is that really the case? In this article we compare the two platforms to see whether Spark truly outmanoeuvres Hadoop MapReduce.

Performance

Spark processes data in memory, while Hadoop MapReduce processes data on disk through its map and reduce operations. From that point of view, Spark should outperform Hadoop MapReduce.

However, because it processes data in memory, Spark needs a lot of it. Much like a standard database system, Spark loads data into memory and keeps it cached there for subsequent steps. If Spark runs on Hadoop YARN alongside other resource-hungry services, or if the data blocks are too large to fit entirely in memory, its performance degrades significantly.
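
One common mitigation, sketched below, is to let Spark spill to local disk when a dataset does not fit in memory (the storage level chosen here is one option among several; the path is made up):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-demo")

big = sc.textFile("hdfs:///data/big-dataset")  # hypothetical path

# MEMORY_AND_DISK keeps whatever fits in memory and spills the rest
# to local disk instead of failing or recomputing from scratch.
big.persist(StorageLevel.MEMORY_AND_DISK)
print(big.count())
```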

In contrast, MapReduce tears its processes down as soon as a job completes, so it can easily run alongside other services with little performance impact.

Spark has the advantage for iterative computations that need to pass over the same data many times. But for one-pass jobs such as data transformation and data integration, in the style of ETL (extract, transform, load), MapReduce is the right choice, because that is exactly what it was built for.
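
A toy sketch of the iterative pattern where caching pays off (the computation itself is made up purely for illustration):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# Cache once; every iteration below re-reads this data.
points = sc.parallelize([1.0, 4.0, 9.0, 16.0]).cache()

guess = 1.0
for _ in range(10):
    # Each pass over `points` is served from memory after the first.
    error = points.map(lambda x, g=guess: x - g).mean()
    guess += 0.5 * error

print("converged estimate:", guess)
```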

Summary: Spark performs better when the data fits in memory, especially on dedicated clusters; Hadoop MapReduce suits data that cannot be read entirely into memory, and it coexists better with other services.

Ease of Use

Spark has flexible and convenient APIs for Java, Scala, and Python, and it also offers Spark SQL (formerly known as Shark) for technical staff who already know SQL. Thanks to the easy-to-use building blocks Spark provides, writing custom functions is straightforward. It even includes an interactive shell that gives instant feedback.
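
A brief Spark SQL sketch (using the modern SparkSession entry point; the table and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)], ["name", "age"]
)
df.createOrReplaceTempView("people")

# Anyone who knows SQL can query the data directly.
spark.sql("SELECT name FROM people WHERE age > 30").show()
```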

Hadoop MapReduce is written in Java and is notorious for being hard to program. Pig simplifies the process somewhat (although learning its syntax takes time), and Hive adds SQL compatibility on top of the platform. Some Hadoop tools can even run MapReduce jobs without any programming; Xplenty, for example, is a Hadoop-based data integration service that requires no programming or deployment.
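
MapReduce jobs do not have to be written in Java, either: Hadoop Streaming accepts scripts. A minimal word-count sketch in Python (file names are arbitrary, and the job would be launched with the hadoop-streaming jar):

```python
#!/usr/bin/env python3
# mapper.py: emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: sum counts per word; Hadoop delivers keys sorted,
# so all lines for one word arrive together.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```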

Although Hive offers a command-line interface, MapReduce itself has no interactive mode. Projects such as Impala, Presto, and Tez are working to bring fully interactive querying to Hadoop.

When it comes to installation and maintenance, Spark is not tied to Hadoop, although both Spark and Hadoop MapReduce ship in the distributions from Hortonworks (HDP 2.2) and Cloudera (CDH 5). (Note: Cloudera, Hortonworks, and MapR are the three best-known startups in the Hadoop world, all dedicated to building better enterprise Hadoop offerings.)

Summary: Spark is easier to program and includes an interactive mode; Hadoop MapReduce is harder to program, but many existing tools make it easier to use.

Cost

Both Spark and Hadoop MapReduce are open source, but the cost of machines and labor is still unavoidable.

Both frameworks can run on commodity servers or in the cloud, and their hardware requirements are similar:

The memory in a Spark cluster should be at least as large as the data being processed, because Spark performs at its best only when the data fits in memory. So if you genuinely need to process very large data, Hadoop is the right choice, since disk space costs far less than memory.
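
Sizing is exposed through configuration; a sketch of the relevant knobs (the values here are arbitrary examples, not recommendations):

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("sizing-demo")
    # Per-executor heap; should be sized against your data blocks.
    .set("spark.executor.memory", "8g")
    # Fraction of heap shared by execution and storage (Spark >= 1.6).
    .set("spark.memory.fraction", "0.6")
)
sc = SparkContext(conf=conf)
```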

Given Spark's performance, it needs less hardware to finish the same job faster, so it should be more cost-effective, especially in the cloud, where you pay only for what you use.

On the staffing side, even though Hadoop has been around since 2005, MapReduce experts are still in short supply. What does that mean for Spark, which has only been around since 2010? The number of people learning Spark may be growing fast, but the talent gap relative to Hadoop MapReduce remains large.

Further, the abundance of Hadoop-as-a-service offerings and Hadoop-based services (such as our own Xplenty data integration service) reduces the need for in-house expertise and knowledge of the underlying hardware. By contrast, Spark services are virtually nonexistent, and the few that do exist are new.

Summary: Benchmark for benchmark, Spark is more cost-effective, although staffing can cost more. Hadoop MapReduce can come out cheaper, thanks to the larger pool of skilled technicians and the supply of Hadoop-as-a-service offerings.

Compatibility

Spark can run standalone, on Hadoop YARN, on Apache Mesos, or in the cloud. It supports any data source that implements the Hadoop InputFormat interface, so it can work with all the data sources and file formats Hadoop supports. According to Spark's official documentation, it also works with BI (business intelligence) tools through JDBC and ODBC. Hive and Pig are gradually gaining similar capabilities.
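
A sketch of that compatibility: reading a Hadoop SequenceFile from PySpark (the path and key/value types are assumptions):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "compat-demo")

# Any Hadoop InputFormat works; SequenceFiles are one common case.
pairs = sc.sequenceFile(
    "hdfs:///data/pairs.seq",  # hypothetical path
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.IntWritable",
)
print(pairs.take(5))
```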

Summary: Spark and Hadoop MapReduce have the same data type and data source compatibility.

Data Processing

Spark can do much more than ordinary data processing: it can also process graphs, and it ships with its own machine learning library. Its performance lets it handle real-time processing as well as batch processing. This offers a tempting opportunity to solve every problem on one platform instead of picking a different platform per task, since every platform has to be learned and maintained.
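
As a taste of those extras, a minimal MLlib sketch (the RDD-based KMeans API; the data points are made up):

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "mllib-demo")

# A tiny, made-up 2-D dataset with two obvious clusters.
data = sc.parallelize(
    [[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]]
)

model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)
```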

Hadoop MapReduce excels at batch processing. If you need real-time processing, you can use another platform such as Storm or Impala, and for graph processing you can use Giraph. Mahout used to be the machine learning option on MapReduce, but its maintainers have since turned away from MapReduce toward Spark and H2O (a machine learning engine).

Summary: Spark is the Swiss Army knife of data processing; Hadoop MapReduce is the commando knife of batch processing.

Fault Tolerance

Like MapReduce, Spark retries failed tasks and supports speculative execution. However, because MapReduce relies on disk, if a process crashes midway it can resume from the point of failure, whereas Spark must start the processing over from the beginning, so MapReduce can save time here.
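
The retry and speculation behavior is configurable on the Spark side; a sketch (defaults vary by version, so treat these values as examples):

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("fault-tolerance-demo")
    # How many times a task may fail before the job is aborted.
    .set("spark.task.maxFailures", "4")
    # Re-launch straggling tasks speculatively on other nodes.
    .set("spark.speculation", "true")
)
sc = SparkContext(conf=conf)
```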

Summary: Both Spark and Hadoop MapReduce have good fault tolerance, but Hadoop MapReduce comes out slightly ahead.

Security

On security, Spark is still a little thin. Authentication is supported via a shared secret (see the sketch at the end of this section), the web UI is secured through javax servlet filters, and event logging is available. Because Spark can run on YARN and work with HDFS, it can also take advantage of Kerberos authentication, HDFS file permissions, and encryption between nodes.

Hadoop MapReduce, by contrast, enjoys every security mechanism Hadoop supports, and it integrates with other Hadoop-based security projects such as Knox Gateway and Sentry. Project Rhino, which aims to improve Hadoop's security, supports Spark only insofar as it has added Sentry support. Otherwise, Spark developers must harden security on their own.

Summary: Spark's security is still in the development phase; Hadoop MapReduce has more security mechanisms and projects behind it.
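
A minimal sketch of enabling that shared-secret authentication (the secret value is obviously a placeholder):

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("auth-demo")
    # Require all components to present the shared secret.
    .set("spark.authenticate", "true")
    # Placeholder; on YARN, Spark generates the secret automatically.
    .set("spark.authenticate.secret", "change-me")
)
sc = SparkContext(conf=conf)
```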

Conclusion

Spark is the rising star of the big data world, but Hadoop MapReduce still enjoys a wide range of applications.

In-memory data processing makes Spark both faster and more cost-effective. It is compatible with all of Hadoop's data sources and file formats, and its easy-to-use APIs in multiple languages make it quicker to get started with. Spark even includes graph processing and machine learning tools.

Hadoop MapReduce is a more mature platform, built for batch processing. It can be more cost-effective than Spark when the data is truly enormous and cannot fit in memory, or when a team already has plenty of engineers experienced with the platform. And the ecosystem around Hadoop MapReduce keeps growing stronger, with more supporting projects, tools, and cloud services.

But even if Spark looks like the ultimate winner, the catch is that we never use it alone: we need HDFS to store the data, and we may want to use HBase, Hive, Pig, Impala, or other Hadoop projects. This means that for truly big data, Spark still needs to run alongside Hadoop and MapReduce.


