Connections and Differences between Hadoop and Spark

Last Update:2020-05-25 Source: Internet

Author: User

When it comes to big data, almost everyone has heard of Hadoop and Apache Spark. Often, though, our understanding stops at the names and does not go any deeper. Let's take a look at the differences and similarities between them.

They solve problems at different levels

First, both Hadoop and Apache Spark are big data frameworks, but they serve different purposes. Hadoop is essentially a distributed data infrastructure: it spreads huge data sets across multiple nodes in a cluster of ordinary computers for storage, which means you do not need to purchase and maintain expensive server hardware.

At the same time, Hadoop indexes and tracks this data, which makes big data processing and analysis far more efficient than it used to be. Spark, by contrast, is a tool for processing big data that already lives in distributed storage. It does not provide distributed storage itself.
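To make the storage idea concrete, here is a minimal sketch in plain Python of splitting a data set into blocks, placing copies of each block on several nodes, and keeping an index of where every block lives. The function names, node names, tiny block size, and round-robin placement are all assumptions made for illustration. Real HDFS uses much larger blocks and a more careful, rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Toy round-robin placement: each block is copied to `replication` nodes.

    The returned dict plays the role of the index that tracks where
    every block is stored.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"abcdefghij"                      # a tiny stand-in for a huge data set
blocks = split_into_blocks(data, 4)       # three blocks: 4 + 4 + 2 bytes
nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical node names
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block has several holders, losing one ordinary machine does not lose any data, which is why inexpensive hardware is good enough.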

The two can be combined

In addition to HDFS, its widely adopted distributed storage layer, Hadoop also provides a data processing component called MapReduce. So we can leave Spark out entirely and use Hadoop's own MapReduce to do the data processing.

Conversely, Spark does not depend on Hadoop to survive. But as mentioned above, it does not provide a distributed file system of its own, so to work across a cluster it must be integrated with one. We can choose Hadoop's HDFS or another cloud-based data platform. Even so, Spark is still most often run on Hadoop, because the two are widely regarded as the best combination.

The following concise explanation of MapReduce is widely quoted on the Internet:

We want to count all the books in the library. You count bookshelf 1, and I count bookshelf 2. This is "Map". The more people we have, the faster we can count the books.

Now we get together and add up everyone's counts. This is "Reduce".
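The library analogy can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop's MapReduce API: the shelf contents and the function names count_shelf and add_counts are made up for the example, and everything runs on a single machine.

```python
from functools import reduce

# Hypothetical library: each shelf is a list of book titles.
shelves = {
    "shelf_1": ["Dune", "Emma", "Ulysses"],
    "shelf_2": ["Hamlet", "Walden"],
    "shelf_3": ["Beloved", "Ivanhoe", "Kim", "Utopia"],
}


def count_shelf(books):
    """Map step: one person counts the books on one shelf."""
    return len(books)


def add_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


# Map: every shelf is counted independently of the others.
# On a real cluster these counts would run in parallel on different nodes.
partial_counts = list(map(count_shelf, shelves.values()))

# Reduce: the partial counts are added together into a single total.
total_books = reduce(add_counts, partial_counts, 0)

print(partial_counts)
print(total_books)
```

Since each shelf is counted without looking at any other shelf, adding more people (or nodes) speeds up the Map step, exactly as the analogy says.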

Spark processes data much faster than MapReduce

Spark is much faster than MapReduce because it processes data differently. MapReduce processes the data step by step: "Read data from the cluster, perform an operation, write the result to the cluster, read the updated data from the cluster, perform the next operation, write the result to the cluster, and so on," explained Kirk Borne, a data scientist at Booz Allen Hamilton.

Spark, in contrast, completes all of the data analysis in memory in near "real time": "Read data from the cluster, perform all of the necessary analysis, write the results back to the cluster, and done," Borne said. Spark's batch processing can be nearly 10 times faster than MapReduce, and its in-memory data analysis nearly 100 times faster.
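The two access patterns Borne describes can be compared with a small simulation in plain Python. ToyCluster, run_step_by_step, and run_in_memory are hypothetical names invented for this sketch. It only counts how often storage is touched and makes no claim about real timings in Hadoop or Spark.

```python
class ToyCluster:
    """Hypothetical stand-in for cluster storage that counts reads and writes."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.data)

    def write(self, data):
        self.writes += 1
        self.data = list(data)


def run_step_by_step(cluster, steps):
    """MapReduce-style: every step reads from storage and writes back to it."""
    for step in steps:
        data = cluster.read()
        cluster.write([step(x) for x in data])
    return cluster.data


def run_in_memory(cluster, steps):
    """Spark-style: read once, chain all steps in memory, write once."""
    data = cluster.read()
    for step in steps:
        data = [step(x) for x in data]
    cluster.write(data)
    return cluster.data


# Three processing steps applied one after another to every record.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

disk_bound = ToyCluster([1, 2, 3])
in_memory = ToyCluster([1, 2, 3])

result_a = run_step_by_step(disk_bound, steps)
result_b = run_in_memory(in_memory, steps)

print(result_a, disk_bound.reads, disk_bound.writes)
print(result_b, in_memory.reads, in_memory.writes)
```

Both runs produce the same result, but the step-by-step version touches storage once per step in each direction, while the in-memory version touches it once in total. The gap grows with the number of steps, which is why workloads that make many passes over the data benefit the most.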

If the data you process and the results you need are mostly static, and you have the patience to wait for batch jobs to finish, MapReduce's approach is perfectly acceptable.

But if you need to analyze streaming data, such as data collected by sensors on a factory floor, or your application needs multiple passes over the data, then you should probably use Spark.

Most machine learning algorithms require multiple passes over the data. Other common application scenarios for Spark include real-time marketing campaigns, online product recommendations, network security analysis, and machine log monitoring.

Disaster recovery

The two take very different approaches to disaster recovery, but both work well. Because Hadoop writes the data to disk after every processing step, it is inherently resilient to system failures.

Spark's data objects are stored in what is called a resilient distributed dataset (RDD). "These data objects can be placed in memory or on disk, so RDD can also provide complete disaster recovery," Borne pointed out.
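One reason an RDD can recover is that Spark records the chain of transformations that produced it (its lineage), so a lost partition can be recomputed from the original input. The plain-Python sketch below illustrates that idea only. ToyDataset and its methods are hypothetical and far simpler than Spark's actual RDD implementation.

```python
class ToyDataset:
    """Hypothetical, heavily simplified stand-in for an RDD."""

    def __init__(self, source_partitions, transforms=()):
        self.source_partitions = source_partitions  # durable input data
        self.transforms = tuple(transforms)         # lineage: how to rebuild
        self.cache = {}                             # computed partitions in memory

    def map(self, fn):
        """Return a new dataset whose lineage has one more transformation."""
        return ToyDataset(self.source_partitions, self.transforms + (fn,))

    def compute_partition(self, index):
        """Return a partition, rebuilding it from the lineage if it is missing."""
        if index not in self.cache:
            rows = self.source_partitions[index]
            for fn in self.transforms:
                rows = [fn(x) for x in rows]
            self.cache[index] = rows
        return self.cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that wipes one in-memory partition."""
        self.cache.pop(index, None)

    def collect(self):
        """Gather all partitions into one list."""
        return [
            row
            for i in range(len(self.source_partitions))
            for row in self.compute_partition(i)
        ]


dataset = ToyDataset([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)

before = dataset.collect()
dataset.lose_partition(1)                    # partition 1 is gone from memory
lost = 1 not in dataset.cache
after = dataset.collect()                    # partition 1 is rebuilt from lineage

print(before, lost, after)
```

Only the lost partition is recomputed. The partition that survived is served straight from memory.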

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
