Connections and Differences between Hadoop and Spark
Source: Internet
Author: User
Keywords: hadoop, spark, difference
When it comes to big data, most people have heard of both Hadoop and Apache Spark. Often, though, our understanding stops at the names, and we have not looked closely at how the two relate. Let's take a look at their differences and similarities.
Different levels of problem solving
First, both Hadoop and Apache Spark are big data frameworks, but they serve different purposes. Hadoop is essentially a distributed data infrastructure: it spreads huge data sets across multiple nodes in a cluster of commodity computers for storage, which means you do not need to purchase and maintain expensive specialized server hardware.
At the same time, Hadoop indexes and tracks this data, which makes big data processing and analysis far more efficient than before. Spark, by contrast, is a tool for processing big data that already sits in distributed storage; it does not provide distributed storage itself.
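To make the storage idea concrete, here is a minimal Python sketch, not Hadoop code. It cuts a data set into fixed-size blocks and records which nodes hold a copy of each block. The function names, node names, block size, and round-robin placement are all assumptions made for illustration; real HDFS uses much larger blocks and a more sophisticated, rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    """Assign every block to `replication` distinct nodes, round-robin.

    This is a deliberately simplified placement rule (an assumption for
    illustration), not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical cluster of four ordinary machines.
nodes = ["node-a", "node-b", "node-c", "node-d"]

data = b"x" * 1000                        # stand-in for a large data set
blocks = split_into_blocks(data, block_size=256)
placement = place_blocks(len(blocks), nodes, replication=3)

# The placement table plays the role of the "index" that tracks the data.
for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

Because every block lives on several nodes, losing one machine does not lose the data, which is why ordinary hardware is good enough.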
The two can be combined
In addition to the HDFS distributed storage layer that everyone knows, Hadoop also provides a data processing component called MapReduce. So we can set Spark aside entirely and use Hadoop's own MapReduce to process the data.
Conversely, Spark does not depend on Hadoop. But as mentioned above, it does not provide a file management system of its own, so it must be integrated with a distributed file system to operate. We can choose Hadoop's HDFS or another cloud-based data platform. In practice, Spark is still most often run on top of Hadoop, since the two are widely regarded as the best combination.
The following is a concise and clear explanation of MapReduce, excerpted from the Internet:
We want to count all the books in the library. You count bookshelf 1, and I count bookshelf 2. This is "Map". The more people we have, the faster we can count the books.
Now we get together and add up all the individual counts. This is "Reduce".
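The bookshelf analogy can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce steps, not Hadoop's MapReduce API; the shelf contents and the `count_shelf` helper are made up for the example, and a thread pool stands in for the people counting in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical library: each shelf is handled by one "person" (worker).
shelves = {
    "shelf-1": ["Dune", "Emma", "Ulysses"],
    "shelf-2": ["Hamlet", "Walden"],
    "shelf-3": ["Beloved", "Ivanhoe", "Kim", "Persuasion"],
}


def count_shelf(books):
    """Map step: one worker counts the books on a single shelf."""
    return len(books)


# Map: every shelf is counted independently, so the work can run in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_shelf, shelves.values()))

# Reduce: combine the per-shelf counts into one total.
total_books = reduce(lambda a, b: a + b, partial_counts, 0)

print(partial_counts)  # [3, 2, 4]
print(total_books)     # 9
```

The map step never needs to know about other shelves, which is what lets more workers finish the job faster.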
Spark processes data far faster than MapReduce
Spark is much faster than MapReduce because it processes data differently. MapReduce works step by step: "Read data from the cluster, perform an operation, write the results to the cluster, read the updated data from the cluster, perform the next operation, write the results to the cluster, and so on," explained Kirk Borne, a data scientist at Booz Allen Hamilton.
Spark, by contrast, completes all of the data analysis in memory in near "real time": "Read data from the cluster, perform all of the necessary analysis, write the results back to the cluster, done," Borne said. Spark's batch processing can be nearly 10 times faster than MapReduce, and its in-memory analysis nearly 100 times faster.
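The two processing styles Borne describes can be compared with a small Python sketch. It is a toy model under stated assumptions, not a benchmark: local JSON files stand in for cluster storage, the three stages (`double`, `add_one`, `square`) are invented for the example, and the only thing measured is how many times each style touches storage. It does not reproduce the 10x or 100x figures.

```python
import json
import os
import tempfile


def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


def square(values):
    return [v * v for v in values]


STAGES = [double, add_one, square]


def run_step_by_step(input_path, stages, workdir):
    """MapReduce-style: read from storage and write back around every stage."""
    io_ops = 0
    current_path = input_path
    result = None
    for i, stage in enumerate(stages):
        with open(current_path) as f:
            values = json.load(f)
        io_ops += 1                       # one read per stage
        result = stage(values)
        current_path = os.path.join(workdir, f"stage_{i}.json")
        with open(current_path, "w") as f:
            json.dump(result, f)
        io_ops += 1                       # one write per stage
    return result, io_ops


def run_in_memory(input_path, stages, output_path):
    """Spark-style: read once, chain every stage in memory, write once."""
    with open(input_path) as f:
        values = json.load(f)
    for stage in stages:
        values = stage(values)
    with open(output_path, "w") as f:
        json.dump(values, f)
    return values, 2                      # a single read plus a single write


with tempfile.TemporaryDirectory() as workdir:
    input_path = os.path.join(workdir, "input.json")
    with open(input_path, "w") as f:
        json.dump([1, 2, 3], f)

    slow_result, slow_io = run_step_by_step(input_path, STAGES, workdir)
    fast_result, fast_io = run_in_memory(
        input_path, STAGES, os.path.join(workdir, "final.json")
    )

print(slow_result, slow_io)  # [9, 25, 49] 6
print(fast_result, fast_io)  # [9, 25, 49] 2
```

Both runs produce the same answer; the step-by-step run simply pays for a storage round trip at every stage, and that cost grows with the number of stages.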
If the data and the reporting requirements are mostly static, and you can afford to wait for batch processing to finish, MapReduce's approach is perfectly acceptable.
But if you need to analyze streaming data, such as readings collected from sensors on a factory floor, or your application requires multiple operations on the same data, then you should probably use Spark.
Most machine learning algorithms require multiple passes over the data. Beyond that, common Spark use cases include real-time marketing campaigns, online product recommendations, network security analysis, and machine log monitoring.
Disaster recovery
The two take very different approaches to disaster recovery, but both work well. Because Hadoop writes data to disk after every operation, it is inherently resilient to system faults.
Spark stores its data objects in resilient distributed datasets (RDDs). "These data objects can be kept in memory or on disk, so RDDs can also provide full disaster recovery," Borne pointed out.
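One way an RDD can recover is by remembering the chain of transformations that produced it (its lineage) and recomputing a lost partition from the original input. The Python sketch below illustrates that idea only. `TinyRDD` and its methods are hypothetical names invented for this example; they are not Spark's API, and real RDDs handle partitioning, persistence, and scheduling in far more detail.

```python
class TinyRDD:
    """Toy stand-in for an RDD: source partitions plus a recorded lineage."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions      # durable input, e.g. on distributed storage
        self._transforms = tuple(transforms)  # lineage: functions applied in order
        self._cache = {}                      # partition index -> values held in memory

    def map(self, fn):
        """Return a new dataset whose lineage has one more transformation."""
        return TinyRDD(self._source, self._transforms + (fn,))

    def compute(self, index):
        """Return a partition, rebuilding it from lineage if it is not in memory."""
        if index not in self._cache:
            values = self._source[index]
            for fn in self._transforms:
                values = [fn(v) for v in values]
            self._cache[index] = values
        return self._cache[index]

    def lose_partition(self, index):
        """Simulate a node failure that wipes one in-memory partition."""
        self._cache.pop(index, None)

    def collect(self):
        """Gather all partitions into one list."""
        return [v for i in range(len(self._source)) for v in self.compute(i)]


rdd = TinyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)

before = rdd.collect()
rdd.lose_partition(1)      # the second partition disappears from memory
after = rdd.collect()      # it is recomputed from the source and the lineage

print(before)  # [11, 21, 31, 41]
print(after)   # [11, 21, 31, 41]
```

Only the lost partition is rebuilt; the partition that survived in memory is reused as is.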