Understand the similarities and differences between the big data frameworks Hadoop and Spark in 2 minutes
When it comes to big data, you have probably heard of Hadoop and Apache Spark. However, our understanding of them often stops at the names themselves, without much deeper thought. Let's take a look at their similarities and differences together.
They solve problems at different levels
First, Hadoop and Apache Spark are both big data frameworks, but they serve different purposes. Hadoop is essentially a distributed data infrastructure: it distributes massive datasets across the nodes of a cluster built from commodity machines, which means you do not need to buy and maintain expensive dedicated server hardware.
At the same time, Hadoop indexes and tracks that data, raising the efficiency of big data processing and analysis to an unprecedented level. Spark, on the other hand, is a tool dedicated to processing big data that lives in distributed storage; it does not provide distributed storage itself.
The two can be used together or separately
In addition to the distributed data storage provided by HDFS, Hadoop also offers a data processing component called MapReduce. So we can set Spark aside entirely and use Hadoop's own MapReduce to get our data processing done.
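To make that concrete, here is what a classic word-count job looks like as a pair of Hadoop Streaming scripts in Python. This is a minimal sketch: the file names mapper.py and reducer.py are just examples, and a real job would be submitted through the hadoop-streaming jar that ships with Hadoop, passing these scripts as the -mapper and -reducer options.

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word read from stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python3
    # reducer.py -- sum the counts for each word; Hadoop sorts the mapper
    # output by key, so identical words arrive on consecutive lines
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")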
Conversely, Spark does not have to attach itself to Hadoop to survive. But as mentioned above, it does not provide a file management system of its own, so it must be paired with some distributed file system. That can be Hadoop's HDFS, or another cloud-based data platform. Spark is still most often deployed on top of Hadoop, though, simply because the combination is widely considered the best fit.
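For example, here is a minimal PySpark sketch that reads a plain local file rather than HDFS (the file path is hypothetical):

    from pyspark.sql import SparkSession

    # Spark can run without Hadoop: local files, S3, and other stores
    # all work as input sources (the path below is just an example).
    spark = SparkSession.builder.appName("no-hdfs-demo").getOrCreate()
    lines = spark.read.text("file:///tmp/sample.txt")  # local file, not HDFS
    print(lines.count())
    spark.stop()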
The following concise and clear explanation of MapReduce comes from the blogger "Tiandi Zhuhai sub-ship", who collected it from around the Internet:
Suppose we need to count all the books in the library. You count shelf 1 and I count shelf 2. That is "Map". The more people we have, the faster the counting goes.
Now we gather everyone's tallies and add them together. That is "Reduce".
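In plain Python, the same idea looks roughly like this (a toy sketch, not actual Hadoop code; the shelf contents are made up):

    from functools import reduce

    # Each "shelf" is counted independently -- the Map step.
    shelves = [["book"] * 120, ["book"] * 85, ["book"] * 230]  # made-up shelves
    per_shelf_counts = list(map(len, shelves))  # Map: [120, 85, 230]

    # The partial counts are then combined -- the Reduce step.
    total = reduce(lambda a, b: a + b, per_shelf_counts)  # Reduce: 435
    print(total)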
Spark processes data far faster than MapReduce
Spark is much faster than MapReduce because of the way it processes data. MapReduce works in discrete steps: "It reads data from the cluster, performs one pass of processing, writes the results back to the cluster, reads the updated data from the cluster, performs the next pass of processing, writes those results back to the cluster, and so on," is how Kirk Borne, a data scientist at Booz Allen Hamilton, describes it.
Spark, in contrast, completes the entire data analysis in memory, in near real time: "It reads the data from the cluster, performs all the necessary analysis and processing, writes the results back to the cluster, and it's done," Borne said. Spark's batch processing is nearly 10 times faster than MapReduce, and its in-memory data analysis is nearly 100 times faster.
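The key difference is that Spark can keep an intermediate dataset in memory between passes instead of writing it to disk each time. A minimal PySpark sketch (the file path and column names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    # Read once, then cache in memory so later passes skip the disk.
    df = spark.read.csv("hdfs:///data/events.csv", header=True).cache()

    # Two separate passes over the same data: only the first actually
    # reads from storage, the second is served from the cache.
    print(df.count())
    print(df.filter(df["status"] == "error").count())

    spark.stop()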
If the data you process and the results you need are mostly static, and you have the patience to wait for a batch job to finish, MapReduce's approach is perfectly acceptable.
However, if you need to analyze streaming data, such as readings collected from sensors on a factory floor, or if your application needs to make multiple passes over the data, then Spark is the tool to use, as the sketch below shows.
Most machine learning algorithms require multiple passes over the data. Beyond that, Spark is commonly used in scenarios such as real-time marketing campaigns, online product recommendations, cybersecurity analytics, and machine log monitoring.
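As an illustration of the streaming case, here is a hedged Structured Streaming sketch that treats new CSV files landing in a directory as a sensor stream (the directory and schema are made up for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # Treat new CSV files appearing in a directory as an unbounded stream
    # (the directory and schema here are illustrative only).
    sensors = (spark.readStream
               .schema("sensor_id STRING, temperature DOUBLE")
               .csv("/data/incoming/"))

    # Continuously maintain the average temperature per sensor.
    averages = sensors.groupBy("sensor_id").avg("temperature")

    query = (averages.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()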
Disaster recovery
The two recover from failures in quite different ways, but both do it well. Because Hadoop writes data to disk after each processing step, it is naturally resilient to system errors.
Spark's data objects are stored as Resilient Distributed Datasets (RDDs) spread across the cluster. "These data objects can be kept in memory or on disk, so an RDD provides full recovery from failures as well," Borne pointed out.
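As a small, hedged illustration of that memory-or-disk choice, this is how a persistence level can be set on an RDD in PySpark (the data here is made up):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-persist-demo").getOrCreate()
    sc = spark.sparkContext

    # A toy RDD; a real workload would load data from distributed storage.
    rdd = sc.parallelize(range(1_000_000))

    # Keep partitions in memory, spilling to disk when memory runs short.
    # If a node is lost, Spark rebuilds the missing partitions from the
    # RDD's lineage instead of restarting the whole job.
    rdd = rdd.persist(StorageLevel.MEMORY_AND_DISK)

    print(rdd.sum())
    spark.stop()

Either way, recovery from failure is built into the framework rather than bolted on afterwards.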