When it comes to big data, you have probably heard the names Hadoop and Apache Spark. But we often stop at a literal understanding of them without digging any deeper, so what follows is my take on their similarities and differences.
They solve problems at different levels.
First, both Hadoop and Apache Spark are big data frameworks, but they exist for different purposes. Hadoop is essentially a distributed data infrastructure: it distributes huge datasets across the nodes of a cluster of commodity machines for storage, which means you don't need to buy and maintain expensive server hardware.
At the same time, Hadoop indexes and tracks that data, making big data processing and analysis more efficient than ever before. Spark, on the other hand, is a tool dedicated to processing big data held in distributed storage; it does not store the data itself.
The two can be used together or separately.
Besides the HDFS distributed data storage capability it is known for, Hadoop also provides a data processing component called MapReduce. So we can set Spark aside entirely and use Hadoop's own MapReduce to process our data.
Conversely, Spark does not need Hadoop to survive. But as noted above, it provides no file management system of its own, so it must be integrated with a distributed file system to operate. That can be Hadoop's HDFS or another cloud-based data platform. By default, though, Spark is still deployed on top of Hadoop, which is widely regarded as the best combination.
The following excerpt from Heaven Zhuhai Branch Rudder is about the most concise and clear explanation of MapReduce you will find online:
We need to count all the books in the library. You count shelf 1, I'll count shelf 2. That's "Map". The more of us there are, the faster the counting goes.
Now we come together and add up everyone's individual counts. That's "Reduce".
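To make the analogy concrete, here is a minimal word-count sketch in the style of Hadoop Streaming, where the map and reduce steps are plain scripts reading from standard input. The file names and the streaming setup are illustrative assumptions, not part of the original article:

```python
# mapper.py -- the "Map" step: each worker counts the words on its own "shelf".
import sys

for line in sys.stdin:
    for word in line.split():
        # Emit (word, 1) pairs; Hadoop groups pairs by key before the reduce step.
        print(f"{word}\t1")
```

```python
# reducer.py -- the "Reduce" step: add up everyone's individual counts.
import sys
from collections import defaultdict

counts = defaultdict(int)
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    counts[word] += int(n)

for word, total in counts.items():
    print(f"{word}\t{total}")
```

Locally you can simulate the whole pipeline with "cat books.txt | python mapper.py | python reducer.py"; on a cluster, Hadoop runs many copies of the mapper in parallel, one per block of input.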
Spark crushes MapReduce on data processing speed.
Spark is much faster than MapReduce because of the way it processes data. MapReduce works in steps: "read data from the cluster, perform an operation, write the results to the cluster, read the updated data from the cluster, perform the next operation, write the results to the cluster, and so on," is how Kirk Borne, a data scientist at Booz Allen Hamilton, describes it.
Spark, by contrast, performs all of its data analytics in memory in near "real time": "read the data from the cluster, perform all the necessary analytical operations on it, write the results back to the cluster, done," Borne says. Spark's batch processing is nearly 10 times faster than MapReduce, and its in-memory analytics are nearly 100 times faster.
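As an illustration of that in-memory style, here is a minimal PySpark word-count sketch; the HDFS path and application name are made up for the example:

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")  # illustrative app name

# Read once from distributed storage (e.g. HDFS; path is hypothetical).
lines = sc.textFile("hdfs:///data/books.txt")
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Keep the result in memory so later actions reuse it instead of re-reading disk.
counts.cache()

print(counts.count())                                 # first action computes the RDD
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # second pass runs from memory

sc.stop()
```

The read-process-write cycle happens once; intermediate results never have to be flushed to disk between steps, which is where most of the speedup over MapReduce comes from.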
If the data you process and your reporting requirements are mostly static, and you can afford to wait for batch jobs to finish, MapReduce's way of working is perfectly acceptable.
But if you need to analyze streaming data, such as readings collected from sensors on a factory floor, or if your application requires multiple passes over the data, you will probably want to use Spark.
Most machine learning algorithms, for example, require multiple passes over the data. Other common Spark use cases include real-time marketing campaigns, online product recommendations, network security analytics, machine log monitoring, and more.
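For the streaming case, a minimal Spark Streaming sketch might look like the following; the socket source, host name, and port are all illustrative assumptions:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SensorStream")    # illustrative app name
ssc = StreamingContext(sc, batchDuration=5)  # a new micro-batch every 5 seconds

# Hypothetical source: sensor readings arriving as "sensor_id value" text lines.
lines = ssc.socketTextStream("sensor-gateway", 9999)

# Count events per sensor within each micro-batch as the data flows in.
counts = lines.map(lambda line: (line.split()[0], 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()  # print each batch's counts as soon as they are computed

ssc.start()
ssc.awaitTermination()
```

Each micro-batch is processed as it arrives rather than after a complete dataset has been assembled, which is exactly the pattern MapReduce's batch model handles poorly.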
Disaster recovery
Their approaches to disaster recovery differ, but both are very good. Because Hadoop writes data to disk after every operation, it is inherently resilient to system faults.
Spark's data objects are stored in Resilient Distributed Datasets (RDDs) spread across the data cluster. "These data objects can be held either in memory or on disk, so RDDs can provide full disaster-recovery capability as well," Borne points out.
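Here is a short sketch of what that memory-or-disk choice looks like in PySpark; the input path and application name are illustrative:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="RDDRecovery")  # illustrative app name

# Hypothetical input; the lineage is: read the file, then transform each line.
events = sc.textFile("hdfs:///data/events.log").map(lambda line: line.upper())

# Keep partitions in memory, spilling to disk when memory runs short.
events.persist(StorageLevel.MEMORY_AND_DISK)

# If a node is lost, Spark rebuilds the missing partitions by replaying the
# lineage above (textFile + map) instead of relying on stored replicas.
print(events.count())

sc.stop()
```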
Note: This article was compiled from InfoWorld by Heaven Zhuhai Branch Rudder. More articles are available via the public account Techgogogo or the blog http://techgogogo.com. Please attribute when reproducing.