Spark is a distributed computing platform: a computing framework written in Scala, and a fast, general-purpose, scalable, in-memory big data analysis engine.
Hadoop is an ecosystem for distributed management, storage, and computing, comprising HDFS (storage), MapReduce (computing), and YARN (resource scheduling).
1. Comparison of implementation principles
Hadoop and Spark are both parallel computing frameworks, and both compute with the MR (MapReduce) model.
In Hadoop, a unit of work is called a job. A job is divided into Map Task and Reduce Task phases; each task runs in its own process, and the process exits when the task finishes.
A task submitted by a Spark user is called an application. One application corresponds to one SparkContext. An application contains multiple jobs: each action triggers a job, and these jobs can run in parallel or serially. Each job contains multiple stages, which the DAGScheduler derives by splitting the job at shuffle boundaries according to the dependencies between RDDs. Each stage contains multiple tasks, which form a task set that the TaskScheduler distributes to executors for execution. An executor lives as long as the application, even when no job is running, so tasks can start quickly and compute on data held in memory.
PS: one application -> multiple jobs; one job -> multiple stages; one stage -> multiple tasks.
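A minimal sketch of this hierarchy (the input path and application name below are hypothetical): each action triggers its own job, and the reduceByKey shuffle splits the second job into two stages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object JobStageDemo {
  def main(args: Array[String]): Unit = {
    // One application corresponds to one SparkContext.
    val sc = new SparkContext(
      new SparkConf().setAppName("job-stage-demo").setMaster("local[*]"))

    val lines = sc.textFile("hdfs:///data/input.txt") // hypothetical path

    // Transformations are lazy; no job has run yet.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Each action triggers one job.
    val total = lines.count()   // job 1: a single stage (no shuffle)
    val top   = counts.take(10) // job 2: two stages, split at the reduceByKey shuffle

    println(s"lines = $total; sample = ${top.mkString(", ")}")
    sc.stop()
  }
}
```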
2. Comparison of the two
(1) Spark replaces the MR computation module in Hadoop, with far better speed and efficiency than MR;
(2) Spark does not provide a file management system, so it must be integrated with another distributed file system. It is purely a computation and analysis framework for processing data held in distributed storage; it cannot store data itself;
(3) Spark can use Hadoop's HDFS or other cloud data platforms for storage, but HDFS is the usual choice;
(4) Spark can use HBase on top of HDFS, read HDFS data files directly, or access MySQL data through a JDBC connection (see the JDBC sketch after this list); Spark can modify and delete database records, whereas HDFS supports only appends and whole-file deletion;
(5) Spark's data processing speed far outstrips Hadoop's MR;
(6) Spark's data processing design differs from MR's. Hadoop reads data from HDFS, writes intermediate results back to HDFS through MR, then reads from HDFS again for the next MR round and flushes the output to HDFS once more; this involves repeated disk spills and repeated disk I/O, so efficiency is low. Spark instead reads data from the cluster, keeps data and operations in memory, and writes back to the cluster only after all operations are complete;
(7) Spark is a fast, efficient computation engine born of the inefficiency of Hadoop's MR: its batch processing is nearly 10 times faster than MR, and its in-memory data analysis is nearly 100 times faster than Hadoop (per the official site);
(8) RDDs in Spark generally reside in memory; when memory is insufficient, disk is used alongside it. Fault tolerance comes from the lineage between RDDs and from persisting or checkpointing data (which cuts the lineage), so lost data can be recomputed or reloaded (see the caching and lineage sketch after this list). This is similar to Hadoop, where data is inherently recoverable because everything is read from and written to disk;
(9) Spark introduces the concept of in-memory cluster computing, in which datasets can be cached in memory to reduce access latency, complementing point (7);
(10) Spark achieves good fault tolerance through its DAG graphs.
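As a hedged sketch of point (4), reading a MySQL table over JDBC into a DataFrame; the URL, table name, and credentials below are placeholders, and the MySQL JDBC driver is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object JdbcReadDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-read-demo") // hypothetical application name
      .master("local[*]")
      .getOrCreate()

    // Read a MySQL table into a DataFrame over JDBC.
    // URL, table, user, and password are all placeholders.
    val orders = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb")
      .option("dbtable", "orders")
      .option("user", "spark_user")
      .option("password", "secret")
      .load()

    orders.show(10)
    spark.stop()
  }
}
```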
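And a sketch of the caching and lineage mechanics behind points (8) through (10), with hypothetical paths: cache() keeps an RDD in memory and recomputes lost partitions from lineage, while checkpoint() writes the RDD to reliable storage and cuts the lineage, so recovery no longer replays the whole DAG.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheLineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-lineage-demo").setMaster("local[*]"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical path

    val raw    = sc.textFile("hdfs:///data/events.log") // hypothetical path
    val parsed = raw.map(_.split(",")).filter(_.length > 1)

    // cache(): keep partitions in memory; a lost partition is
    // recomputed from the lineage (raw -> map -> filter).
    parsed.cache()

    // checkpoint(): persist to reliable storage and cut the lineage,
    // so recovery reads the checkpoint instead of replaying the DAG.
    parsed.checkpoint()

    println(parsed.count()) // action: materializes both the cache and the checkpoint
    sc.stop()
  }
}
```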
3. The superiority of Spark over Hadoop
(1) Spark is built on RDDs. Data is not stored inside an RDD; rather, data flows through RDD transformations, and through a decorator-style design pattern each new RDD wraps its parent, forming lineage and type conversions between datasets;
(2) Spark is written in Scala, which is more concise than Hadoop programs written in Java;
(3) Whereas Hadoop provides only Map and Reduce operations for computation, Spark offers a rich set of operators: combining RDD transformation operators with RDD action operators implements many complex algorithms that in Hadoop you would have to write yourself, but that Spark already encapsulates in Scala for direct use (see the operator sketch after this list);
(4) In Hadoop, a job has only one Map stage and one Reduce stage, so complex computations require chaining multiple MR jobs, which means repeated disk spills and disk I/O and low efficiency. In Spark, a single job can contain multiple RDD transformation operators and generate multiple stages during scheduling, enabling more complex functionality;
(5) Hadoop stores intermediate results in HDFS, and every MR round must flush to and read back from disk; Spark keeps intermediate results in memory first, spilling to local disk rather than HDFS only when memory runs out, avoiding a great deal of I/O and flush/read operations;
(6) Hadoop suits static data and handles iterative and streaming data poorly; Spark caches processed data in memory, improving its performance on streaming and iterative workloads;
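To make points (3) and (4) concrete, a hedged sketch (input path hypothetical): what takes a Mapper class, a Reducer class, and a driver in Hadoop MR, or even two chained MR jobs, is one chain of operators in a single Spark job; each shuffle starts a new stage, with no HDFS round-trips in between.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RichOperatorsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rich-operators-demo").setMaster("local[*]"))

    // One job chains many operators; each shuffle (reduceByKey, sortBy)
    // starts a new stage -- roughly the equivalent of two chained MR jobs.
    val topWords = sc.textFile("hdfs:///data/input.txt") // hypothetical path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)              // shuffle 1: count words
      .sortBy(_._2, ascending = false) // shuffle 2: rank by count
      .take(10)                        // action: triggers the job

    topWords.foreach(println)
    sc.stop()
  }
}
```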
4. Three major distributed computing systems
Hadoop is suited to processing offline, static big data;
Spark is suited to processing offline streaming big data;
Storm / Flink are suited to processing online, real-time big data.