The high performance of Apache Spark depends in part on the asynchronous concurrency model it employs on the server/driver side, which is consistent with Hadoop 2.0 (including YARN and MapReduce). Hadoop 2.0 implements an actor-like asynchronous concurrency model of its own, built on epoll and state machines, while Apache Spark directly uses the open-source library Akka, a high-performance implementation of the actor model. Although the two share a consistent concurrency model on the server side, they use different parallelism mechanisms at the task level (that is, for Spark tasks and MapReduce tasks): Hadoop MapReduce employs a multi-process model, while Spark employs a multithreaded model.
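To make the actor model concrete, the following is a minimal sketch using classic Akka in Scala. It is not Spark's internal code, and the actor and message names are made up; it only shows the style of asynchronous message passing that such a server relies on instead of shared-state locking.

import akka.actor.{Actor, ActorSystem, Props}

// A trivial actor: it reacts to messages one at a time, asynchronously.
class HeartbeatReceiver extends Actor {
  def receive: Receive = {
    case msg => println(s"received: $msg")   // no locks; state stays confined to the actor
  }
}

object ActorModelDemo extends App {
  val system = ActorSystem("demo")
  val receiver = system.actorOf(Props[HeartbeatReceiver], "heartbeat")
  receiver ! "executor-1 is alive"           // fire-and-forget message send
  Thread.sleep(500)                          // give the asynchronous delivery time to run
  system.terminate()
}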
Note that multi-process and multithreading here refer to how multiple tasks run on the same node. Taken as a whole, both MapReduce and Spark applications are multi-process: a MapReduce application consists of multiple independent Task processes, while a Spark application runs inside a temporary resource pool built from separate Executor processes.
The multi-process model makes it easy to control the resources consumed by each task at a fine granularity, but task startup is slow, which makes it unsuitable for low-latency jobs; this is one of the reasons MapReduce is widely criticized. The multithreaded model, in contrast, makes Spark a good fit for low-latency jobs. In short, all tasks on the same Spark node run as threads inside a single JVM process, which brings the following benefits:
1) Tasks start quickly, in contrast to MapReduce task processes, whose slow startup usually takes about 1 second;
2) All tasks on the same node run in one process, which makes it easy to share memory. This suits memory-intensive tasks, especially applications that need to load large dictionaries, and saves considerable memory (see the broadcast sketch after this list);
3) All tasks on the same node run in a single JVM process (the Executor), and the Executor's resources can be used by successive batches of tasks rather than being released after a few tasks finish. This avoids the cost of each task repeatedly requesting resources and, for applications with very many tasks, can greatly reduce the total running time. Each MapReduce task, in contrast, requests resources separately and releases them immediately after use, so they cannot be reused by other tasks; MapReduce 1.0 supported JVM reuse to partially compensate for this, but MapReduce 2.0 does not yet support that feature.
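To illustrate point 2, here is a minimal Scala sketch (the dictionary contents, variable names, and local master are made up for illustration). SparkContext#broadcast ships the dictionary to each Executor exactly once, and every task thread in that Executor's JVM reads the same in-memory copy instead of loading its own:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    // local[2] is used only so the sketch runs standalone.
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-demo").setMaster("local[2]"))
    // A stand-in for a large dictionary loaded once on the driver.
    val dict = Map("spark" -> 1, "hadoop" -> 2)
    // Broadcast it: one copy per Executor, shared by all task threads there.
    val bcDict = sc.broadcast(dict)
    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "hive"))
    // Each task looks words up in the shared copy via bcDict.value.
    val tagged = words.map(w => (w, bcDict.value.getOrElse(w, 0))).collect()
    tagged.foreach(println)
    sc.stop()
  }
}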
Although Spark's threading model brings many benefits, there are also deficiencies, mainly:
1) Because all tasks on the same node run in one process, serious resource contention can occur, and it is difficult to control the resources each task consumes at a fine granularity. MapReduce, by contrast, lets the user set resources for map tasks and reduce tasks separately, giving fine-grained control over how much each task consumes, which helps large jobs run normally and smoothly (a configuration sketch follows below).
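As a sketch of that per-task-type control, the following Scala snippet calls the standard Hadoop MRv2 API; the memory and vcore values are arbitrary examples, not recommendations:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object PerTaskResources {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Map and reduce containers are sized independently under YARN.
    conf.setInt("mapreduce.map.memory.mb", 2048)     // each map task container: 2 GB
    conf.setInt("mapreduce.map.cpu.vcores", 1)       // and 1 virtual core
    conf.setInt("mapreduce.reduce.memory.mb", 4096)  // reduce containers get a different size
    conf.setInt("mapreduce.reduce.cpu.vcores", 2)
    val job = Job.getInstance(conf, "per-task-resources")
    // Mapper/reducer classes and input/output paths would be configured here as usual.
  }
}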
The following is a brief introduction to the multi-process model of MapReduce and the multithreaded model of Spark.
(1) MapReduce multi-process model
1) Each task runs in a separate JVM process;
2) Different resource amounts can be set for different types of task; currently memory and CPU are the two supported resources;
3) When a task finishes, it releases its resources, which cannot be reused by other tasks, not even tasks of the same type in the same job. That is, each task goes through the cycle "request resources -> run task -> release resources."
(2) Spark multithreaded model
1) Each node can run one or more Executor services;
2) Each Executor has a certain number of slots, indicating how many ShuffleMapTasks or ResultTasks can run in it concurrently (see the configuration sketch after this list);
3) Each Executor runs in its own JVM process, and each task is a thread running inside an Executor;
4) Tasks within the same Executor can share memory. For example, a file or data structure broadcast through SparkContext#broadcast is loaded only once per Executor, rather than once per task as in MapReduce;
5) Once started, an Executor keeps running, and its resources can be reused by task after task; they are released only when the Spark application finishes and the Executor exits.
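A minimal Scala sketch of how an application sizes this reusable Executor pool follows; the numbers are only examples, the master URL is assumed to be supplied by spark-submit, and "up to 8 concurrent tasks" assumes the default spark.task.cpus = 1:

import org.apache.spark.{SparkConf, SparkContext}

object ExecutorPoolDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("executor-pool-demo")
      .set("spark.executor.instances", "4")  // four long-lived Executor JVMs
      .set("spark.executor.cores", "8")      // up to 8 task threads per Executor at once
      .set("spark.executor.memory", "16g")   // heap shared by all tasks in that Executor
    val sc = new SparkContext(conf)
    // All ShuffleMapTasks and ResultTasks of this application now run as threads
    // inside those Executors; the pool is released only when sc.stop() is called.
    println(sc.parallelize(1 to 1000000).sum())
    sc.stop()
  }
}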
In general, Spark uses the classic scheduler/workers model: the first step of every Spark application is to build a reusable resource pool, and then run all of its ShuffleMapTasks and ResultTasks in that pool. (Note that despite the flexibility of Spark programming, which is no longer limited to writing mappers and reducers, only two types of task inside the Spark engine, ShuffleMapTask and ResultTask, are needed to express a complex application.) A MapReduce application, in contrast, does not build a reusable resource pool; instead, each task dynamically requests resources and releases them as soon as it finishes.