Spark Tutorial: Architecture for Spark


I recently came across a post on Spark architecture by Alexey Grishchenko. Anyone who has read Alexey's blog knows that he understands Spark very deeply, and reading his "Spark Architecture" post feels like suddenly seeing clearly: it walks step by step from JVM memory allocation to Spark cluster resource management. So, in my spare time over the weekend, I translated the core content of that article into Chinese to share with you here. If I got anything wrong in the translation, please point out the mistake.


First, look at a diagram from the official Spark 1.3.0 documentation:

The diagram contains a number of terms such as "Executor", "Task", "Cache", and "Worker Node". The original author says that when he started learning Spark, this picture was essentially the only architecture diagram available (for Spark 1.3.0), and the situation was rather bleak. Worse, the diagram fails to convey some of the concepts inherent to Spark. As he continued to learn, he organized his knowledge into a series of articles, of which this is only one. The core points follow below.

Spark Memory Allocation

Any Spark program, whether it runs on a cluster or on your local machine, is a JVM process. As with any JVM process, you can configure its heap size with -Xmx and -Xms. The question is: how does this process use its heap memory, and why does it need it? The rest of this section unfolds around that question.
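
For instance, here is a generic JVM sketch (not Spark-specific, and not from the original article) that prints the heap limit a process actually received; Runtime.maxMemory is roughly the -Xmx value, and all the fractions discussed below are applied against this limit.

```scala
object HeapCheck {
  def main(args: Array[String]): Unit = {
    // Roughly the -Xmx limit the JVM was started with, in megabytes.
    val maxHeapMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
    println(s"JVM max heap: $maxHeapMb MB")
  }
}
```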

First, take a look at the following diagram of Spark's JVM heap memory allocation:

[Figure: Spark heap usage]

Heap Size

By default, Spark starts with 512 MB of JVM heap. To be on the safe side and avoid OOM errors, Spark allows itself to use only 90% of that heap; this fraction is controlled by the spark.storage.safetyFraction parameter. You may have heard that Spark is a memory-based tool that lets you keep data in memory. If you have read the author's article on Spark misconceptions, you should know that Spark is not really an "in-memory" tool; it simply uses memory for an LRU cache (http://en.wikipedia.org/wiki/Cache_algorithms). Part of the heap is used for caching data, typically 60% of the safe heap (the 90%), and this share can be adjusted with spark.storage.memoryFraction. So, if you want to know how much data you can cache in Spark, sum the heap sizes of all executors and multiply by safetyFraction and memoryFraction; by default this is 0.9 * 0.6 = 0.54, i.e. 54% of the total heap memory is available to Spark for caching data.
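
As a minimal sketch of that arithmetic (the 4 GB heap is an assumed example value; the fractions are the Spark 1.x defaults quoted above):

```scala
object StorageMemoryEstimate {
  def main(args: Array[String]): Unit = {
    val heapBytes      = 4L * 1024 * 1024 * 1024  // assume a 4 GB executor heap (-Xmx4g)
    val safetyFraction = 0.9                      // spark.storage.safetyFraction default
    val memoryFraction = 0.6                      // spark.storage.memoryFraction default

    // Memory usable for the LRU cache: heap * 0.9 * 0.6 = 54% of the heap.
    val cacheBytes = (heapBytes * safetyFraction * memoryFraction).toLong
    println(f"Cache capacity: ${cacheBytes / math.pow(1024, 3)}%.2f GB")  // ~2.16 GB
  }
}
```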

Shuffle Memory

Next, shuffle memory. It is computed as "Heap Size" * spark.shuffle.safetyFraction * spark.shuffle.memoryFraction. The default value of spark.shuffle.safetyFraction is 0.8 (80%), and the default value of spark.shuffle.memoryFraction is 0.2 (20%), so the share of the JVM heap that can ultimately be used for shuffle is 0.8 * 0.2 = 0.16, i.e. 16% of the total heap. The question is: how does Spark use this memory? The official GitHub repository has a more detailed explanation (https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/shuffle/ShuffleMemoryManager.scala). In general, Spark uses this memory for the tasks of the shuffle phase. When a shuffle is performed, the data sometimes needs to be sorted, and sorting usually requires a buffer-like area to hold the sorted data (remember, you cannot modify data that is already in the LRU cache, because it may be reused later). So a certain amount of RAM is needed to store the sorted blocks. What if there is not enough memory to sort? Search Wikipedia for "external sorting" and read it carefully: external sorting lets you sort chunks of the data separately and then merge the final results together.
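
For illustration, here is a hedged Scala sketch of how these legacy Spark 1.x fractions could be set explicitly on a SparkConf; the values shown are just the defaults described above, and later Spark versions replaced this model with a unified memory manager.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-memory-demo")
  .set("spark.shuffle.safetyFraction", "0.8")  // safe portion of the heap for shuffle
  .set("spark.shuffle.memoryFraction", "0.2")  // share of that safe portion given to shuffle
// Effective shuffle memory: heap * 0.8 * 0.2 = 16% of the heap.
```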

Unroll Memory

Finally, the "unroll" memory. The total amount of RAM for the unroll process is computed as spark.storage.unrollFraction * spark.storage.memoryFraction * spark.storage.safetyFraction. By default this is 0.2 * 0.6 * 0.9 = 0.108, i.e. 10.8% of the heap. It is used when a block of data needs to be expanded (unrolled) in memory. Why is an unroll operation needed? Spark allows data to be stored in either serialized or deserialized form. Serialized data cannot be used directly, so it must be unrolled before use; this portion of RAM is the memory used for that unrolling. Unroll memory is shared with storage RAM: if you need RAM to unroll a block and there is not enough memory available at that moment, this may force the eviction of some data blocks held in Spark's LRU cache.
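
To connect this to user code: serialized caching is what makes unrolling necessary. Below is a minimal sketch under the assumption that sc is an existing SparkContext and the HDFS path is a made-up placeholder.

```scala
import org.apache.spark.storage.StorageLevel

// Blocks persisted in serialized form must be unrolled (turned back into Java objects)
// before they can be iterated over, which draws on the unroll memory described above.
val lines = sc.textFile("hdfs:///path/to/input")  // hypothetical input path
lines.persist(StorageLevel.MEMORY_ONLY_SER)
println(lines.count())
```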

JVM Allocation in Spark Cluster Mode

OK, with the explanation above we should now have a better understanding of a Spark process and how it uses the memory of its JVM. Now let's switch to the cluster, taking YARN mode as an example.

[Figure: Spark architecture on YARN]

A YARN cluster has a YARN ResourceManager daemon that controls the cluster's resources (i.e., memory) and a set of YARN NodeManagers, running on each node of the cluster, that control resource usage on their node. From YARN's point of view, each node is a pool of RAM that can be allocated. When you send a resource request to the ResourceManager, it returns information about NodeManagers; each NodeManager then launches execution containers for you, and each execution container is a JVM process with the heap size you specified in the request. Where these JVMs are placed is decided by the YARN ResourceManager, and you have no control over it: if a node has 64 GB of RAM under YARN's control (set with the yarn.nodemanager.resource.memory-mb parameter in yarn-site.xml) and you request 10 executors of 4 GB each, all of these executors may end up running on the same node, even if your cluster is large.

When starting Spark on a YARN cluster, you can specify the number of executors (--num-executors or spark.executor.instances), the heap memory size of each executor (--executor-memory or spark.executor.memory), the number of CPU cores used by each executor (--executor-cores or spark.executor.cores), the number of cores allocated to each task (spark.task.cpus), and the memory used by the driver (--driver-memory or spark.driver.memory).
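
As an illustrative sketch, the same settings can also be expressed programmatically; the values below are arbitrary assumptions, only the configuration keys come from the text above.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("resource-allocation-demo")
  .set("spark.executor.instances", "10")  // --num-executors 10
  .set("spark.executor.memory", "4g")     // --executor-memory 4g
  .set("spark.executor.cores", "4")       // --executor-cores 4
  .set("spark.task.cpus", "1")            // cores reserved for each task
  .set("spark.driver.memory", "2g")       // --driver-memory 2g
```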

When you run an application on the cluster, a job is cut into multiple stages, each stage is split into multiple tasks, and each task is scheduled separately. An executor JVM process can be viewed as a pool of task execution slots, with each executor providing spark.executor.cores / spark.task.cpus slots for your tasks. For example, suppose the cluster has 12 nodes running YARN NodeManagers, each with 64 GB of RAM and 32 CPU cores (16 physical cores with hyper-threading). Each node can launch 2 executors of 26 GB each (the rest of the RAM is left for system processes, the YARN NodeManager and the DataNode), and each executor gets 12 CPU cores for tasks (again, the rest is left for system processes, the YARN NodeManager and the DataNode). The whole cluster therefore provides 12 machines * 2 executors per machine * 12 cores per executor / 1 core per task = 288 task execution slots, which means your Spark cluster can run 288 tasks at the same time, making use of almost all of its resources. The memory available for caching data across the entire cluster is 0.9 (spark.storage.safetyFraction) * 0.6 (spark.storage.memoryFraction) * 12 machines * 2 executors per machine * 26 GB per executor = 336.96 GB. In practice it is not quite that much, but in most cases it is enough.
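
To make the arithmetic explicit, here is a small self-contained Scala sketch that reproduces the example numbers above; all the inputs are the assumed values from the example, not queried from a real cluster.

```scala
object ClusterSizingExample {
  def main(args: Array[String]): Unit = {
    val machines            = 12
    val executorsPerMachine = 2
    val coresPerExecutor    = 12    // spark.executor.cores
    val cpusPerTask         = 1     // spark.task.cpus
    val gbPerExecutor       = 26.0  // executor heap in GB
    val safetyFraction      = 0.9   // spark.storage.safetyFraction
    val memoryFraction      = 0.6   // spark.storage.memoryFraction

    val taskSlots = machines * executorsPerMachine * coresPerExecutor / cpusPerTask
    val cacheGb   = safetyFraction * memoryFraction * machines * executorsPerMachine * gbPerExecutor

    println(s"Task execution slots: $taskSlots")       // 288
    println(f"Cluster cache memory: $cacheGb%.2f GB")  // 336.96 GB
  }
}
```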

At this point you probably know how Spark uses the JVM's memory and what the cluster's execution slots are. As for the task, it is the unit of work that Spark executes, and it runs as a thread inside an executor JVM process. This is why Spark jobs start quickly: starting a thread inside an existing JVM is much faster than starting a whole new JVM process, which is what happens when a MapReduce application runs on Hadoop.

Spark Partition

Let's talk about another Spark abstraction: the "partition". While a Spark program runs, all data is sliced into multiple partitions. The questions are: what is a partition, and what determines the number of partitions? First, the partitioning depends entirely on your data source. In Spark, most of the methods that read data let you specify the number of partitions in the resulting RDD. When you read a file from HDFS, Hadoop's InputFormat determines the partitioning: by default, each InputSplit returned by the InputFormat is mapped to one partition in the RDD. For most files on HDFS, one InputSplit is generated per data block, each approximately 64 MB / 128 MB in size. Approximately, because block boundaries on HDFS are measured in bytes (for example, 64 MB blocks), whereas the data is split by record when it is processed: for a text file the split character is the newline, for a sequence file the record ends at a block boundary, and so on. Compressed files are the special case: because the whole file is compressed, it cannot be split by lines, so the entire file becomes a single InputSplit and therefore a single partition in Spark, and you have to repartition it manually during processing.
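
A minimal Scala sketch of how this shows up in practice (the paths and partition counts are made-up assumptions, and sc is an existing SparkContext):

```scala
// Uncompressed text on HDFS: one partition per InputSplit (roughly one per block),
// unless a higher minimum is requested.
val logs = sc.textFile("hdfs:///data/logs.txt", minPartitions = 8)
println(logs.partitions.length)    // >= 8, depending on the file's block layout

// A gzip-compressed file is not splittable, so it arrives as a single partition;
// repartition it manually before doing any heavy processing.
val gz = sc.textFile("hdfs:///data/archive.txt.gz")
println(gz.partitions.length)      // 1
val spread = gz.repartition(32)    // redistribute across 32 partitions (incurs a shuffle)
```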
