Spark Sort-Based Shuffle Memory Analysis

The shuffle phase in a distributed system is often very complex, with many branching conditions, and I can only describe the paths I care about. There will certainly be mistakes; I will keep deepening my own understanding and keep updating this article.

Preface

To set the background, let me borrow a conversation with Dong Shen:

There are three kinds of shuffle in total. What people usually mean is the hash shuffle, the most primitive implementation, which had two versions. In the first version, each map produced R files, for M * R files in total. Because so many intermediate files hurt scalability, the community proposed a second, optimized version that lets all the maps running on one core share a set of files, reducing the total to core * R files. That is much better, but the number of intermediate files still grows linearly with the number of tasks, so it still struggles with big jobs; hash shuffle had been optimized as far as it could go. To solve hash shuffle's poor performance, sort shuffle was introduced, borrowing heavily from the MapReduce implementation: each map produces a single file, which completely solves the scalability problem.

Sort based shuffle is now the default shuffle type. Shuffle is a very complicated process, and any single part of it would be enough for an essay of its own. So here I try a different angle, a practical one, so that readers come away with two things: an analysis of which steps and which code can cause memory problems, and the parameters that control the related memory.

Sometimes we would rather the program run slowly than OOM; getting it to run at all comes first. I hope this article helps you achieve that.

We will also mention some class names along the way; they make it easy for you to locate the relevant code and explore it in more depth on your own.

Shuffle Overview

Spark's shuffle is divided into two stages, write and read. Let us establish three points up front:

Write corresponds to ShuffleMapTask; the actual write operation is handled by ExternalSorter.

The read phase is done by the HashShuffleReader used by ShuffledRDD. If the pulled data is too large and has to be spilled to disk, that spilling is also handled by ExternalSorter.

Read does not start until all writes have completed; the two belong to two different stages.

In other words, both the Shuffle Write and the Shuffle Read phase may spill to disk, and the final sort is completed as a merge sort over the disk files.

Shuffle Write Memory Consumption Analysis

The entry link for Shuffle Write is:

org.apache.spark.scheduler.ShuffleMapTask
---> org.apache.spark.shuffle.sort.SortShuffleWriter
   ---> org.apache.spark.util.collection.ExternalSorter

The memory bottleneck is really org.apache.spark.util.collection.ExternalSorter. Let's see where this complex ExternalSorter takes up memory.

The first place:

private var map = new PartitionedAppendOnlyMap[K, C]

We know that data is first written to memory and spilled to disk when memory is not enough. The map here is the in-memory structure that holds the data.

This PartitionedAppendOnlyMap internally maintains an array, namely:

private var data = new Array[AnyRef](2 * capacity)

In other words, what it consumes is not storage memory; so-called storage memory refers to the memory managed by the BlockManager.
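As a rough illustration of that layout, here is a simplified sketch (my own, not Spark's actual AppendOnlyMap: it uses linear probing, never grows, and omits rehashing): keys and values are interleaved in one flat array that lives on the task's own heap.

// Sketch only: keys sit at data(2 * pos), values at data(2 * pos + 1), so a single
// Array[AnyRef] of length 2 * capacity holds the whole map. This is plain heap
// owned by the running task, not BlockManager storage memory.
class FlatArrayMap[K <: AnyRef, V <: AnyRef](capacity: Int = 64) {
  private val data = new Array[AnyRef](2 * capacity)

  private def slot(key: K): Int = {
    var pos = (key.hashCode & Int.MaxValue) % capacity
    // linear probing in this sketch; no growth handling, so a full map would loop
    while (data(2 * pos) != null && data(2 * pos) != key) pos = (pos + 1) % capacity
    pos
  }

  def update(key: K, value: V): Unit = {
    val pos = slot(key)
    data(2 * pos) = key
    data(2 * pos + 1) = value
  }

  def apply(key: K): Option[V] = Option(data(2 * slot(key) + 1)).map(_.asInstanceOf[V])
}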

When the PartitionedAppendOnlyMap has to spill, it cannot write records straight to disk one by one, so a buffer is needed, and the buffer is flushed to the disk file batch after batch. The size of this buffer is controlled by the parameter:

spark.shuffle.file.buffer=32k

While the data is being processed, serialization and deserialization also need space, so Spark limits the number of records handled per batch with the following parameter:

spark.shuffle.spill.batchSize=10000
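To make the role of these two parameters concrete, here is a minimal sketch of the buffered, batched spill write, using plain Java I/O rather than Spark's actual DiskBlockObjectWriter and serializer:

import java.io._

// Sketch only: each spilling task needs roughly one 32 KB write buffer plus the
// serialization overhead of one batch (10000 records by default), which is why
// these two terms are small in the memory formula below.
object SpillWriteSketch {
  val fileBufferBytes = 32 * 1024 // spark.shuffle.file.buffer
  val batchSize = 10000           // spark.shuffle.spill.batchSize

  def spill(records: Iterator[(Int, String)], file: File): Unit = {
    val out = new DataOutputStream(
      new BufferedOutputStream(new FileOutputStream(file), fileBufferBytes))
    try {
      var inBatch = 0
      for ((k, v) <- records) {
        out.writeInt(k)
        out.writeUTF(v)
        inBatch += 1
        if (inBatch == batchSize) { out.flush(); inBatch = 0 } // cap per-batch buffering
      }
    } finally {
      out.close()
    }
  }
}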

Assuming the number of cores one executor can use is C, the required memory consumption is:

C * 32k + C * 10000 records + C * PartitionedAppendOnlyMap

Here neither the file write buffer nor the serialization batch size (whether ten thousand or a hundred thousand records) is a problem. How big can C * PartitionedAppendOnlyMap get? Let me give the conclusion first:

C * PartitionedAppendOnlyMap < ExecutorHeapMemory * 0.2 * 0.8
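To put rough numbers on this bound (the 4 GB heap and 2 cores below are assumptions of mine; 0.2 and 0.8 are the defaults whose origin is derived in the rest of this section):

// Back-of-envelope only, with assumed executor sizing.
object WriteMemoryBoundExample extends App {
  val executorHeapMb = 4096.0                    // assumption: 4 GB executor heap
  val cores = 2                                  // assumption: C = 2 concurrently running tasks
  val shufflePoolMb = executorHeapMb * 0.2 * 0.8 // 655.36 MB shared by all running tasks
  val perCoreMb = shufflePoolMb / cores          // ~327 MB average budget per PartitionedAppendOnlyMap
  println(f"shuffle pool = $shufflePoolMb%.1f MB, per-core budget ~ $perCoreMb%.1f MB")
}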

How do we get this conclusion? The core is to work out how much memory the PartitionedAppendOnlyMap is allowed to occupy, and that is determined by when the spill-to-disk action is triggered, because once the data is written to disk, the memory occupied by the PartitionedAppendOnlyMap is released. Here is the code that decides whether to spill:

estimatedSize = map.estimateSize()
if (maybeSpill(map, estimatedSize)) {
  map = new PartitionedAppendOnlyMap[K, C]
}

Every time a record is inserted, a memory check is done to see how much memory the PartitionedAppendOnlyMap occupies. If each check really measured the map, then even at 1 ms per check, ten million records would take a terrifying amount of time. That clearly cannot work, so estimateSize actually uses a sampling algorithm.

Second, we do not want maybeSpill itself to be too time-consuming, so the maybeSpill method does several things to cut its cost. Let's see what checkpoints it sets up.

It first decides whether to run the internal logic at all:

elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold

The check runs only once every 32 records, and only if the current memory of the PartitionedAppendOnlyMap (currentMemory) has reached myMemoryThreshold is a spill even considered.

The initial value of myMemoryThreshold comes from the following configuration:

spark.shuffle.spill.initialMemoryThreshold = 5 * 1024 * 1024

Next, the ShuffleMemoryManager is asked for 2 * currentMemory - myMemoryThreshold of memory. The ShuffleMemoryManager manages the memory that all running tasks (cores) in the executor can allocate, which amounts to:

ExecutorHeapMemory * 0.2 * 0.8

The above numbers can be changed by the following two configurations:

spark.shuffle.memoryFraction=0.2
spark.shuffle.safetyFraction=0.8

If enough memory cannot be obtained, the real spill operation is triggered.
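Putting the checkpoints above together, here is a compressed sketch of the decision flow (not Spark's actual maybeSpill; the acquire function is a stand-in for the ShuffleMemoryManager request and returns how many bytes were actually granted):

// Sketch of the spill decision described above.
class SpillDecision(acquire: Long => Long,
                    initialThreshold: Long = 5L * 1024 * 1024) { // spark.shuffle.spill.initialMemoryThreshold
  private var myMemoryThreshold = initialThreshold
  private var elementsRead = 0L

  // Called once per inserted record with the sampled size estimate of the map.
  def maybeSpill(currentMemory: Long): Boolean = {
    elementsRead += 1
    if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
      val amountToRequest = 2 * currentMemory - myMemoryThreshold
      myMemoryThreshold += acquire(amountToRequest) // grow the quota if the shared pool allows
      if (currentMemory >= myMemoryThreshold) {
        myMemoryThreshold = initialThreshold        // spill: caller writes to disk and rebuilds the map
        return true
      }
    }
    false
  }
}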

Having seen this, the conclusion above should be obvious.

However, we have overlooked one big problem here, namely:

estimatedSize = map.estimateSize()

Why is this a big problem? As we said earlier, estimateSize is only an approximation, so the actual memory used may end up far larger than estimated.

Specifically, see org.apache.spark.util.collection.SizeTracker.

My conclusion here is:

If you have a large heap, the risk is actually higher, because estimateSize does not measure the real size every time. It works by sampling, and the sampling interval is not fixed but grows exponentially: after the first sample, the PartitionedAppendOnlyMap must go through 1.1 times as many update/insert operations before the second sample, then 1.1 * 1.1 times before the third, and so on. If your memory is large, the PartitionedAppendOnlyMap may go through hundreds of thousands of updates before the next sample recomputes the size, and the new memory pressure brought by those hundreds of thousands of updates may already have overwhelmed your GC by then.

Of course, this is a compromise, because sampling frequently is simply not affordable.
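A minimal sketch of that sampling idea (not the actual org.apache.spark.util.collection.SizeTracker, which additionally extrapolates a bytes-per-update rate between samples; the measure function stands in for the expensive real size measurement):

class SampledSizeTracker(measure: () => Long, growthRate: Double = 1.1) {
  private var numUpdates = 0L
  private var nextSampleAt = 1L
  private var lastSample = 0L

  def afterUpdate(): Unit = {
    numUpdates += 1
    if (numUpdates >= nextSampleAt) {
      lastSample = measure()                                    // expensive: walks the whole structure
      nextSampleAt = math.ceil(numUpdates * growthRate).toLong  // next sample is 1.1x further away
    }
  }

  // Between samples the estimate can go badly stale: with a large heap the gap
  // between two samples may reach hundreds of thousands of updates, which is
  // exactly the GC risk described above.
  def estimateSize: Long = lastSample
}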

If you do not want this problem, either replace the implementation of this class yourself, or set

spark.shuffle.safetyFraction=0.8

to a smaller number.
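As an illustration of that kind of tuning (the concrete values below are example assumptions of mine, not recommendations from the original article):

import org.apache.spark.SparkConf

// Lowering the fractions makes spills trigger earlier: the job runs slower but is
// less likely to OOM, which matches the "run first, optimize later" goal above.
val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.2") // default share of the heap for shuffle
  .set("spark.shuffle.safetyFraction", "0.6") // shrink from the default 0.8 for more headroom
  .set("spark.shuffle.spill", "true")         // keep spilling enabled (see the read phase below)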

Shuffle Read Memory Consumption Analysis

The entry link for Shuffle Read is:

org.apache.spark.rdd.ShuffledRDD
---> org.apache.spark.shuffle.sort.HashShuffleReader
   ---> org.apache.spark.util.collection.ExternalAppendOnlyMap
   ---> org.apache.spark.util.collection.ExternalSorter

Shuffle Read is more complex, especially pulling data from the individual nodes, but that is not the focus here. By process, the steps are:

1. Get the iterator over the data to be pulled
2. Use AppendOnlyMap/ExternalAppendOnlyMap to do the combine
3. If the keys need to be sorted, use ExternalSorter

Step 1 will be covered separately in a follow-up article. Step 3 we have already discussed in the write phase. So the focus here is step 2, the combine phase.

If you enable

spark.shuffle.spill=true

then ExternalAppendOnlyMap is used; otherwise AppendOnlyMap is used. The difference between the two is that the former spills to disk when memory is not enough, while the latter simply OOMs when memory runs out.
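As an illustration of the code path (not taken from the original article), a reduceByKey job is the typical way this combine machinery gets exercised; with spilling enabled the read side goes through ExternalAppendOnlyMap:

import org.apache.spark.{SparkConf, SparkContext}

// Illustration only. With spark.shuffle.spill=true (the default) the read-side
// combine can spill to disk instead of OOMing.
val conf = new SparkConf()
  .setAppName("combine-demo")
  .setMaster("local[2]")
  .set("spark.shuffle.spill", "true")
val sc = new SparkContext(conf)

val counts = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)             // combine happens on both the map and the read side
println(counts.collect().toMap)   // Map(a -> 3, b -> 2, c -> 1)
sc.stop()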

Here we focus on analyzing ExternalAppendOnlyMap.

The object ExternalAppendOnlyMap uses to buffer data in memory is:

private var currentMap = new SizeTrackingAppendOnlyMap[K, C]

If the currentMap cannot obtain the memory it asks for, the spill action is triggered. The logic for deciding whether memory is sufficient is exactly the same as in Shuffle Write.

When the combine is done, ExternalAppendOnlyMap returns an iterator called ExternalIterator, whose data sources are all the spill files plus the data currently in currentMap.

Let's look inside ExternalIterator. The only memory-hogging object is this priority queue:

private val mergeHeap = new mutable.PriorityQueue[StreamBuffer]

The number of elements in mergeHeap equals the number of spill files plus one. The structure of StreamBuffer is:

private class StreamBuffer(
    val iterator: BufferedIterator[(K, C)],
    val pairs: ArrayBuffer[(K, C)])

Here iterator is just an object reference, and pairs holds the first element of the iterator (or several elements, if there is a hash collision).
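To see why the heap itself stays tiny, here is a simplified k-way merge over already-sorted streams in the same spirit as mergeHeap (none of this is Spark's code; the real ExternalIterator additionally groups equal key hashes and merges combiner values):

import scala.collection.mutable

// Each stream contributes only its buffered head element to the heap, so the heap
// holds (number of spill files + 1) small entries regardless of total data size.
def mergeSorted[K](streams: Seq[Iterator[(K, Int)]])(implicit ord: Ordering[K]): Iterator[(K, Int)] = {
  case class StreamBuf(it: BufferedIterator[(K, Int)])
  // PriorityQueue is a max-heap, so reverse the ordering to pop the smallest head key first
  val heapOrd = Ordering.by((s: StreamBuf) => s.it.head._1)(ord).reverse
  val mergeHeap = new mutable.PriorityQueue[StreamBuf]()(heapOrd)
  streams.map(_.buffered).filter(_.hasNext).foreach(s => mergeHeap.enqueue(StreamBuf(s)))

  new Iterator[(K, Int)] {
    def hasNext: Boolean = mergeHeap.nonEmpty
    def next(): (K, Int) = {
      val s = mergeHeap.dequeue()
      val kv = s.it.next()
      if (s.it.hasNext) mergeHeap.enqueue(s) // re-insert with its new head element
      kv
    }
  }
}

Calling it with the sorted record iterator of each spill file plus the in-memory map reproduces the merge behavior described here.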

So mergeHeap should not take up much memory. Now let's see how much memory this phase occupies. Still assuming the core number is C, we have:

C * 32k + C * mergeHeap + C * SizeTrackingAppendOnlyMap

So the largest part of the memory here is still the SizeTrackingAppendOnlyMap, and likewise its size obeys the following formula:

C * SizeTrackingAppendOnlyMap < ExecutorHeapMemory * 0.2 * 0.8

The purpose of ExternalAppendOnlyMap is to do the combine; then, if ordering is required, ExternalSorter is brought in to complete the sort.

After the Shuffle Write analysis above, everyone should have some understanding of ExternalSorter by now; at this point the memory occupied is at most the following:

C * SizeTrackingAppendOnlyMap + C * PartitionedAppendOnlyMap

But even so, because they share the same ShuffleMemoryManager, the theoretical upper bound is still only:

C * (SizeTrackingAppendOnlyMap + PartitionedAppendOnlyMap) < ExecutorHeapMemory * 0.2 * 0.8

Having analyzed this far, we can summarize: in the Shuffle Read phase, if memory is not enough, two stages spill to disk, namely the combine stage and the sort stage; correspondingly they spill small files and then read them back. If the spill function is enabled, memory usage in the Shuffle Read phase is basically kept within ExecutorHeapMemory * 0.2 * 0.8.

Postscript

If you are interested in the disk files produced by sort shuffle, you can also read the article Spark Shuffle Write Phase Disk File Analysis.

Author: William Zhu (祝威廉)
Link: http://www.jianshu.com/p/c83bb237caa8
Source: Jianshu
Copyright belongs to the author. For commercial reprinting, please contact the author for authorization; for non-commercial reprinting, please credit the source.
