I believe many people, like me, have often seen log lines like the following printed to the console:
INFO ExternalAppendOnlyMap: Thread 94 spilling in-memory map of 63.2 MB to disk (7 times so far)
After digging into the source code, here is a summary of what I found:
AppendOnlyMap and ExternalAppendOnlyMap are used widely in Spark, for example in the reduce phase of the shuffle for joins and in combineByKey operations.
AppendOnlyMap
AppendOnlyMap is a map implemented by Spark itself; data can only be added, never removed. It uses open addressing with quadratic probing, which, compared with java.util.HashMap and similar structures, saves space and improves performance.
In the underlying array, data(2*i) holds key_i and data(2*i + 1) holds value_i, i.e. each kv pair occupies two adjacent positions.
growThreshold = LOAD_FACTOR * capacity; when the number of inserted elements exceeds this value, the array grows: the capacity doubles and all kv pairs are rehashed into new positions.
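To make the layout concrete, here is a small toy sketch (not Spark's source; all names are illustrative) of a flat array that stores keys and values side by side and doubles once the load factor is exceeded:

```scala
// Toy model of the flat-array layout (illustrative only, not Spark's AppendOnlyMap):
// data(2*i) holds key_i, data(2*i + 1) holds value_i, and the array doubles once the
// number of entries exceeds LOAD_FACTOR * capacity.
class FlatArrayMapSketch(initialCapacity: Int = 64) {
  private val LOAD_FACTOR = 0.7
  private var capacity = initialCapacity
  private var curSize = 0
  private var data = new Array[AnyRef](2 * capacity)          // kv pairs stored side by side
  private var growThreshold = (LOAD_FACTOR * capacity).toInt  // grow once curSize exceeds this

  /** Place a pair at logical slot `pos` (collision handling is omitted in this sketch). */
  def put(pos: Int, key: AnyRef, value: AnyRef): Unit = {
    data(2 * pos) = key
    data(2 * pos + 1) = value
    curSize += 1
    if (curSize > growThreshold) grow()
  }

  private def grow(): Unit = {
    capacity *= 2                                   // double the capacity
    val newData = new Array[AnyRef](2 * capacity)
    // the real implementation rehashes every existing pair into newData here;
    // a plain copy is used as a placeholder in this sketch
    Array.copy(data, 0, newData, 0, data.length)
    data = newData
    growThreshold = (LOAD_FACTOR * capacity).toInt
  }
}
```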
Main methods:
1. apply
This is what runs when you call map(key) to look up a value. The slot is located from the hash of the key (specifically Murmur3, murmur3_32); if the key stored at the target slot is not the key being looked up, quadratic probing continues the search until the key (or an empty slot) is found.
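A sketch of this lookup path, closely modeled on the behaviour described above but simplified (the real code rehashes the key with Murmur3; this sketch just uses hashCode, and the surrounding class is an assumption of the sketch):

```scala
// Simplified lookup with quadratic probing over the flat key/value array.
class ProbingLookupSketch[K, V](capacity: Int) {   // capacity must be a power of 2
  private val mask = capacity - 1
  private val data = new Array[AnyRef](2 * capacity)

  def apply(key: K): V = {
    val k = key.asInstanceOf[AnyRef]
    var pos = k.hashCode & mask                      // the real code applies a Murmur3 rehash first
    var i = 1
    while (true) {
      val curKey = data(2 * pos)
      if (k.eq(curKey) || k.equals(curKey)) {
        return data(2 * pos + 1).asInstanceOf[V]     // slot holds our key: return its value
      } else if (curKey.eq(null)) {
        return null.asInstanceOf[V]                  // hit an empty slot: the key is absent
      } else {
        pos = (pos + i) & mask                       // quadratic probing: the step grows each round
        i += 1
      }
    }
    null.asInstanceOf[V]                             // unreachable; keeps the signature satisfied
  }
}
```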
2. update
Locates the slot of the given key and overwrites the old value with the new one.
3. changeValue
This is the method Spark actually uses the most.
The core of the method is the externally supplied updateFunc(hadValue, oldValue): when hadValue is false it calls createCombiner(v) to produce a new value, and when hadValue is true it calls mergeValue(oldValue, v) to fold v into oldValue.
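The following sketch shows how such an updateFunc is typically built by the caller (for example a combineByKey-style aggregation). It is illustrative only; makeUpdateFunc is a made-up helper, while createCombiner and mergeValue stand for the user-supplied aggregator functions:

```scala
object UpdateFuncSketch {
  // Build the (hadValue, oldValue) => newValue function passed to changeValue.
  def makeUpdateFunc[V, C](createCombiner: V => C, mergeValue: (C, V) => C)(v: V): (Boolean, C) => C =
    (hadValue: Boolean, oldValue: C) =>
      if (hadValue) mergeValue(oldValue, v)   // the key is already present: fold v into its combiner
      else createCombiner(v)                  // first occurrence of the key: create a new combiner

  // Example: summing Int values per key.
  // val update = makeUpdateFunc[Int, Int](v => v, _ + _)(value)
  // map.changeValue(key, update)
}
```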
4. iterator
This method is typically called inside an RDD's compute, serving as the iterator over the RDD's data, so downstream RDDs can use it as their data source. The main work is done in the hasNext and next methods, both of which rely on the nextValue method.
nextValue scans forward from pos (initially 0) and returns the pair at the first slot where data(2*pos) != null.
hasNext simply checks nextValue() != null.
next returns the value from nextValue() and then advances pos by 1.
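A sketch of this iterator, simplified from the description above (the data array and capacity are passed in explicitly here; in the real code they are fields of AppendOnlyMap, and the null-key special case is ignored):

```scala
object AppendOnlyMapIteratorSketch {
  def iteratorOver[K, V](data: Array[AnyRef], capacity: Int): Iterator[(K, V)] =
    new Iterator[(K, V)] {
      private var pos = 0

      // Scan forward from pos and return the pair at the first non-empty slot, or null when done.
      private def nextValue(): (K, V) = {
        while (pos < capacity) {
          if (data(2 * pos) != null) {
            return (data(2 * pos).asInstanceOf[K], data(2 * pos + 1).asInstanceOf[V])
          }
          pos += 1
        }
        null
      }

      override def hasNext: Boolean = nextValue() != null
      override def next(): (K, V) = {
        val value = nextValue()
        if (value == null) throw new NoSuchElementException("End of iterator")
        pos += 1               // step past the slot that was just returned
        value
      }
    }
}
```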
5. destructiveSortedIterator
Turns the map into a plain array and sorts the kv pairs by key; the map structure is destroyed in the process. It is mainly used to dump the map to disk in sorted order when spilling (external sorting).
The idea of the implementation is to keep moving the entries toward the left end of the array, so that each kv pair, which originally occupied two slots (key, value), now occupies a single slot as a tuple, and then to sort the compacted array by key. The sort uses KCComparator, which orders entries by the hashCode of the key. Finally an iterator is created whose hasNext and next simply walk the compacted array.
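The idea can be sketched as follows (simplified: no null-key handling, and the comparator is inlined; in Spark the comparator is the KCComparator mentioned above):

```scala
import java.util.{Arrays, Comparator}

// Sketch of the destructive sort: compact the (key, value) pairs to the front of the array
// as single tuples (destroying the hash layout), sort them by key hashCode, then iterate.
object DestructiveSortSketch {
  def destructiveSortedIterator[K, V](data: Array[AnyRef], capacity: Int): Iterator[(K, V)] = {
    // 1. Move each pair to the left end: two slots (key, value) become one tuple slot.
    var keyIndex, newIndex = 0
    while (keyIndex < capacity) {
      if (data(2 * keyIndex) != null) {
        data(newIndex) = (data(2 * keyIndex), data(2 * keyIndex + 1))
        newIndex += 1
      }
      keyIndex += 1
    }
    // 2. Sort the compacted prefix by the hashCode of the key (the KCComparator idea).
    val byKeyHash = new Comparator[AnyRef] {
      def compare(a: AnyRef, b: AnyRef): Int = {
        val ka = a.asInstanceOf[(AnyRef, AnyRef)]._1
        val kb = b.asInstanceOf[(AnyRef, AnyRef)]._1
        ka.hashCode().compareTo(kb.hashCode())
      }
    }
    Arrays.sort(data, 0, newIndex, byKeyHash)
    // 3. Iterate only the first newIndex slots.
    data.iterator.take(newIndex).map(_.asInstanceOf[(K, V)])
  }
}
```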
Earlier versions of Spark used AppendOnlyMap to aggregate data in the shuffle reduce phase. This works fine when the amount of data is small, but with large data it consumes a lot of memory and may eventually cause an OOM. So, starting with Spark 0.9, ExternalAppendOnlyMap was introduced to replace AppendOnlyMap.
ExternalAppendOnlyMap
Note: ExternalAppendOnlyMap is used when spark.shuffle.spill=true (the default); when it is false, AppendOnlyMap is used instead.
ExternalAppendOnlyMap also maintains a SizeTrackingAppendOnlyMap (which inherits from AppendOnlyMap) in memory and spills it to disk when the map exceeds a certain size. So in the end, an ExternalAppendOnlyMap manages one in-memory map (currentMap) and multiple on-disk maps (spilledMaps).
Main properties and parameters:
currentMap
A SizeTrackingAppendOnlyMap (inherits from AppendOnlyMap); this is the in-memory map of ExternalAppendOnlyMap.
spilledMaps
A new ArrayBuffer[DiskMapIterator]; each DiskMapIterator points to the data of one spill file on disk.
maxMemoryThreshold
This value bounds the total size of the currentMaps of all tasks running concurrently on a worker, i.e. num(running tasks) * size(each task's currentMap). It is determined by spark.shuffle.memoryFraction and spark.shuffle.safetyFraction and is calculated as follows:
```scala
private val maxMemoryThreshold = {
  val memoryFraction = sparkConf.getDouble("spark.shuffle.memoryFraction", 0.3)
  val safetyFraction = sparkConf.getDouble("spark.shuffle.safetyFraction", 0.8)
  // roughly the worker's memory * 0.24 with the default settings
  (Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
}
```
insert
The main method for inserting kv pairs.
shouldSpill answers whether there is enough remaining memory for currentMap to grow: if there is, currentMap is allowed to double in size; if not, currentMap is spilled to disk.
First, insert decides whether the shouldSpill check is needed at all. The condition is:
numPairsInMemory > trackMemoryThreshold && currentMap.atGrowThreshold
numPairsInMemory is the number of kv pairs inserted so far, and trackMemoryThreshold is a fixed value of 1000; in other words, the first 1000 elements go straight into currentMap and can never trigger a spill.
Since currentMap initially holds 64 kv pairs, it grows several times before numPairsInMemory > trackMemoryThreshold. Once numPairsInMemory > trackMemoryThreshold, the shouldSpill check is performed each time currentMap reaches its growThreshold.
- When this condition is false, the shouldSpill check is not needed; currentMap.changeValue(key, update) simply updates currentMap.
- When this condition is true, the shouldSpill check is performed to decide whether to spill to disk.
The shouldSpill check works as follows: based on maxMemoryThreshold and the sizes of the currentMaps of the other tasks currently running, determine whether there is enough memory for this currentMap to double in size.
```scala
val threadId = Thread.currentThread().getId
val previouslyOccupiedMemory = shuffleMemoryMap.get(threadId)
val availableMemory = maxMemoryThreshold -
  (shuffleMemoryMap.values.sum - previouslyOccupiedMemory.getOrElse(0L))

// Assume map growth factor is 2x
shouldSpill = availableMemory < mapSize * 2
```
- shouldSpill = false: set shuffleMemoryMap(threadId) = mapSize * 2, i.e. reserve twice the current size for this task; the actual growth of currentMap happens afterwards, inside currentMap.changeValue.
- shouldSpill = true: perform the spill (a self-contained sketch of this reserve-or-spill accounting follows below).
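Putting these pieces together, here is a self-contained toy model of the decision: a shared pool (maxMemoryThreshold) is split among concurrently running tasks via shuffleMemoryMap, and a task may double its map only if the pool still has room, otherwise it spills. The names mirror the snippet above, but this is an illustration of the accounting, not Spark's source:

```scala
import scala.collection.mutable

object SpillDecisionSketch {
  val maxMemoryThreshold: Long = 240L * 1024 * 1024             // e.g. roughly 0.24 of a 1 GB heap
  val shuffleMemoryMap = mutable.HashMap[Long, Long]()          // threadId -> reserved bytes

  /** Returns true if the caller should spill instead of growing its in-memory map. */
  def reserveOrSpill(threadId: Long, mapSize: Long): Boolean = shuffleMemoryMap.synchronized {
    val previouslyOccupied = shuffleMemoryMap.get(threadId)
    val availableMemory = maxMemoryThreshold -
      (shuffleMemoryMap.values.sum - previouslyOccupied.getOrElse(0L))
    val shouldSpill = availableMemory < mapSize * 2             // assume the map doubles when it grows
    if (!shouldSpill) {
      shuffleMemoryMap(threadId) = mapSize * 2                  // reserve room for the doubled map
    }
    shouldSpill
  }
}
```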
spill
Writes currentMap to disk. The specific steps are as follows (a rough sketch of the write loop follows the list):
1. Using currentMap.destructiveSortedIterator(KCComparator), turn currentMap into an array sorted by the hashCode of the key, wrapped in an iterator.
2. Traverse the iterator from step 1 and write the kv pairs to a DiskBlockObjectWriter. Each time the number of records written (objectsWritten) reaches serializerBatchSize (the number of records per batch, controlled by spark.shuffle.spill.batchSize, default 10000; setting it too small hurts write performance), writer.flush() writes the accumulated data to the file. The on-disk size of each flushed batch is appended to batchSizes, which records the data size of every batch so it can be read back later. (Each batch is written to disk with compressed serialization, so the reader must read back exactly the amount of data that was written in order to decompress and deserialize it correctly; this is why batchSizes matters.)
3. Repeat step 2 until all of currentMap's data has been written to the file.
4. Generate a DiskMapIterator (used to read the file's data back) and append it to spilledMaps; batchSizes is handed to the DiskMapIterator so it can read the data back from the file.
5. Reset work:
- Create a new, empty currentMap.
- Set shuffleMemoryMap(Thread.currentThread().getId) = 0, i.e. record that currentMap now occupies 0 bytes.
- Reset numPairsInMemory to 0.
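Here is a rough, self-contained sketch of the write loop in steps 2 and 3, with plain Java I/O standing in for Spark's DiskBlockObjectWriter and serializer (all names and the record format are illustrative assumptions):

```scala
import java.io.{DataOutputStream, File, FileOutputStream}
import scala.collection.mutable.ArrayBuffer

// Stream the sorted pairs to a file in batches of `serializerBatchSize` records and remember
// how many bytes each batch occupied (batchSizes), so a reader can later process batch by batch.
object SpillSketch {
  def spillToFile[K, V](sorted: Iterator[(K, V)], file: File, serializerBatchSize: Int): ArrayBuffer[Long] = {
    val batchSizes = new ArrayBuffer[Long]          // bytes written per batch, needed for reading back
    val out = new DataOutputStream(new FileOutputStream(file))
    var objectsWritten = 0
    var batchStart = 0L
    try {
      for ((k, v) <- sorted) {
        out.writeUTF(s"$k=$v")                      // stand-in for the real serializer
        objectsWritten += 1
        if (objectsWritten == serializerBatchSize) {
          out.flush()
          batchSizes += out.size().toLong - batchStart   // record this batch's size on disk
          batchStart = out.size().toLong
          objectsWritten = 0
        }
      }
      if (objectsWritten > 0) {                     // flush the final, possibly partial batch
        out.flush()
        batchSizes += out.size().toLong - batchStart
      }
    } finally {
      out.close()
    }
    batchSizes
  }
}
```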
iterator
This method is typically called inside an RDD's compute, serving as the iterator over the RDD's data, so downstream RDDs can use it as their data source.
- When spilledMaps is empty, i.e. there is only currentMap and nothing was ever spilled to disk, currentMap.iterator is returned directly.
- When spilledMaps is not empty, an external merge is performed by ExternalIterator (similar to the sort phase of Hadoop's reduce, or HBase's combined traversal of the memstore and storefiles).
ExternalIterator
The main idea of the external merge: every input iterator is already sorted by key.hashCode, and a priority queue holds these iterators. hasNext checks whether the priority queue still has elements; next returns the combined value of all values belonging to the key with the currently smallest hashCode, i.e. (minKey, minCombiner).
The specific implementation is as follows:
1. The input iterators: the one produced by currentMap.destructiveSortedIterator and the DiskMapIterators in spilledMaps.
2. The priority queue is mergeHeap = new mutable.PriorityQueue[StreamBuffer]. The main parts of StreamBuffer are as follows:
```scala
private case class StreamBuffer(iterator: Iterator[(K, C)], pairs: ArrayBuffer[(K, C)])
  extends Comparable[StreamBuffer] {

  def isEmpty = pairs.length == 0

  // Invalid if there are no more pairs in this stream
  def minKeyHash = {
    assert(pairs.length > 0)
    pairs.head._1.hashCode()
  }

  override def compareTo(other: StreamBuffer): Int = {
    // descending order because mutable.PriorityQueue dequeues the max, not the min
    if (other.minKeyHash < minKeyHash) -1
    else if (other.minKeyHash == minKeyHash) 0
    else 1
  }
}
```
A StreamBuffer holds one input iterator together with the kv pairs pulled from it that share the same key.hashCode; its compareTo determines its position in mergeHeap. Because all pairs in a StreamBuffer have the same key.hashCode, minKeyHash can take it from any entry (pairs.head). Note that two different keys with the same hashCode will sit in the same StreamBuffer, i.e. the keys inside a buffer are not necessarily equal; that is why mergeIfKeyExists below checks whether the keys really match.
3. Each iterator is turned into a StreamBuffer. This requires pulling out all kv pairs that share the iterator's smallest key hash, which is done by the getMorePairs method.
```scala
private def getMorePairs(it: Iterator[(K, C)]): ArrayBuffer[(K, C)] = {
  val kcPairs = new ArrayBuffer[(K, C)]
  if (it.hasNext) {
    var kc = it.next()
    kcPairs += kc
    val minHash = kc._1.hashCode()
    while (it.hasNext && kc._1.hashCode() == minHash) {
      kc = it.next()
      kcPairs += kc
    }
  }
  kcPairs
}
```
The method is simple: the first pair's key.hashCode is the smallest hash, minHash (because the iterator is already sorted by key.hashCode), and the method then collects all kv pairs whose hash equals minHash.
4. hasNext: whether the mergeHeap priority queue is non-empty.
5. next: the core logic of the external merge (a self-contained sketch of the whole merge follows these steps).
a. mergeHeap.dequeue() removes the StreamBuffer at the top of the queue and adds it to mergedBuffers (mergedBuffers records the dequeued StreamBuffers so they can be re-enqueued in the next round); from it we obtain minHash as well as (minKey, minCombiner).
b. The remaining StreamBuffers are then checked for kv pairs with the same minHash, which are merged into (minKey, minCombiner): StreamBuffers whose minKeyHash equals minHash keep being dequeued from the top of the queue and added to mergedBuffers, and for each of them the matching value is merged in; the merge itself calls mergeIfKeyExists.
```scala
private def mergeIfKeyExists(key: K, baseCombiner: C, buffer: StreamBuffer): C = {
  var i = 0
  while (i < buffer.pairs.length) {
    val (k, c) = buffer.pairs(i)
    if (k == key) {
      buffer.pairs.remove(i)
      // baseCombiner is the minCombiner from step b; mergeCombiners is used here because
      // the values are already combiners, produced by updateFunc when inserting into currentMap
      return mergeCombiners(baseCombiner, c)
    }
    i += 1
  }
  baseCombiner
}
```
Only the kv pair whose key equals minKey is merged with minCombiner and removed from its StreamBuffer; all other pairs stay where they are.
c. Traverse mergedBuffers, i.e. the StreamBuffers dequeued in this round: if one of them has no kv pairs left, call getMorePairs to fetch its next batch of pairs, then enqueue the StreamBuffer back into mergeHeap so the heap is re-ordered. If a StreamBuffer still has no kv pairs after that, its iterator has been exhausted and it is not added back to mergeHeap.
d. Return (minKey, minCombiner).
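To see how these steps fit together end to end, here is a self-contained toy version of the whole merge. It follows the structure of ExternalIterator described above, but it is a simplified illustration (no disk I/O, and the helper names only loosely follow the quoted code):

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

object ExternalMergeSketch {
  // Merge several iterators that are each already sorted by key.hashCode, combining the
  // values of equal keys with mergeCombiners.
  def merge[K, C](sources: Seq[Iterator[(K, C)]], mergeCombiners: (C, C) => C): Iterator[(K, C)] =
    new Iterator[(K, C)] {

      // One input stream plus the pairs buffered for its current minimum key hash.
      case class StreamBuffer(it: BufferedIterator[(K, C)], pairs: ArrayBuffer[(K, C)]) {
        def minKeyHash: Int = pairs.head._1.hashCode()
      }

      // PriorityQueue pops the maximum, so reverse the ordering to pop the smallest hash first.
      private val mergeHeap = new mutable.PriorityQueue[StreamBuffer]()(
        Ordering.by((b: StreamBuffer) => b.minKeyHash).reverse)

      // Pull every pair sharing the stream's next (smallest) key hash.
      private def getMorePairs(it: BufferedIterator[(K, C)]): ArrayBuffer[(K, C)] = {
        val pairs = new ArrayBuffer[(K, C)]
        if (it.hasNext) {
          val minHash = it.head._1.hashCode()
          while (it.hasNext && it.head._1.hashCode() == minHash) pairs += it.next()
        }
        pairs
      }

      sources.foreach { s =>
        val it = s.buffered
        val pairs = getMorePairs(it)
        if (pairs.nonEmpty) mergeHeap.enqueue(StreamBuffer(it, pairs))
      }

      override def hasNext: Boolean = mergeHeap.nonEmpty

      override def next(): (K, C) = {
        // a. take the buffer with the smallest key hash
        val minBuffer = mergeHeap.dequeue()
        val minHash = minBuffer.minKeyHash
        var (minKey, minCombiner) = minBuffer.pairs.remove(0)

        // b. merge in the same key from every other buffer that shares this hash
        val mergedBuffers = ArrayBuffer(minBuffer)
        while (mergeHeap.nonEmpty && mergeHeap.head.minKeyHash == minHash) {
          val other = mergeHeap.dequeue()
          val idx = other.pairs.indexWhere(_._1 == minKey)
          if (idx >= 0) minCombiner = mergeCombiners(minCombiner, other.pairs.remove(idx)._2)
          mergedBuffers += other
        }

        // c. refill exhausted buffers and push every visited buffer back onto the heap
        mergedBuffers.foreach { b =>
          if (b.pairs.isEmpty) b.pairs ++= getMorePairs(b.it)
          if (b.pairs.nonEmpty) mergeHeap.enqueue(b)
        }

        // d. return the merged result for the smallest key
        (minKey, minCombiner)
      }
    }
}
```

For example, ExternalMergeSketch.merge(Seq(it1, it2), (a: Int, b: Int) => a + b) would sum the values of equal keys across the two inputs.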
DiskMapIterator
Reads data from a disk spill file and exposes it as an iterator.
hasNext: whether the end of the file has been reached.
next: first calls nextBatchStream() to read batchSizes.remove(0) bytes, i.e. the current batch, into bufferedStream; each subsequent next then takes one kv pair from that buffer, and when the buffer is exhausted, nextBatchStream() is called again to read the next batch from the file.
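As an illustration, here is a toy reader matching the spill sketch shown earlier: it consumes the file batch by batch, where each batch is exactly batchSizes(i) bytes, which is why batchSizes had to be recorded at write time. The names and the writeUTF record format are assumptions of the sketch, not Spark's actual format:

```scala
import java.io.{ByteArrayInputStream, DataInputStream, EOFException, File, FileInputStream}
import scala.collection.mutable.ArrayBuffer

object DiskMapIteratorSketch {
  def read(file: File, batchSizes: ArrayBuffer[Long]): Iterator[String] = new Iterator[String] {
    private val in = new FileInputStream(file)
    private var batch: DataInputStream = nextBatchStream()
    private var nextRecord: String = readRecord()

    // Pull the next batch off the file: read exactly batchSizes.remove(0) bytes into memory.
    private def nextBatchStream(): DataInputStream = {
      if (batchSizes.isEmpty) null
      else {
        val bytes = new Array[Byte](batchSizes.remove(0).toInt)
        var read = 0
        while (read < bytes.length) {
          val n = in.read(bytes, read, bytes.length - read)
          if (n < 0) throw new EOFException("spill file ended early")
          read += n
        }
        new DataInputStream(new ByteArrayInputStream(bytes))
      }
    }

    // Read one record from the current batch, rolling over to the next batch at its end.
    private def readRecord(): String = {
      if (batch == null) { in.close(); return null }
      try batch.readUTF()
      catch { case _: EOFException => batch = nextBatchStream(); readRecord() }
    }

    override def hasNext: Boolean = nextRecord != null
    override def next(): String = {
      val r = nextRecord
      nextRecord = readRecord()
      r
    }
  }
}
```

With this, the pieces fit together: spill writes sorted batches plus their sizes, DiskMapIterator replays them, and ExternalIterator merges everything back into a single aggregated stream.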