Overview
In distributed real-time computing, making a framework or engine efficient enough to access and process large amounts of in-memory data is a very hard problem. Flink handles it very well, and its self-managed memory design arguably deserves more attention than it gets. I have recently been studying the Flink source code, so I am writing two articles on the design of Flink's memory management.
The highlight of Flink's memory management is that, although Flink is written in Java and Scala, languages that follow the JVM specification and depend on the JVM to execute, it manages memory on its own rather than relying entirely on the JVM's memory management. The advantage is flexibility: in big data scenarios it avoids the performance fluctuations caused by frequent, uncontrolled GC, and to some extent it escapes the limitations of the JVM. It is an approach worth considering in development.
We divide Flink's memory design into two parts, following the package structure:
- Basic data structures (package: org.apache.flink.core.memory)
- Memory management mechanism (package: org.apache.flink.runtime.memory)
We will cover them separately. This article focuses on the basic data structures; the memory management mechanism is left to a follow-up article.
Class diagram of all classes in the package: (figure not included)
Among them, MemorySegment, HeapMemorySegment, and HybridMemorySegment are the three most critical classes, and we will focus on them.
Memory types abstracted by Flink
Flink abstracts the memory it manages into two types, distinguished mainly by where the memory lives:
- HEAP: JVM heap memory
- OFF_HEAP: non-heap memory
These are defined in Flink as an enumeration type: MemoryType.
MemorySegment
The memory managed by Flink is abstracted as a data structure: MemorySegment.
Flink provides two implementations of it:
- HeapMemorySegment: a memory segment whose managed memory is part of the JVM heap
- HybridMemorySegment: a hybrid (on-heap or off-heap) memory segment, whose memory may or may not be JVM heap memory
Related fields of MemorySegment:
- UNSAFE: used to operate on heap and non-heap memory; it is the JVM's unsafe (unchecked) API
- BYTE_ARRAY_BASE_OFFSET: the offset of the first element of a byte array, relative to the start of the byte array object
- LITTLE_ENDIAN: a boolean, whether the byte order is little-endian
- heapMemory: for heap memory, a reference to the byte array being accessed; null if the memory is not heap memory
- address: the address of the data relative to the byte array object (if heapMemory is null, this may be the absolute address of off-heap memory, as explained later)
- addressLimit: marks the end of the addressable range (address + size)
- size: the number of bytes in the memory segment
LITTLE_ENDIAN holds the byte order of the current platform. It is a boolean, and many of the put/get operations that follow must check whether the order is big-endian or little-endian.
If byte order is unfamiliar to you, a quick search will turn up plenty of explanations.
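As a minimal sketch of why the byte order matters, the following example reads the same eight bytes in both orders using only the JDK's ByteBuffer. The class ByteOrderDemo and its methods are hypothetical names for illustration and are not part of Flink; readLongBigEndian only mimics the idea of converting a natively read value to big-endian when the platform is little-endian.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {

    /** True when the platform stores the least significant byte first. */
    static final boolean LITTLE_ENDIAN = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;

    /** Reads a long from the first 8 bytes using the given byte order. */
    static long readLong(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getLong();
    }

    /** Reads in native order, then swaps the bytes if the platform is little-endian. */
    static long readLongBigEndian(byte[] bytes) {
        long nativeValue = readLong(bytes, ByteOrder.nativeOrder());
        return LITTLE_ENDIAN ? Long.reverseBytes(nativeValue) : nativeValue;
    }

    public static void main(String[] args) {
        byte[] bytes = {0, 0, 0, 0, 0, 0, 0, 1};
        System.out.println(readLong(bytes, ByteOrder.BIG_ENDIAN));    // 1
        System.out.println(readLong(bytes, ByteOrder.LITTLE_ENDIAN)); // 72057594037927936 (1L << 56)
        System.out.println(readLongBigEndian(bytes));                 // 1
    }
}
```

The same bytes yield two very different numbers, which is why an accessor that promises a fixed byte order has to know the native order first.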
Turning to the code, MemorySegment provides two constructors: one for on-heap memory and one for off-heap memory.
It also provides a large number of get/put methods, and most of these getXXX/putXXX methods call UNSAFE.getXXX/UNSAFE.putXXX directly or indirectly. Because they work for both memory types, they are implemented once in the shared MemorySegment class.
These are only some of the methods, of course.
Accesses that are specific to one memory type are implemented in the two subclasses.
Three other methods in the MemorySegment class are worth a closer look:
    public final void copyTo(int offset, MemorySegment target, int targetOffset, int numBytes) {
        final byte[] thisHeapRef = this.heapMemory;
        final byte[] otherHeapRef = target.heapMemory;
        final long thisPointer = this.address + offset;
        final long otherPointer = target.address + targetOffset;

        if ((numBytes | offset | targetOffset) >= 0 &&
                thisPointer <= this.addressLimit - numBytes &&
                otherPointer <= target.addressLimit - numBytes) {
            UNSAFE.copyMemory(thisHeapRef, thisPointer, otherHeapRef, otherPointer, numBytes);
        }
        else if (this.address > this.addressLimit) {
            throw new IllegalStateException("This memory segment has been freed.");
        }
        else if (target.address > target.addressLimit) {
            throw new IllegalStateException("Target memory segment has been freed.");
        }
        else {
            throw new IndexOutOfBoundsException(
                    String.format("offset=%d, targetOffset=%d, numBytes=%d, address=%d, targetAddress=%d",
                            offset, targetOffset, numBytes, this.address, target.address));
        }
    }
This is a bulk copy method. It copies numBytes bytes, starting at offset in the current memory segment, into the target memory segment starting at targetOffset.
    public final int compare(MemorySegment seg2, int offset1, int offset2, int len) {
        while (len >= 8) {
            long l1 = this.getLongBigEndian(offset1);
            long l2 = seg2.getLongBigEndian(offset2);

            if (l1 != l2) {
                return (l1 < l2) ^ (l1 < 0) ^ (l2 < 0) ? -1 : 1;
            }

            offset1 += 8;
            offset2 += 8;
            len -= 8;
        }
        while (len > 0) {
            int b1 = this.get(offset1) & 0xff;
            int b2 = seg2.get(offset2) & 0xff;
            int cmp = b1 - b2;
            if (cmp != 0) {
                return cmp;
            }
            offset1++;
            offset2++;
            len--;
        }
        return 0;
    }
This is a hand-written comparison method. It compares len bytes starting at offset1 in the current memory segment with len bytes starting at offset2 in seg2.
It contains two while loops:
The first loop compares eight bytes at a time. While at least 8 bytes remain, it reads a long in big-endian order from each segment at the current offsets and compares the two values as unsigned numbers. If they are equal, both offsets advance by 8 bytes (one long), len decreases by 8, and the loop repeats.
The second loop handles the remaining bytes, fewer than eight, and compares them one byte at a time as unsigned values.
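The two-loop idea can be tried on plain byte arrays. The sketch below is an illustration under assumptions, not Flink code: the class ChunkedCompare is a hypothetical name, and ByteBuffer.getLong (big-endian by default) stands in for getLongBigEndian.

```java
import java.nio.ByteBuffer;

final class ChunkedCompare {

    /** Lexicographically compares len unsigned bytes of a (from offA) with those of b (from offB). */
    static int compare(byte[] a, int offA, byte[] b, int offB, int len) {
        ByteBuffer bufA = ByteBuffer.wrap(a); // a ByteBuffer reads big-endian by default
        ByteBuffer bufB = ByteBuffer.wrap(b);

        // First loop: eight bytes at a time, compared as unsigned 64-bit values.
        while (len >= 8) {
            long l1 = bufA.getLong(offA);
            long l2 = bufB.getLong(offB);
            if (l1 != l2) {
                // The signed result is flipped when exactly one operand is negative.
                return (l1 < l2) ^ (l1 < 0) ^ (l2 < 0) ? -1 : 1;
            }
            offA += 8;
            offB += 8;
            len -= 8;
        }

        // Second loop: the remaining (fewer than eight) bytes, one at a time.
        while (len > 0) {
            int b1 = a[offA] & 0xFF;
            int b2 = b[offB] & 0xFF;
            if (b1 != b2) {
                return b1 - b2;
            }
            offA++;
            offB++;
            len--;
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        byte[] y = {1, 2, 3, 4, 5, 6, 7, 8, 10};
        System.out.println(compare(x, 0, y, 0, 9)); // negative: the ninth byte decides
    }
}
```

Reading big-endian is what makes the long comparison agree with a byte-by-byte comparison: the first byte in memory becomes the most significant byte of the long.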
    public final void swapBytes(byte[] tempBuffer, MemorySegment seg2, int offset1, int offset2, int len) {
        if ((offset1 | offset2 | len | (tempBuffer.length - len)) >= 0) {
            final long thisPos = this.address + offset1;
            final long otherPos = seg2.address + offset2;

            if (thisPos <= this.addressLimit - len && otherPos <= seg2.addressLimit - len) {
                // this -> temp buffer
                UNSAFE.copyMemory(this.heapMemory, thisPos, tempBuffer, BYTE_ARRAY_BASE_OFFSET, len);

                // other -> this
                UNSAFE.copyMemory(seg2.heapMemory, otherPos, this.heapMemory, thisPos, len);

                // temp buffer -> other
                UNSAFE.copyMemory(tempBuffer, BYTE_ARRAY_BASE_OFFSET, seg2.heapMemory, otherPos, len);
                return;
            }
            else if (this.address > this.addressLimit) {
                throw new IllegalStateException("This memory segment has been freed.");
            }
            else if (seg2.address > seg2.addressLimit) {
                throw new IllegalStateException("Other memory segment have been freed.");
            }
        }

        // index is in fact invalid
        throw new IndexOutOfBoundsException(
                String.format("offset1=%d, offset2=%d, len=%d, bufferSize=%d, address1=%d, address2=%d",
                        offset1, offset2, len, tempBuffer.length, this.address, seg2.address));
    }
This method swaps a range of data between two memory segments. Apart from the bounds checks, it is an ordinary swap through a temporary buffer, with UNSAFE.copyMemory taking the place of the assignment operator.
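The same three-step swap can be sketched on ordinary byte arrays, with System.arraycopy standing in for UNSAFE.copyMemory. SwapDemo is a hypothetical class for illustration only; unlike the method above, it leaves bounds checking to System.arraycopy.

```java
import java.util.Arrays;

final class SwapDemo {

    /** Swaps len bytes between a (from offA) and b (from offB) through the temp buffer. */
    static void swapBytes(byte[] temp, byte[] a, int offA, byte[] b, int offB, int len) {
        System.arraycopy(a, offA, temp, 0, len); // a -> temp buffer
        System.arraycopy(b, offB, a, offA, len); // b -> a
        System.arraycopy(temp, 0, b, offB, len); // temp buffer -> b
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = {9, 8, 7, 6};
        swapBytes(new byte[2], a, 1, b, 0, 2);
        System.out.println(Arrays.toString(a)); // [1, 9, 8, 4]
        System.out.println(Arrays.toString(b)); // [2, 3, 7, 6]
    }
}
```

Passing the temporary buffer in as a parameter lets a caller reuse one buffer across many swaps instead of allocating a new one each time.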
Next we look at the two kinds of memory segment that Flink provides: on-heap and off-heap.
HeapMemorySegment
HeapMemorySegment is the memory segment based on JVM heap memory (on-heap), and it is Flink's earliest self-managed memory mechanism. The class holds its own reference to the byte array behind the memory segment, and its implementations of the abstract methods of MemorySegment mentioned earlier operate on that array reference. This way it gets the JVM's built-in checks (such as array bounds checks) instead of implementing additional checks itself. What does that mean? Suppose the class defines
    private byte[] memory;
and memory points to the same array as heapMemory in MemorySegment. A method can then be implemented like this:
    public final byte get(int index) {
        return this.memory[index];
    }
The JVM itself then checks that index lies between 0 and length - 1, so there is no need to validate the index against fields such as address. HybridMemorySegment, for example, implements the same method like this:
    public byte get(int index) {
        final long pos = address + index;
        if (index >= 0 && pos < addressLimit) {
            return UNSAFE.getByte(heapMemory, pos);
        }
        else if (address > addressLimit) {
            throw new IllegalStateException("segment has been freed");
        }
        else {
            // index is in fact invalid
            throw new IndexOutOfBoundsException();
        }
    }
That implementation has to check the bounds itself.
Because the memory is JVM heap memory, many methods can also rely on the JDK's own facilities, such as array copying:
    @Override
    public final void get(int index, byte[] dst, int offset, int length) {
        // system arraycopy does the boundary checks anyways, no need to check extra
        System.arraycopy(this.memory, index, dst, offset, length);
    }

    @Override
    public final void put(int index, byte[] src, int offset, int length) {
        // system arraycopy does the boundary checks anyways, no need to check extra
        System.arraycopy(src, offset, this.memory, index, length);
    }
The remaining methods are implemented conventionally and need no further comment.
HybridMemorySegment
This is the other memory segment implementation: it supports both on-heap and off-heap memory. At first glance this seems odd. An on-heap implementation already exists, so why build a hybrid rather than a pure off-heap one? And handling two different memory areas in one class looks like a recipe for confusion.
So let's look at how Flink gracefully avoids that confusion. It does so thanks to a family of methods on the unsafe operations class (sun.misc.Unsafe) provided by the JVM, each of which takes an object reference o together with an address or offset.
These methods have the following characteristics:
(1) If the object o is not null, the address that follows is interpreted as an offset relative to that object (an array, for example), and the method operates on the object's memory at that offset. Since the object is not null, this naturally covers the on-heap case.
(2) If the object o is null, the address that follows is interpreted as the absolute address of a block of memory, and the method operates directly on that block. Since o is null, the block is not JVM heap memory, which covers the off-heap case.
Recall two of the fields mentioned when we introduced the MemorySegment class: heapMemory and address.
Together, these two fields cover both of the scenarios above. Moreover, one of the MemorySegment constructors takes a parameter named offHeapAddress, which makes it plain that the constructor is meant for off-heap memory.
MemorySegment provides common implementations for specific data types, and most of them call Unsafe methods with the characteristics above. In that sense, MemorySegment itself is already hybrid.
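The two addressing modes can be tried with a small experiment. This is a sketch under assumptions rather than Flink's code: UnsafeAddressingDemo is a hypothetical class, the Unsafe instance is obtained by reflection on its theUnsafe field (a common workaround, not necessarily what Flink does), and it needs a JVM that still exposes sun.misc.Unsafe (newer JDKs may print deprecation warnings).

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnsafeAddressingDemo {

    private static final Unsafe UNSAFE = loadUnsafe();
    private static final long BYTE_ARRAY_BASE_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);

    private static Unsafe loadUnsafe() {
        try {
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            return (Unsafe) field.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("sun.misc.Unsafe is not available on this JVM", e);
        }
    }

    /** One accessor for both cases: a null heapMemory means address is absolute. */
    static byte get(byte[] heapMemory, long address, int index) {
        return UNSAFE.getByte(heapMemory, address + index);
    }

    /** On-heap: non-null array, address relative to the array object. */
    static byte readOnHeap() {
        byte[] heap = {10, 20, 30};
        return get(heap, BYTE_ARRAY_BASE_OFFSET, 1);
    }

    /** Off-heap: null object, address is an absolute native address. */
    static byte readOffHeap() {
        long base = UNSAFE.allocateMemory(3);
        try {
            UNSAFE.putByte(base + 1, (byte) 20);
            return get(null, base, 1);
        } finally {
            UNSAFE.freeMemory(base);
        }
    }

    public static void main(String[] args) {
        System.out.println(readOnHeap());  // 20
        System.out.println(readOffHeap()); // 20
    }
}
```

The get method never branches on the memory type; the pair (heapMemory, address) alone decides which memory is touched.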
The question, then, is how Flink obtains the memory address of off-heap data. The answer is in the following code snippet:
    /** The reflection fields with which we access the off-heap pointer from direct ByteBuffers */
    private static final Field ADDRESS_FIELD;

    static {
        try {
            ADDRESS_FIELD = java.nio.Buffer.class.getDeclaredField("address");
            ADDRESS_FIELD.setAccessible(true);
        }
        catch (Throwable t) {
            throw new RuntimeException(
                    "Cannot initialize HybridMemorySegment: off-heap memory is incompatible with this JVM.", t);
        }
    }
The static block uses reflection on the Buffer class to obtain the Field that represents its address property. The following method then reads that field:
    private static long getAddress(ByteBuffer buffer) {
        if (buffer == null) {
            throw new NullPointerException("buffer is null");
        }
        try {
            return (Long) ADDRESS_FIELD.get(buffer);
        }
        catch (Throwable t) {
            throw new RuntimeException("Could not access direct byte buffer address.", t);
        }
    }
It returns the off-heap address of the given buffer.
The two MemorySegment fields described above, combined with the special behavior of the Unsafe methods, keep the implementation of HybridMemorySegment clear and concise. Even so, the class also maintains a reference to the off-heap data it manages: offHeapBuffer. On the one hand, holding the reference keeps that memory from being released; on the other hand, the class uses it to implement some of its own methods.
MemorySegmentFactory
MemorySegmentFactory is used to create MemorySegment instances, and Flink strongly recommends using it rather than instantiating segments by hand. The purpose is to ensure that only one of the MemorySegment subclasses is in use at runtime, never both. With both subclasses loaded, the JIT has to choose between the implementations when dispatching each call, and that overhead causes a significant drop in performance. The official Flink blog has a post devoted to the comparison and the benchmarks behind it; see reference [1] at the end.
Class diagram of MemorySegmentFactory and related classes: (figure not included)
This is clearly the factory method design pattern.
MemorySegmentFactory has an inner interface, Factory. Each of the two MemorySegment implementation classes has an inner class that implements this interface and so defines its own Factory implementation: HybridMemorySegmentFactory and HeapMemorySegmentFactory. There is nothing special here; the respective constructors are simply private to prevent direct instantiation from outside.
The MemorySegmentFactory class exposes methods similar to those of the Factory interface. In effect it is a layer of logic that selects the concrete Factory instance (nearly every method calls ensureInitialized first):
    private static void ensureInitialized() {
        if (factory == null) {
            factory = HeapMemorySegment.FACTORY;
        }
    }
As the code shows, MemorySegmentFactory by default uses the HeapMemorySegment class as the MemorySegment implementation.
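The lazily selected default can be modeled in a few lines. The sketch below is a simplified, hypothetical model and not Flink's API: SegmentFactories, Segment, and HeapSegment are stand-in names, and only the pattern (private constructor, a FACTORY constant, and an ensureInitialized fallback) follows the description above.

```java
final class SegmentFactories {

    /** Hypothetical stand-in for MemorySegment. */
    interface Segment {
        int size();
        boolean isOffHeap();
    }

    /** Hypothetical stand-in for the inner Factory interface. */
    interface Factory {
        Segment allocate(int size);
    }

    static final class HeapSegment implements Segment {
        /** The only way to create a HeapSegment from outside this class. */
        static final Factory FACTORY = HeapSegment::new;

        private final byte[] memory;

        private HeapSegment(int size) {
            this.memory = new byte[size];
        }

        @Override
        public int size() {
            return memory.length;
        }

        @Override
        public boolean isOffHeap() {
            return false;
        }
    }

    /** The factory in use; null until the first call selects the default. */
    private static Factory factory;

    private static void ensureInitialized() {
        if (factory == null) {
            factory = HeapSegment.FACTORY; // default: heap-backed segments
        }
    }

    static Segment allocate(int size) {
        ensureInitialized();
        return factory.allocate(size);
    }

    public static void main(String[] args) {
        Segment segment = allocate(32);
        System.out.println(segment.size() + " bytes, off-heap: " + segment.isOffHeap()); // 32 bytes, off-heap: false
    }
}
```

Because callers only ever go through allocate, the concrete segment class is chosen in one place, which is what makes it possible to keep a single implementation in use at runtime.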
View: a higher-level abstraction built on MemorySegment
Besides MemorySegment and its implementations, Flink's core package also provides a higher-level abstraction built on top of MemorySegment: the DataView (data view).
Class diagram of the data view classes: (figure not included)
There are two interfaces: the output view, DataOutputView (for writing data), and the input view, DataInputView (for reading data). Each also has a seekable variant that supports position-based seek operations (that is, reading or writing data at a specified position). In addition, there are two implementation classes, each of which wraps the corresponding stream interface. There is nothing special here, so we will not go into detail.
This concludes the walkthrough of the data structures behind Flink's self-managed memory.
Reference
[1] https://flink.apache.org/news/2015/09/16/off-heap-memory.html