Flink Principles and Implementation: Memory Management

Source: Internet
Author: User

Today, the open-source frameworks in the big data domain (Hadoop, Spark, Storm) all run on the JVM, and Flink is no exception. A JVM-based data analysis engine needs to hold large amounts of data in memory, which exposes several problems with the JVM:

  • Low Java object storage density. An object containing only a single boolean field occupies 16 bytes of memory: 8 bytes of object header, 1 byte for the boolean, and 7 bytes of alignment padding, even though a single bit (1/8 of a byte) would suffice.
  • Full GC can severely affect performance. For JVMs with a large heap holding large data sets, a GC pause can last seconds or even minutes.
  • OOM errors hurt stability. OutOfMemoryError is a common problem in distributed computing frameworks: when the total size of all objects exceeds the memory allocated to the JVM, an OutOfMemoryError is thrown and the JVM may crash, harming both the robustness and the performance of the framework.

So today, more and more big data projects, such as Spark, Flink, and HBase, have started managing JVM memory themselves, in order to achieve C-like performance and avoid OOM errors. This article discusses how Flink solves the problems above, covering memory management, a custom serialization framework, cache-friendly data structures and algorithms, off-heap memory, JIT compilation optimizations, and more.

Active Memory Management

Instead of storing large numbers of objects on the heap, Flink serializes objects into pre-allocated memory blocks called MemorySegments. A MemorySegment represents a fixed-length block of memory (32KB by default), is the smallest memory allocation unit in Flink, and provides very efficient read and write methods. You can think of a MemorySegment as a custom java.nio.ByteBuffer for Flink. Its backing storage can be an ordinary Java byte array (byte[]) or a ByteBuffer allocated off-heap. Each record is stored in serialized form in one or more MemorySegments.
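As a rough illustration (not Flink's actual API), a heap-backed segment can be pictured as a thin wrapper around a fixed-length buffer with typed accessors at explicit offsets; all names below are hypothetical, and ByteBuffer stands in for the Unsafe-based access Flink really uses:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of a heap-backed, fixed-length memory segment.
public class SegmentSketch {
    public static final int SEGMENT_SIZE = 32 * 1024; // Flink's default: 32KB

    private final ByteBuffer data;

    public SegmentSketch() {
        this.data = ByteBuffer.allocate(SEGMENT_SIZE).order(ByteOrder.BIG_ENDIAN);
    }

    public void putInt(int offset, int value)       { data.putInt(offset, value); }
    public int getInt(int offset)                   { return data.getInt(offset); }
    public void putDouble(int offset, double value) { data.putDouble(offset, value); }
    public double getDouble(int offset)             { return data.getDouble(offset); }

    public static void main(String[] args) {
        SegmentSketch seg = new SegmentSketch();
        // store a record (id=7, score=3.5) in serialized form at offset 0
        seg.putInt(0, 7);
        seg.putDouble(4, 3.5);
        System.out.println(seg.getInt(0));    // prints 7
        System.out.println(seg.getDouble(4)); // prints 3.5
    }
}
```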

Workers in Flink are called TaskManagers; a TaskManager is the JVM process that runs user code. A TaskManager's heap memory is divided into three main parts:

  • Network buffers: a number of 32KB buffers used mainly for network data transfer. They are allocated when the TaskManager starts. The default number is 2048 and can be configured via taskmanager.network.numberOfBuffers. (Read this article to learn more about network buffer management.)
  • Memory Manager pool: a large collection of MemorySegments managed by the MemoryManager. Flink's algorithms (such as sort/shuffle/join) request MemorySegments from this pool, store serialized data in them, and release them back to the pool when done. By default the pool occupies 70% of the heap.
  • Remaining (free) heap: this part of the memory is reserved for user code and TaskManager data structures. Because those data structures are generally small, the memory is mostly available to user code. From the GC's point of view this region corresponds to the young generation, meaning it mostly holds short-lived objects generated by user code.
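To make the split concrete, here is a back-of-the-envelope calculation under assumed defaults (2048 network buffers of 32KB each, a pool fraction of 0.7 applied to the whole heap, and a hypothetical 4GiB heap; the exact base the 70% applies to may differ in real configurations):

```java
public class HeapSplit {
    // Returns the memory budget in bytes: {networkBuffers, managedPool, freeHeap}.
    // Assumes the 70% pool fraction applies to the whole heap, per the figure above.
    public static long[] budget(long heapBytes) {
        long network = 2048L * 32 * 1024;       // 2048 buffers * 32KB = 64 MiB
        long pool = (long) (heapBytes * 0.7);   // Memory Manager pool
        long free = heapBytes - network - pool; // remaining heap for user code
        return new long[] { network, pool, free };
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024;    // hypothetical 4 GiB heap
        long[] b = budget(heap);
        System.out.printf("network=%d MiB, pool=%d MiB, free=%d MiB%n",
                b[0] >> 20, b[1] >> 20, b[2] >> 20);
    }
}
```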

Note: the Memory Manager pool is used mainly in batch mode. In streaming mode the pool does not pre-allocate memory, and algorithms do not request MemorySegments from it, which means that memory remains available to user code. However, the community is planning to use the pool in streaming mode as well.


Flink uses DBMS-style sort and join algorithms that operate directly on binary data, minimizing serialization/deserialization overhead. In this respect Flink's internals resemble C++ more than Java. If the data to be processed exceeds the memory limit, part of it is spilled to disk. To operate on multiple MemorySegments as if they were one large contiguous block of memory, Flink uses a logical view (AbstractPagedInputView) for convenience. The following figure depicts how Flink stores serialized data in memory blocks and how data is spilled to disk when necessary.

From the above we can conclude that Flink's active memory management and direct manipulation of binary data have the following advantages:

  • Reduced GC pressure. This is obvious: all resident data is kept in binary form in Flink's MemoryManager, and those MemorySegments stay in the old generation without ever being reclaimed by the GC. Other data objects are mostly short-lived objects created by user code, which are quickly reclaimed by minor GCs. As long as user code does not create large numbers of cache-like resident objects, the size of the old generation stays constant and major GCs essentially never happen, effectively reducing garbage collection pressure. In addition, the memory blocks can live off-heap, which makes the JVM heap smaller and speeds up garbage collection.
  • OOM avoidance. All runtime data structures and algorithms can only obtain memory through the memory pool, which guarantees that the memory they use is bounded and that they cannot cause an OOM. When memory is tight, the algorithms (sort/join, etc.) efficiently spill large blocks of memory to disk and read them back later. OutOfMemoryErrors are therefore effectively avoided.
  • Memory savings. Java objects carry considerable storage overhead (as discussed in the previous section). Storing only the binary content of the actual data avoids that overhead.
  • Efficient binary operations and cache-friendly computation. Binary data in a well-defined format can be compared and manipulated efficiently. Moreover, the binary form allows related values, hash codes, keys, and pointers to be placed adjacently in memory. This makes the data structures more cache-friendly and yields performance improvements from the L1/L2/L3 caches (explained in more detail below).

Custom-tailored serialization framework for Flink


The Java ecosystem provides a number of serialization frameworks: Java serialization, Kryo, Apache Avro, and more. However, Flink implements its own serialization framework. Because the data streams processed in Flink are usually of a single type (the type of a DataSet is fixed), only one copy of the schema information needs to be stored per dataset, saving a large amount of storage space. Furthermore, fields of fixed-size types can be accessed at fixed offsets. When we need to access a particular member variable, a custom serializer lets us deserialize just that member via its offset rather than deserializing the whole Java object. When objects have many member variables, this greatly reduces object creation overhead and the amount of memory data copied.
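The offset-based access can be sketched like this: given a hypothetical record with a fixed layout, say (int id, double score) serialized as 4 + 8 bytes, the score field can be read directly at offset 4 without deserializing the whole record (names and layout are illustrative, not Flink's actual wire format):

```java
import java.nio.ByteBuffer;

public class OffsetAccess {
    // Fixed layout for a hypothetical record: int id at offset 0, double score at offset 4.
    static final int SCORE_OFFSET = 4;

    static byte[] serialize(int id, double score) {
        return ByteBuffer.allocate(12).putInt(id).putDouble(score).array();
    }

    // Reads only the score field; no full deserialization, no object allocation.
    static double readScore(byte[] record) {
        return ByteBuffer.wrap(record).getDouble(SCORE_OFFSET);
    }

    public static void main(String[] args) {
        byte[] record = serialize(7, 99.5);
        System.out.println(readScore(record)); // prints 99.5
    }
}
```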


Flink supports arbitrary Java and Scala types. Flink improves on earlier systems here: types do not need to implement a specific interface (like org.apache.hadoop.io.Writable in Hadoop); Flink identifies data types automatically. For Java programs, Flink analyzes the return types of UDFs (User-Defined Functions) via the Java reflection framework; for Scala programs it uses the Scala compiler to analyze them. Type information is represented by the TypeInformation class, which supports the following types:

  • BasicTypeInfo: any Java primitive type (boxed) or the String type.
  • BasicArrayTypeInfo: any array of Java primitive types (boxed) or of Strings.
  • WritableTypeInfo: any implementation of Hadoop's Writable interface.
  • TupleTypeInfo: any Flink Tuple type (Tuple1 through Tuple25). Flink tuples are fixed-length, fixed-type Java tuple implementations.
  • CaseClassTypeInfo: any Scala case class (including Scala tuples).
  • PojoTypeInfo: any POJO (Java or Scala) in which, for example, every member variable is either declared public or accessible through getter/setter methods.
  • GenericTypeInfo: any class that does not match one of the previous types.

The first six data types cover most Flink programs. For datasets of those types, Flink automatically generates a corresponding TypeSerializer, which serializes and deserializes the dataset very efficiently. For the last category, Flink falls back to Kryo for serialization and deserialization. Each TypeInformation carries its serializer; a value is serialized by that serializer and then written into MemorySegments via the Java Unsafe interface. For data types that can be used as keys, Flink also generates a TypeComparator, which performs comparisons, hashing, and so on directly on the serialized binary data. Composite types such as Tuple, case classes, and POJOs have composite TypeSerializers and TypeComparators that delegate serialization and comparison of the individual fields to the fields' own serializers and comparators. The following illustration shows the serialization of a nested Tuple3 object.

As can be seen, this serialization scheme has very compact storage density: an int takes 4 bytes, a double 8 bytes, and a POJO adds only a one-byte header. The PojoSerializer serializes only the header and delegates serialization of each field to that field's serializer.
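For instance, a tuple-like value of (int, double, boolean) serializes into just 4 + 8 + 1 = 13 bytes under such a scheme (an illustrative layout, not Flink's exact wire format), versus the dozens of bytes of object headers, padding, and references the equivalent boxed Java objects would need:

```java
import java.nio.ByteBuffer;

public class Density {
    // Serializes (int, double, boolean) with no per-object headers: 13 bytes total.
    static byte[] serialize(int a, double b, boolean c) {
        return ByteBuffer.allocate(4 + 8 + 1)
                .putInt(a)
                .putDouble(b)
                .put((byte) (c ? 1 : 0))
                .array();
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(42, 3.14, true);
        System.out.println(bytes.length); // prints 13
    }
}
```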

The Flink type system makes it easy to extend with custom TypeInformation, serializers, and comparators, improving serialization and comparison performance for custom data types.

How Flink manipulates binary data directly

Flink provides operations such as group, sort, and join that need to access massive amounts of data. Here we take sort as the example, since it is a very frequent operation in Flink.

First, Flink requests a batch of MemorySegments from the MemoryManager; we call this batch of MemorySegments the sort buffer, and it holds the data to be sorted.

The sort buffer is divided into two areas. One area holds the complete binary data of all objects; the other holds pointers to that data together with fixed-length serialized keys (key+pointer). If the key is of a variable-length type such as String, a fixed-length prefix of it is serialized. As shown in the figure above, when an object is added to the sort buffer, its binary data is appended to the first area and the pointer (and possibly the key) is appended to the second.

There are two reasons for storing the actual data and the key+pointer entries separately. First, swapping fixed-length key+pointer entries is more efficient: the real data never needs to be exchanged, and no other keys or pointers need to be moved. Second, it is cache-friendly: because the keys are stored contiguously in memory, cache misses are greatly reduced (explained in detail later).

The core of sorting is comparing and swapping. In Flink, comparison is done first on the binary keys directly, without deserializing the entire objects. Because the keys are fixed-length, if two keys compare equal (or if no binary key was provided), the actual binary data must be deserialized and compared. After that, only the key+pointer entries need to be swapped to achieve the sort; the real data never moves.
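The key+pointer idea can be sketched as follows: pack each record's fixed-length key and its index into one long, sort only those longs, and never move the records themselves. This is a simplified model (Flink's real sort buffer works on raw MemorySegments, and the "pointer" here is just an array index):

```java
import java.util.Arrays;

public class KeyPointerSort {
    // Sorts records by a non-negative int key without moving the records:
    // each entry packs (key << 32) | recordIndex, so sorting the longs
    // orders by key first; the low 32 bits act as the "pointer".
    static String[] sortByKey(int[] keys, String[] records) {
        long[] entries = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            entries[i] = ((long) keys[i] << 32) | (i & 0xFFFFFFFFL);
        }
        Arrays.sort(entries); // swaps fixed-length entries only, never the records
        String[] result = new String[records.length];
        for (int i = 0; i < entries.length; i++) {
            result[i] = records[(int) entries[i]]; // follow the pointer
        }
        return result;
    }

    public static void main(String[] args) {
        int[] keys = {3, 1, 2};
        String[] records = {"cherry", "apple", "banana"};
        System.out.println(Arrays.toString(sortByKey(keys, records)));
        // prints [apple, banana, cherry]
    }
}
```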

Finally, the sorted data can be read sequentially by walking the ordered key+pointer area and following each pointer to the corresponding real data, which can then be written to memory or to external storage (for more details see the article Joins in Flink).

Cache-friendly data structures and algorithms

As disk I/O and network I/O get faster and faster, the CPU is becoming the bottleneck in the big data world. Reading data from the L1/L2/L3 caches is orders of magnitude faster than reading from main memory. Performance analysis reveals that a large portion of CPU time is wasted waiting for data to arrive from main memory. If the data can be served from the L1/L2/L3 caches, those wait times shrink dramatically and all algorithms benefit.

As discussed above, Flink keeps the data the algorithms operate on (such as the keys during a sort) densely packed through its custom serialization framework, while the full data is stored elsewhere. Because key+pointer entries are far more likely than full records to fit in cache, this greatly improves the cache hit ratio and thus the efficiency of the underlying algorithms. All of this is completely transparent to the application, which enjoys the cache-friendly performance improvement for free.

Toward out-of-heap memory

Flink's heap-based memory management already solves many of the JVM's problems, so why introduce off-heap memory as well?

  • Starting a JVM with a very large heap (hundreds of GB) takes a long time, and GC pauses are also very long (minutes). With off-heap memory, the heap can be kept small (only the remaining-heap region is allocated on it), so scaling a TaskManager to hundreds of gigabytes of memory is no longer a problem.
  • Efficient I/O. Writing off-heap memory to disk or to the network is zero-copy, whereas heap memory always requires at least one copy.
  • Off-heap memory can be shared between processes. That means the data survives even if the JVM process crashes, which could be used for failure recovery (Flink does not exploit this yet, but is likely to in the future).

But powerful things always have their downsides; otherwise we would all be using off-heap memory everywhere. Heap memory is much simpler to use, monitor, and debug; off-heap memory means more complexity and more trouble. Flink sometimes needs to allocate short-lived MemorySegments, and that allocation is cheaper on the heap. Some operations are also slightly faster on heap memory.

Flink uses ByteBuffer.allocateDirect(numBytes) to allocate off-heap memory and sun.misc.Unsafe to manipulate it.

Thanks to Flink's good abstractions, implementing off-heap memory support was straightforward: Flink turned the original MemorySegment into an abstract class with two subclasses, HeapMemorySegment and HybridMemorySegment. The names mostly speak for themselves: the former allocates heap memory, while the latter can allocate both off-heap memory and heap memory. Yes, you read that correctly: the hybrid subclass handles both. Why design it this way?

First, suppose HybridMemorySegment only supported off-heap memory. As noted in the drawbacks above, Flink sometimes needs to allocate short-lived buffers, which is more efficient with HeapMemorySegment, so when using off-heap memory we would need to load both subclasses at the same time. This runs into a JIT compilation issue. The original MemorySegment was a standalone final class with no subclasses; at JIT compile time every method call target is deterministic, so all method invocations can be de-virtualized and inlined, which greatly improves performance (MemorySegments are used extremely frequently). If two subclasses are loaded, however, the JIT compiler only discovers the actual receiver class at run time and cannot perform these optimizations ahead of time. In actual tests the performance gap is roughly a factor of 2.7.

Flink therefore uses two approaches:

Approach 1: only one MemorySegment implementation is ever loaded

All MemorySegments in the code, short- and long-lived alike, are instantiated from only one of the subclasses; the other is never instantiated at all (a factory controls this). After running for a while, the JIT notices that all call targets are deterministic and optimizes accordingly.
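A minimal sketch of that factory discipline (all names hypothetical, not Flink's actual classes): the process decides once which subclass it will ever instantiate, so every call site stays monomorphic and the JIT can de-virtualize it:

```java
// Hypothetical sketch of the single-implementation factory idea.
abstract class Segment {
    abstract long get(int index);
}

final class HeapSegment extends Segment {
    private final long[] data;
    HeapSegment(int size) { this.data = new long[size]; }
    long get(int index) { return data[index]; }
}

final class OffHeapSegment extends Segment {
    // off-heap variant elided; never instantiated when running in heap mode
    long get(int index) { throw new UnsupportedOperationException(); }
}

public class SegmentFactory {
    // Decided once at startup; afterwards only this one subclass is ever created,
    // so every Segment.get call site is monomorphic and can be de-virtualized.
    private static final boolean OFF_HEAP = false;

    public static Segment allocate(int size) {
        return OFF_HEAP ? new OffHeapSegment() : new HeapSegment(size);
    }

    public static void main(String[] args) {
        Segment s = SegmentFactory.allocate(8);
        System.out.println(s instanceof HeapSegment); // prints true
    }
}
```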

Approach 2: provide one implementation that can handle both heap memory and off-heap memory

This is HybridMemorySegment, which handles both heap and off-heap memory, so no further subclassing is needed. Flink elegantly implements one code path that can operate on both heap and off-heap memory, thanks largely to a family of methods provided by sun.misc.Unsafe, such as getLong:

sun.misc.Unsafe.getLong(Object reference, long offset)
If reference is non-null, the object's address is taken, the offset is added, and 8 bytes are read from the resulting relative address to produce a long; this is the heap-memory case. If reference is null, offset is treated as an absolute address and the data is read from there; this is the off-heap case.

Here we look at the implementation of MemorySegment and its subclasses.

public abstract class MemorySegment {
    // Heap memory reference
    protected final byte[] heapMemory;

    // Off-heap memory address
    protected long address;

    // The address one byte past the last addressable byte
    protected long addressLimit;

    // Heap memory initialization
    MemorySegment(byte[] buffer, Object owner) {
        // some prior checks ...
        this.heapMemory = buffer;
        this.address = BYTE_ARRAY_BASE_OFFSET;
        // ...
    }

    // Off-heap memory initialization
    MemorySegment(long offHeapAddress, int size, Object owner) {
        // some prior checks ...
        this.heapMemory = null;
        this.address = offHeapAddress;
        // ...
    }

    public final long getLong(int index) {
        final long pos = address + index;
        if (index >= 0 && pos <= addressLimit - 8) {
            // This is the part we care about: Unsafe handles on-heap and off-heap uniformly
            return UNSAFE.getLong(heapMemory, pos);
        } else if (address > addressLimit) {
            throw new IllegalStateException("segment has been freed");
        } else {
            // index is in fact invalid
            throw new IndexOutOfBoundsException();
        }
    }
    // ...
}

public final class HeapMemorySegment extends MemorySegment {
    // An additional reference to heapMemory, used for array bounds checks
    private byte[] memory;

    // Can only be initialized with heap memory
    HeapMemorySegment(byte[] memory, Object owner) {
        super(Objects.requireNonNull(memory), owner);
        this.memory = memory;
    }
    // ...
}
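The same one-code-path trick can be imitated portably with ByteBuffer (a sketch with hypothetical names, not Flink's implementation): a heap buffer and a direct (off-heap) buffer are served by identical accessor code, just as Unsafe.getLong serves both cases in HybridMemorySegment:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one segment type whose accessors work for both
// heap and off-heap backing, mirroring the HybridMemorySegment idea.
// ByteBuffer stands in for sun.misc.Unsafe to keep the example portable.
final class HybridSegmentSketch {
    private final ByteBuffer buffer; // heap or direct

    private HybridSegmentSketch(ByteBuffer buffer) { this.buffer = buffer; }

    static HybridSegmentSketch onHeap(int size) {
        return new HybridSegmentSketch(ByteBuffer.allocate(size));
    }

    static HybridSegmentSketch offHeap(int size) {
        return new HybridSegmentSketch(ByteBuffer.allocateDirect(size));
    }

    void putLong(int index, long value) { buffer.putLong(index, value); }
    long getLong(int index)             { return buffer.getLong(index); }
}

public class HybridDemo {
    public static void main(String[] args) {
        HybridSegmentSketch heap = HybridSegmentSketch.onHeap(32);
        HybridSegmentSketch direct = HybridSegmentSketch.offHeap(32);
        heap.putLong(0, 42L);
        direct.putLong(0, 42L);
        // the same getLong code path works for both backings
        System.out.println(heap.getLong(0) == direct.getLong(0)); // prints true
    }
}
```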
