A Bloat-Aware Design for Big Data Applications


[Translator's preface]

Nothing worries an engineer more than operations. Suppose you are handed several high-end machines, each with 8 cores and 16 GB of RAM; the worry is whether a Java program can actually make good use of that much memory. In practice, once the JVM heap is raised to 2 GB or more, the huge overhead of GC demands great care. This paper explores a solution from both theoretical and practical angles. Its ideas are clear and its analysis is thorough; it is well worth a read if you want to keep the convenience of Java while staying away from GC trouble.

 

1. Overview

Over the past decade, as demand for data-driven business intelligence has kept growing, all kinds of large-scale data-intensive applications have flourished, and they often need to process enormous data sets (on the TB or PB scale). Object-oriented languages such as Java are frequently used to implement these applications, mainly because of Java's rapid development cycle and rich community resources. Despite the ease of development, the language brings serious performance problems that can feel like a nightmare: a managed runtime has various inherent inefficiencies, and processing large-scale data within a limited memory space compounds them, ultimately leading to astonishing memory bloat and performance degradation.

This paper proposes a bloat-aware design paradigm for developing efficient and scalable big data applications, even in garbage-collected object-oriented languages. We first examine several typical memory bloat patterns, summarized from user complaints about two widely used open-source big data applications. We then discuss a new design paradigm that eliminates this bloat. Through examples and experiments, we demonstrate that programming in this paradigm does not significantly increase development cost. We then implement several common data processing tasks both with the new design principles and with conventional object-oriented ones for comparison. The experimental results show that the new paradigm improves performance dramatically: even when the data size is far from huge, we see gains of more than 2.5x, and as the data set grows, the gains increase proportionally.

2. Memory Analysis of Big Data Applications

In this chapter, we analyze the performance and scalability impact of using Java objects to represent and process data, by studying two popular data-intensive applications, Giraph and Hive. Our analysis goes to the most fundamental issues, time and space: (1) the large amount of memory consumed by object headers and object references, which results in a low packing density in the heap, and (2) the poor GC efficiency caused by massive numbers of objects and references.

2.1 Low Packing Density

At runtime, every Java object requires a header space to support type management and memory management, and an array object requires additional space to store its length. For example, in Oracle's 64-bit HotSpot JVM, the headers of a regular object and of an array occupy 8 and 12 bytes respectively. In a typical big data application, the JVM heap often contains many small objects (such as Integers representing record IDs), so the header overhead cannot be ignored. Moreover, the heavy use of object-oriented data structures makes space usage even less efficient: these data structures often rely on multi-level delegation to achieve their functionality, so a large fraction of the space stores pointers rather than real data. To measure the cost-effectiveness of space usage, we use a metric called the packing factor, which indicates the maximum amount of real data that can be stored in a fixed-size memory. Our analysis targets the typical big data scenario in which massive data flows, batch by batch, through a fixed-size memory.
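To make the packing factor concrete, here is a minimal sketch using the header and pointer sizes quoted above (assumed figures; exact sizes vary with JVM version and flags) to compute the footprint of one boxed long value reachable through a reference:

```java
// A minimal packing-factor calculation for one boxed long value on a
// 64-bit HotSpot JVM, using the header/pointer sizes quoted in the text
// (assumed; exact sizes vary with JVM version and flags).
public class PackingFactor {
    static final int HEADER = 8;    // regular object header, bytes
    static final int REFERENCE = 8; // pointer that keeps the object reachable
    static final int PAYLOAD = 8;   // the long value itself

    /** Heap bytes consumed to store one boxed long via a reference. */
    static int boxedFootprint() {
        return HEADER + REFERENCE + PAYLOAD; // 24 bytes for 8 bytes of data
    }

    /** Packing factor: real data divided by total memory used. */
    static double packingFactor() {
        return (double) PAYLOAD / boxedFootprint(); // 8/24 = 1/3
    }

    public static void main(String[] args) {
        System.out.println("footprint: " + boxedFootprint() + " bytes");
        System.out.println("packing factor: " + packingFactor());
    }
}
```

Two thirds of the memory goes to bookkeeping, which is exactly the kind of waste the packing factor exposes.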
To analyze the heap packing density of a typical big data application, we use the PageRank algorithm (an application built on Giraph) as an example. PageRank (PR) is a link analysis algorithm that assigns a weight to each vertex in a graph, computed iteratively from the weights of the vertex's inbound neighbors. It is widely used by search engines to rank pages.
We ran PR on several open-source big data computing systems, including Giraph, Spark, and Mahout, on a cluster of 180 machines, each with two quad-core Intel Xeon E5420 processors and 16 GB of RAM. The experimental data set (a web graph) is 70 GB in total and contains 1,413,511,393 vertices. None of the systems could process this data set successfully: all of their workers crashed with java.lang.OutOfMemoryError, even though the data partition on each machine was small (less than 500 MB) and physical memory was sufficient.

We found that many developers have encountered similar problems; for example, there are many OutOfMemoryError complaints on Giraph's user mailing list. To locate the bottleneck, we decided to do a quantitative analysis of PR. Giraph contains an example implementation of the PR algorithm, part of whose data structures are shown below:

public abstract class EdgeListVertex<I extends WritableComparable,
        V extends Writable, E extends Writable, M extends Writable>
        extends MutableVertex<I, V, E, M> {
    private I vertexId = null;
    private V vertexValue = null;
    /** indices of its outgoing edges */
    private List<I> destEdgeIndexList;
    /** values of its outgoing edges */
    private List<E> destEdgeValueList;
    /** incoming messages from the previous iteration */
    private List<M> msgList;
    ......
    /** return the edge indices starting from 0 */
    public List<I> getEdgeIndexes() {
        ...
    }
}

The graphs processed by Giraph are labeled (both vertices and edges carry values) and the edges are directed. The EdgeListVertex class represents a vertex of the graph. Among its fields, vertexId and vertexValue store the vertex's ID and value; destEdgeIndexList and destEdgeValueList reference the IDs and values of its outgoing edges; msgList contains the messages sent to it in the previous iteration. Figure 1 shows the structure of the heap rooted at an EdgeListVertex object.


In Giraph's PR implementation, the actual types of I, V, E, and M are LongWritable, DoubleWritable, FloatWritable, and DoubleWritable. In this graph every edge has the same weight, so the list referenced by destEdgeValueList is always empty. Assume each vertex has, on average, m outgoing edges and n messages. Table 1 shows the memory consumption statistics of one vertex data structure on the Oracle 64-bit HotSpot JVM. Each row of the table covers one class and shows the number of objects of that class the data structure needs, the number of bytes consumed by those objects' headers, and the number of bytes consumed by reference fields. The space overhead of one vertex is 16(m + n) + 148 bytes (the sum of the header sizes and pointer sizes).
On the other hand, Figure 2 shows a theoretical memory layout in which only the necessary information is stored (no objects required). In this layout, a vertex needs m + 1 long values (the vertex ID and the m edge IDs), n + 1 double values (the vertex value and the n messages), and two 32-bit int values (the edge count and the message count), consuming a total of 8(m + 1) + 8(n + 1) + 8 = 8(m + n) + 24 bytes of memory. This is less than half of what the object headers and pointers alone consume in Table 1. Clearly, in the object-based representation the space overhead exceeds 200% of the space actually required.
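The arithmetic above can be checked with a small sketch; the values of m and n used below are assumed averages, chosen only for illustration:

```java
// Per-vertex arithmetic from Table 1 and Figure 2: header/pointer overhead
// of the object-based vertex vs. the bytes the compact layout needs.
public class VertexOverhead {
    /** Header + pointer overhead of the object-based vertex, in bytes. */
    static int objectOverhead(int m, int n) {
        return 16 * (m + n) + 148;
    }

    /** Bytes of real data in the compact layout of Figure 2. */
    static int compactSize(int m, int n) {
        // (m + 1) longs, (n + 1) doubles, two 32-bit ints
        return 8 * (m + 1) + 8 * (n + 1) + 2 * 4; // = 8 * (m + n) + 24
    }

    public static void main(String[] args) {
        int m = 30, n = 10; // assumed averages, for illustration only
        double ratio = (double) objectOverhead(m, n) / compactSize(m, n);
        System.out.println("overhead / data = " + ratio); // exceeds 2, i.e. > 200%
    }
}
```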

 

2.2 The Cost of Massive Numbers of Objects and References

In a JVM, GC threads periodically traverse all live objects in the heap to identify and reclaim dead ones. If the number of live objects is n and the total number of edges in the object graph (references among objects) is e, the complexity of one tracing garbage collection pass is O(n + e). In big data applications, the object graph often contains ultra-large, isolated subgraphs of objects, some representing data items and some representing the data structures created to process those items. The volume of data objects in memory is therefore enormous, and n and e are several orders of magnitude larger than in a conventional Java application.

We use an exception report from the Hive user mailing list to analyze this problem:

FATAL org.apache.hadoop.mapred.TaskTracker:
Error running child: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.Text.setCapacity(Text.java:240)
    at org.apache.hadoop.io.Text.set(Text.java:204)
    at org.apache.hadoop.io.Text.set(Text.java:194)
    at org.apache.hadoop.io.Text.<init>(Text.java:86)
    ......
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:263)
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:74)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:823)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:263)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:198)
    ......
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:397)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

We examined the Hive source code and found that Text.setCapacity(), at the top of the stack, is not the source of the problem. In Hive's join implementation, JoinOperator holds all Row objects from one input branch in a RowContainer. When a large number of Row objects are stored in the RowContainer, a single GC pass becomes very expensive. In this stack trace, the total size of the Row objects exceeded the heap limit, causing the out-of-memory error.

Even when no out-of-memory error is triggered, a large number of Row objects degrades performance. If the number of Row objects is n, the GC traversal cost is at least O(n). For Hive, n grows proportionally with the input data, which easily leads to heavy GC overhead. The following is a similar user report from StackOverflow; the symptom differs, but the root cause is the same:

"I wrote a Hive query, which selects about 30 columns, approximately 400,000 records, and inserts them into another table. There is an internal connection in My SQL. Query failed because a Java GC overhead limit exceeded exception occurs ."

In fact, complaints about GC overhead appear regularly on the Hive mailing list and on StackOverflow. Worse, there is nothing a developer can do to optimize this away, because the root of the inefficiency lies in Hive's internal design: all of Hive's data processing interfaces use Java objects to represent data items. To manipulate the data contained in a Row, we must wrap it in a Row object to follow the interface design. Truly solving this performance problem would require redesigning and reimplementing all the relevant interfaces, a disruption no user could afford. This example pushed us to look for a solution at the design level, without being constrained by the traditional object-oriented framework.

3. The Bloat-Aware Design Paradigm

The root cause of the performance problems above is that these two big data applications were designed and implemented strictly following the conventional object-oriented principle: everything is an object. Objects represent both the data processors and the data items being processed. Creating data processor objects rarely has a significant performance impact, but representing data items as objects can create a serious bottleneck and prevent the application from processing large data sets. It is worth noting that a typical big data application executes similar data processing tasks repeatedly, and a group of related data items often has similar behavior patterns and lifecycles; we can therefore easily manage them together in one large buffer block, so that GC need not traverse each individual data item to check whether it is dead. For example, all the vertex objects in the Giraph example share the same lifecycle, as do the Rows in Hive. This naturally suggests allocating one large memory region, placing the real content of all data items in it (data bytes rather than objects), and managing them centrally: to the JVM, this whole region is a single object, and when the data items are no longer needed, only that one object has to be reclaimed.

Based on this observation, we propose a bloat-aware design paradigm for developing efficient big data applications. The paradigm has two important components: (1) merging small data records into a few large objects (such as byte buffers) instead of one object per record, and (2) manipulating data through direct buffer access (at the byte level rather than the object level). The core of the paradigm is to bound the number of objects rather than letting it grow proportionally with the input data. Note that these guidelines should be considered explicitly in the early design phase, so that the later APIs and implementations can follow them. We have built a big data processing framework, Hyracks, that strictly follows this paradigm; below, we use Hyracks examples to illustrate these design principles.

3.1 Data Storage Design: Merging Small Objects

As described in Chapter 2, storing data as Java objects incurs all kinds of memory and CPU overhead. We therefore propose storing a set of data items in a Java memory page. Unlike a system-level memory page, which serves virtual memory, a Java memory page is a fixed-length, contiguous block of memory in the JVM heap. For brevity, we will use page to mean a Java memory page. In Hyracks, each page is represented as an object of type java.nio.ByteBuffer. Packing records into pages reduces the number of objects from the total number of data items to the total number of pages, so the packing density of the system approaches the ideal. Note that this is only one of many ways to merge data items into binary pages; we will consider other merge schemes for small objects in the future.

There are many ways to place records on a page. Hyracks uses "slot-based record management" [36], which is widely used in existing DBMS implementations. Taking the PR algorithm as an example, Figure 3 shows four vertices stored in one page. Each vertex is stored in the compact layout of Figure 2, and four 4-byte slots store the offsets of the four vertices. These offsets are used to locate data items quickly and to support variable-length records. Note that the record format is invisible to developers, who can still focus on high-level data management tasks without worrying about the byte layout. Because the page size is fixed, there is often a small amount of residual space that is wasted because no further record fits.
With this background, we can calculate the packing density of the design. Assume each page holds p records on average and the residual space is r bytes. The per-vertex overhead has three parts: the offset slot (4 bytes), the amortized residual space (r/p), and the amortized page object overhead (for the java.nio.ByteBuffer). The page object has an 8-byte header (on the Oracle 64-bit HotSpot JVM) and an 8-byte reference to its internal byte array, whose own header takes 12 bytes; the page object overhead is therefore 28 bytes, or 28/p amortized per vertex. Combining this with section 2.1, a vertex requires a total of 8m + 8n + 24 + 4 + (r + 28)/p bytes, of which 8(m + n) + 24 stores the required data and 4 + (r + 28)/p is overhead. Since r is residual space, it is at most one vertex in size, i.e., r ≤ 8m + 8n + 24, so the per-vertex space overhead is bounded by 4 + (8m + 8n + 52)/p. In Hyracks we use 32 KB pages, and p ranges from 100 to 200 in experiments on real data. In the worst case the residual space equals a full vertex: the vertex size then ranges from roughly (32768 − 4 × 200)/200 ≈ 160 bytes (for p = 200) to (32768 − 4 × 100)/100 ≈ 324 bytes (for p = 100), so 160 ≤ r ≤ 324. The per-vertex overhead therefore ranges from 4 bytes (the offset slot alone) to about 4 + (324 + 28)/100 ≈ 7.5 bytes. Relative to the actual data size, the total overhead is about 2-4%, far below the 200%+ of the object-based representation shown in section 2.1.
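A minimal sketch of slot-based record management may make the layout clearer. This is not Hyracks' actual API; the class and method names are invented for illustration. Record bytes grow from the front of a fixed-size page, while 4-byte slots holding each record's offset grow from the back:

```java
import java.nio.ByteBuffer;

// A minimal sketch (not Hyracks' actual API) of slot-based record
// management: record bytes grow from the front of a fixed-size page,
// while 4-byte slots holding each record's offset grow from the back.
public class SlottedPage {
    private final ByteBuffer page;
    private int writePos = 0;   // next free byte for record data
    private int recordCount = 0;

    SlottedPage(int size) {
        page = ByteBuffer.allocate(size);
    }

    /** Appends a variable-length record; returns its slot index. */
    int append(byte[] record) {
        int slotPos = page.capacity() - 4 * (recordCount + 1);
        if (writePos + record.length > slotPos) {
            throw new IllegalStateException("page full");
        }
        page.position(writePos);
        page.put(record);
        page.putInt(slotPos, writePos); // the slot stores the record's offset
        writePos += record.length;
        return recordCount++;
    }

    /** Start offset of record i (its length is the gap to the next offset). */
    int offsetOf(int i) {
        return page.getInt(page.capacity() - 4 * (i + 1));
    }

    public static void main(String[] args) {
        SlottedPage p = new SlottedPage(32 * 1024); // 32 KB, as in Hyracks
        p.append(new byte[]{1, 2, 3});
        p.append(new byte[]{4, 5});
        System.out.println(p.offsetOf(0)); // 0
        System.out.println(p.offsetOf(1)); // 3
    }
}
```

However many records the page holds, the JVM sees exactly one ByteBuffer object (plus its backing array), which is the point of the design.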

3.2 Data Processor Design: Accessing Buffers

Once buffer-based memory management is in place, we need buffer-based programming support for data processing, so we propose an accessor-based programming paradigm. Previously, we would build a data structure in the heap containing the data items and expressing their logical relationships; now, we instead define a structure of accessors, each of which can access one type of data. Because a handful of accessors can process all the data, heap objects are reduced dramatically. In this chapter, we first make the mental shift of transforming an object-oriented data structure design into the corresponding accessor structure, and then use examples to describe the execution process.

3.2.1 Design Transformation

Suppose that, following object-oriented principles, we have designed a class D for a data item. We now replace D with an accessor class Da. Developers can specify whether a type is a data item class. The transformation proceeds as follows:
Step 1 (S1): for each field f of type F in D, if F is a data item class, add a corresponding field fa of type Fa to Da, where Fa is the accessor class for F. Fields of non-data-item types are simply copied into Da directly (non-data-item state is small and not what matters here; the transformation is concerned with the data item content stored in pages).

Step 2 (S2): add a public method set(byte[] data, int start, int length) to Da. This method binds the accessor to the specified byte range in a page, which holds a data item of type D. Binding can be eager or lazy: eager instantiation recursively binds all member accessors fa to their respective binary regions, while lazy instantiation defers binding a member accessor until it is actually used.

Step 3 (S3): for each method M of D, create a corresponding method Ma in Da, replacing every data-item-typed parameter and return value of M with the corresponding accessor type. The accessors passed as parameters or returned give access to the data items bound to their byte ranges.

The move from a conventional object-oriented design to this design should happen early in development; otherwise the cost of the transformation becomes too high later on. In the future, we will also try to automate the design transformation in the compiler.

3.2.2 Execution Process

At runtime, the associated accessors can be understood as a graph structure, and each accessor graph processes high-level records in batches. Each node of the graph is an accessor object for one field, and each edge represents a "member-of" relationship (for example, if a field f is contained in an object of class D, then Da and fa are connected to express that membership). An accessor graph is like the skeleton of its corresponding heap data structure, but it stores no data itself. Pages flow through the accessor graph: the accessors are bound, one data item at a time, to the byte ranges in each page and then process the bound items. For a single thread, the number of accessor graphs equals the number of data structure types, since different instances of the same data structure can be processed by the same accessor graph.

With eager instantiation, the number of accessors created during task execution equals the total number of nodes in all accessor graphs. With lazy instantiation, far fewer accessors may be created, because one member accessor can often be reused for several different data items of the same type. In some scenarios, additional accessor objects are needed: for example, if a data item class has a compare method that takes two data items as parameters, the transformed compare needs two accessors at runtime, one per argument. However the accessors are implemented, their number is determined at compile time and does not grow with the size of the data set.
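The two-accessor compare case can be sketched as follows. This is a hypothetical accessor for a single long field; the names and layout are invented for illustration, not Hyracks' API:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a transformed compare() takes two accessors rather
// than two objects, reading field values straight out of the page bytes.
public class LongFieldAccessor {
    private byte[] data;
    private int start;

    /** by S2: bind this accessor to a byte range holding one long field. */
    public void set(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
    }

    long get() {
        return ByteBuffer.wrap(data, start, 8).getLong();
    }

    /** by S3: object parameters replaced with accessor parameters. */
    static int compare(LongFieldAccessor a, LongFieldAccessor b) {
        return Long.compare(a.get(), b.get());
    }

    public static void main(String[] args) {
        // Two long values packed into one 16-byte "page".
        byte[] page = ByteBuffer.allocate(16).putLong(7L).putLong(42L).array();
        LongFieldAccessor x = new LongFieldAccessor();
        LongFieldAccessor y = new LongFieldAccessor();
        x.set(page, 0, 8);
        y.set(page, 8, 8);
        System.out.println(compare(x, y)); // negative: 7 < 42
    }
}
```

Note that the comparison allocates no per-record objects: the same two accessors could be re-bound to compare any pair of records in any page.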

[Translator's note]

The accessors may look complicated, but the principle is very simple. Imagine a student information management system with three kinds of objects: school, class, and student. Suppose there are 1 school, 10 classes, and 600 students. Under object-oriented principles, batch processing or keeping the data resident in memory creates 611 objects. Under this paradigm, there are only four objects: one school accessor, one class accessor, one student accessor, and one page object whose byte array compactly holds the 611 data items. Moving the appropriate accessor to the appropriate offset in the page gives access to the data item at that position, say, a particular student in class 302. The number of accessors therefore depends only on the number of model types and is fixed at compile time (rather than growing with the number of data items at runtime). As the text notes, in the ideal single-threaded case each accessor is effectively a singleton: one object is created per accessor type and is moved, in sequence, to the appropriate offsets to access the data items one by one.
Sometimes, however, an accessor is not a singleton. For example, suppose a class originally had one list member, ListAccessor<Student> students, and is redesigned to have two, ListAccessor<Student> maleStudents and ListAccessor<Student> femaleStudents. The accessors have the same type, but the accessor graph now contains two member nodes (with eager instantiation, two member objects exist; with lazy instantiation, one object can be reused, with each use merely requiring a new set of its offset). Likewise, if some method of Da needs to process both the current object and other Da objects, additional Da accessors must be created for them.

It is not hard to see why this is called an accessor graph. No data structure is isolated: it has member variables, and a member may not be a simple primitive (such as int or double) but a reference to another data structure object. This membership relation repeats recursively, forming a graph. The accessors of each data structure are therefore organized into an analogous graph: when the "root" accessor is set to the "root" offset, the member accessors are recursively set to the offsets of their respective members.
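The translator's example can be sketched in code (all names and the record layout are invented for illustration): one StudentAccessor object is re-bound across 600 student records packed into a single byte page, so the accessor count stays constant no matter how many students there are:

```java
import java.nio.ByteBuffer;

// Sketch of the translator's example, with assumed names: 600 student
// records (an 8-byte ID each) live in one byte page, and one
// StudentAccessor is re-bound to each record instead of allocating
// 600 Student objects.
public class StudentAccessor {
    private byte[] data;
    private int start;

    /** Binds the accessor to one student record inside the page. */
    public void set(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
    }

    long studentId() {
        return ByteBuffer.wrap(data, start, 8).getLong();
    }

    public static void main(String[] args) {
        int count = 600;
        ByteBuffer page = ByteBuffer.allocate(count * 8);
        for (int i = 0; i < count; i++) {
            page.putLong(1000L + i); // fabricated IDs, for the sketch only
        }
        StudentAccessor accessor = new StudentAccessor(); // the only accessor object
        long sum = 0;
        for (int i = 0; i < count; i++) {
            accessor.set(page.array(), i * 8, 8); // move to the i-th record
            sum += accessor.studentId();
        }
        System.out.println("sum of IDs: " + sum);
    }
}
```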

3.2.3 Running Example

Following the three steps above, we transform the vertex example of section 2.1 into the following form:

public abstract class EdgeListVertexAccessor<
        I extends WritableComparableAccessor, V extends WritableAccessor,
        E extends WritableAccessor, M extends WritableAccessor>
        extends MutableVertexAccessor<I, V, E, M> {
    private I vertexId = null;
    private V vertexValue = null;
    /** by S1: indices of its outgoing edges */
    private ListAccessor<I> destEdgeIndexList = new ArrayListAccessor<I>();
    /** by S1: values of its outgoing edges */
    private ListAccessor<E> destEdgeValueList = new ArrayListAccessor<E>();
    /** by S1: incoming messages from the previous iteration */
    private ListAccessor<M> msgList = new ArrayListAccessor<M>();
    ......
    /** by S2: binds the accessor to a binary region of a vertex */
    public void set(byte[] data, int start, int length) {
        /* This may in turn call the set method of its member objects. */
        ......
    }
    /** by S3: replacing the return type */
    public ListAccessor<I> getEdgeIndexes() {
        ...
    }
    ...
}

 

In the code snippet above, the comments mark which transformation step (S1-S3) produced each change. Figure 4 shows a heap snapshot at runtime: the real data is laid out in pages, and one accessor graph processes all the vertices, one at a time. For each vertex, the set method binds the accessor graph to that vertex's byte region.

4. Experiment

[Translator's note] The original paper goes on to run a series of comparative experiments with various algorithms. They are not translated in detail here; several performance comparison charts from the results are reproduced for reference, and interested readers should consult the original paper.


 

[Translator's summary]

The essence of the article, put bluntly, is: do not create objects; use plain byte arrays. Byte arrays not only reduce memory usage but also spare GC from traversing masses of objects. The idea resembles that of distributed small-file storage systems (packing small files into fixed-size chunks to avoid the metadata waste and lookup overhead of massive numbers of small files). But this is easier said than done. We use Java to enjoy design patterns, a uniform style, and the convenience of object orientation, and now we are to trade all that for byte arrays: how is that different from C? Even if techniques such as the accessor graph compensate for the loss, it is hard to preserve the sense of elegance that object orientation gave us.

Still, the 80/20 principle applies: typically 20% of the code, or even less, determines your system's performance. The translator believes you cannot have both the fish and the bear's paw: we can keep object-oriented principles in the larger architecture, the domain model, and the ER design, while making the compromise this article advocates on the data path.

 

 



Original article: asterix.ics.uci.edu
Translated for ImportNew.com by Chu Xiaoying
http://www.importnew.com/5061.html
