A Real-World Case of Application Memory-Consumption Optimization

Source: Internet
Author: User
Tags: deflater, thread, logic

Here we share the memory-consumption optimization of the master in a distributed analysis system. Some of the more specific optimizations may not apply to other systems, but the overall process suggests optimization points worth considering when designing other systems. The background is described first; after reading it, the subsequent optimization points will be clearer. Note that some of these designs apply only to computation-heavy workloads and sacrifice maintainability in exchange for performance; the final optimizations are more universal.
Background:
The open platform generates a large number of call logs every day, and we want to analyze business indicators and system health from these massive logs in near real time. The current implementation resembles a MapReduce design, but the master and slave are loosely coupled, which makes it easier to scale and to analyze instantly than traditional MapReduce. (Of course it differs from Hadoop in both scale and function: it targets variable statistical rules and real-time analysis, with data volumes under a terabyte.) A statistical rule engine is embedded, so statistical logic can be defined by configuration alone. The deployment diagram and flowchart are as follows:


The process is as follows:


There is no registration or management relationship between Master and Slave. A Slave connects to the Master to request a task; from the task it obtains the data source, parses the rule configuration, obtains the data-block size information, then pulls the data block for analysis, and finally returns the result to the Master. At that point the interaction between Master and Slave ends.

The Master is mainly responsible for:

1. Creating and resetting the task list (the task list is created from configuration information; because analysis of the application servers is incremental, all tasks are reset periodically so that Slaves incrementally pull data from the application servers for analysis).

2. Resetting assigned tasks that have not been executed for a long time, to guard against a Slave failing a task without reporting back.

3. Merging result sets of tasks completed by Slaves into the master result set.

4. Periodically outputting the trunk result set to third parties (alarms, graphs, etc.) and exporting intermediate results so that the Master can restore its state after an abnormal restart.

Problem:
As the number of report configurations grew and the data produced for a single report became large, the Master's memory became tight while merging multiple result sets, and GC turned into a vicious circle. So optimizing the Master, originally considered a light-duty component, became urgent.
Optimization process:

1. During the merge, the main result set and the slave result sets are relatively large. After the operation, can we proactively clear them and set them to null to release resources faster? (Basically no effect; the GC is already well optimized.)

2. The analyzer produces <key, value> result sets according to the definitions; results with the same key are concatenated as key, value1, value2, value3 (the traditional report format). We found that some configured <key, value> rules are never used in the actual output reports, so these configurations can be filtered out directly when the analysis rules are built. (In many systems, some results are only intermediate. If it can be determined at system startup which intermediate results are unneeded, the computation of those intermediate results can be filtered out, saving both CPU and memory; a warning can also be emitted, since a configuration error may be what makes that data unused.)

3. Calendar is used in many places in the system, for example to obtain the year, month, day, hour, minute and second, or to format content for classified output. Because Calendar is not thread-safe, it has to be constructed on a large scale, which in practice consumes a lot of memory.

Transformation: use long everywhere instead. System.currentTimeMillis() has a cost, but a small one. The year, month, day, hour, minute and second can be computed with division and remainder (note that China's time difference of 8 hours must be considered when computing the day). Also, if an intermediate result will later be output, it is tempting to format it early for readability; we recommend keeping the numeric type inside the system and formatting only at output time. (This depends on how often intermediate results are output versus reused internally: if the formatted form is reused many times, formatting once internally avoids repeated formatting.)
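The division-and-remainder approach above can be sketched as follows. This is a minimal illustration, not the system's actual code; the class and method names are made up, and it deliberately ignores months and leap years, which need real calendar math.

```java
// Sketch: deriving time-of-day fields from a raw epoch-millis value with
// plain arithmetic, instead of constructing a Calendar per record.
// Names (TimeFields, hourOfDay, ...) are illustrative, not from the system.
public class TimeFields {
    static final long MINUTE_MS = 60_000L;
    static final long HOUR_MS = 3_600_000L;
    static final long DAY_MS = 86_400_000L;
    static final long CN_OFFSET = 8 * HOUR_MS; // China is UTC+8, no DST

    /** Hour of day (0-23) in China local time. */
    static int hourOfDay(long epochMillis) {
        return (int) (((epochMillis + CN_OFFSET) / HOUR_MS) % 24);
    }

    /** Minute of hour (0-59); unaffected by a whole-hour offset. */
    static int minuteOfHour(long epochMillis) {
        return (int) ((epochMillis / MINUTE_MS) % 60);
    }

    /** Days since the epoch in China local time (note the +8h shift). */
    static long dayIndex(long epochMillis) {
        return (epochMillis + CN_OFFSET) / DAY_MS;
    }
}
```

Because these are pure long operations, they are thread-safe and allocation-free, which is exactly what a Calendar per record is not.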

4. After observing for a while, we found that at peak hours the result sets returned by Slaves reach 5-6 MB or more. The Master's overhead for handling multiple Slaves is high (there is one receive buffer per Slave). Therefore, to reduce network and send/receive-buffer consumption, the result Map is compressed before transmission.

Transformation: we considered QuickLZ, a simple open-source compression class, and compared it with DeflaterOutputStream. The compression ratios were about equal, and speed was not compared further, because the OutputStream pipeline integrates better. The code is as follows:

ByteArrayOutputStream bout = new ByteArrayOutputStream();

Deflater def = new Deflater(Deflater.BEST_COMPRESSION, false);

DeflaterOutputStream deflaterOutputStream = new DeflaterOutputStream(bout, def);

ObjectOutputStream objOutputStream = new ObjectOutputStream(deflaterOutputStream);

In the end, the ByteArrayOutputStream becomes the data source of the ByteBuffer. (Compression reduces the cost of network transmission and receive buffering. At the time, this was not adopted because the data was small and we preferred to save CPU. The scenario has since changed, so now CPU is spent to save memory. Optimize according to the scenario and weigh the gains and losses.)
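The four lines above can be completed into a runnable sketch. The decompression side and the def.end() cleanup are my additions, not part of the original snippet:

```java
import java.io.*;
import java.util.zip.*;

// Sketch of the serialize-and-deflate chain from the article, plus its inverse.
public class ResultCompressor {
    /** Serialize an object and deflate it into a byte[] ready to back a ByteBuffer. */
    static byte[] compress(Serializable obj) throws IOException {
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        Deflater def = new Deflater(Deflater.BEST_COMPRESSION, false);
        try (ObjectOutputStream oos =
                     new ObjectOutputStream(new DeflaterOutputStream(bout, def))) {
            oos.writeObject(obj); // closing oos finishes the deflater stream
        } finally {
            def.end(); // release the native zlib buffers eagerly
        }
        return bout.toByteArray();
    }

    /** Inverse of compress(): inflate and deserialize. */
    static Object decompress(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new InflaterInputStream(new ByteArrayInputStream(data)))) {
            return ois.readObject();
        }
    }
}
```

Calling def.end() matters in a long-running Master: each Deflater holds native zlib memory outside the heap, and waiting for finalization to reclaim it defeats the point of the optimization.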

5. Below is a piece of NIO code for receiving business data. It looks clean and proper, but under high concurrency and massive data it is demon code.

byte[] content = new byte[receivePacket.getByteBuffer().remaining()];

receivePacket.getByteBuffer().get(content);

log.error("package content size: " + content.length);

ByteArrayInputStream bin = new ByteArrayInputStream(content);

After modification:

ByteArrayInputStream bin = new ByteArrayInputStream(
    receivePacket.getByteBuffer().array(), receivePacket.getByteBuffer().position(),
    receivePacket.getByteBuffer().remaining());

This lets the input stream read the ByteBuffer's backing array directly, instead of allocating new memory to copy the data into. In fact, because NIO's Buffer and Channel have no bridging API to Stream, people often allocate memory, read into it, and only then wrap it in a Stream. (Buffer offers many methods for views, slices and the like, precisely to maximize reuse of the underlying data, so weigh carefully whether the data can be reused. Note, though, that the reuse pattern needs careful thought, otherwise the read and write cursors will interfere with each other.)

6. Merging is essentially Reduce. If there are 50 jobs and all 50 job results go to the Master to be merged, the pressure and memory consumption are necessarily large; if some results can be merged on the Slaves first, the Master is relieved. So each Slave request now carries a system-level parameter: the maximum number of jobs the Master may allocate to it. The Master/Slave task protocol was modified so that a Slave can apply for several jobs at once, and the Master returns at most that many, based on task availability. The mechanism for Slaves to execute and merge results in parallel actually existed early on; it was simply underused after we moved from analyzing large files to incremental analysis of HTTP data streams, leaving Slave parallelism idle. (There are many designs like this. At the SD conference I talked about a couple of simple scenarios: the objects returned by Top's business must be formatted as standard XML or JSON. One way is for Top to do this centrally; another is to push part of the business logic outward and share the computing and memory cost across more application nodes. The problem with pushing it out is the high cost of upgrading externalized logic; the advantage of centralized processing is that the logic is easy to maintain and done once for all; the advantage of shared processing is fuller use of resources to handle large-scale problems.)
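The batched allocation protocol above can be sketched as follows. The class and its task representation are invented for illustration; the real system's task objects and protocol are of course richer:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the batched task-allocation rule: a slave asks for up to maxJobs
// tasks and the master hands back at most that many pending ones.
public class TaskAllocator {
    private final Queue<Integer> pending = new ConcurrentLinkedQueue<>();

    TaskAllocator(Collection<Integer> jobs) {
        pending.addAll(jobs);
    }

    /** Returns up to maxJobs tasks; fewer (possibly zero) if the queue runs short. */
    List<Integer> acquire(int maxJobs) {
        List<Integer> batch = new ArrayList<>(maxJobs);
        Integer job;
        while (batch.size() < maxJobs && (job = pending.poll()) != null) {
            batch.add(job); // poll() is atomic, so concurrent slaves never share a task
        }
        return batch;
    }
}
```

Handing out several jobs per request is what lets a Slave pre-merge locally: it only makes sense to merge on the Slave side if the Slave holds more than one result at a time.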

 

7. Observing the heap with jstat's gcutil, we found that besides merging, report output also caused large fluctuations. The cause: to ensure that incremental statistics can continue after an abnormal exit and restart, the in-memory data objects were exported every time the task list was reset and the report was output, making the next load easy. The task-reset period is 3 minutes, meaning intermediate results were exported every 3 minutes, which is far too frequent. So the export interval was extended; after all, exceptions are rare and an export can be triggered on demand by command. In addition, the exported content is now compressed to reduce memory consumption and export time. (We often design protection policies and checks for abnormal cases, but such work should not become a burden on the system. By lengthening the interval and accepting on-demand, real-time triggering, the same protection can be achieved.)

 

8. This optimization looks silly, but its effect is obvious. It illustrates the same point: a small detail can change your program a great deal.

The current result-set format is Map<entryId, Map<key, value>>, where entryId identifies a <key, value> computing definition. The procedure for merging multiple result sets was as follows (pseudocode):

Map<entryId, Map<key, value>>[] needMergeResult; // array of external result sets to be merged

Map<entryId, Map<key, value>> result = new Map<entryId, Map<key, value>>(); // construct a new merged result set

for (j = 0; j < needMergeResult.size; j++) // traverse all result sets
{
    Map<entryId, Map<key, value>> node = needMergeResult[j];

    loop: traverse all entryIds of node
    {
        loop: traverse the Map for each entryId
        {
            merge the value of the current key, according to the rules, with the values of
            needMergeResult[j+1].get(entryId).get(key) through
            needMergeResult[needMergeResult.size-1].get(entryId).get(key),
            then remove those entries from needMergeResult[j+1] through
            needMergeResult[needMergeResult.size-1] to avoid merging them again later
        }
    }
}
That is a lot of code, but the algorithm itself was not what got optimized (if you know a better merge algorithm, let me know). The optimization is removing the line that constructs a new result set as the base (marked in red in the original). The current practice is to select the largest result set before merging and use it as the base; the subsequent processing is unchanged. This saves the memory allocation and makes good use of existing memory, and this step also benefits the final parallel merging of results.
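The largest-as-base trick can be sketched concretely. For illustration, entryId and key are Strings, values are Longs, and the per-entry "rule" is plain summation; the real system's rules differ per entry definition:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the in-place merge: instead of allocating a fresh result map,
// pick the largest input as the base and fold all the others into it.
public class ResultMerger {
    static Map<String, Map<String, Long>> merge(
            List<Map<String, Map<String, Long>>> inputs) {
        // choose the biggest result set as the base, saving one big allocation
        Map<String, Map<String, Long>> base = inputs.get(0);
        for (Map<String, Map<String, Long>> m : inputs) {
            if (m.size() > base.size()) base = m;
        }
        for (Map<String, Map<String, Long>> m : inputs) {
            if (m == base) continue; // the base absorbs the rest
            for (Map.Entry<String, Map<String, Long>> entry : m.entrySet()) {
                Map<String, Long> target =
                    base.computeIfAbsent(entry.getKey(), k -> new HashMap<>());
                // "rule" here is summation; real entry definitions vary
                entry.getValue().forEach((k, v) -> target.merge(k, v, Long::sum));
            }
        }
        return base;
    }
}
```

Note the base is mutated, which is exactly the point: the caller must treat the chosen input as consumed, the same way the article's merge removes already-merged entries.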

9. Besides observing GC during online operation, we also ran the Master and Slave locally with a small data volume under JProfiler. We found a large number of calls to ConcurrentMap.size(), which for a long-running, high-concurrency system is also a real cost (ConcurrentMap partitions its storage internally to improve concurrency, so size() is not a simple counter read). We replaced it with Atomic-type counters, at the cost of increased program complexity. (This optimization depends on the scenario: if the program does not call size() frequently, using it directly is more reliable. Keep in mind that the internals of many of Java's concurrent components are not simple, so when a call is on a hot path, consider alternative implementations.)
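A sketch of the counter approach follows. The wrapper class is invented for illustration; note also the hedged caveat in the comments: the map update and the counter update are two separate atomic operations, not one, which this counting scenario tolerates but a strict invariant would not.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: track the element count in an explicit AtomicLong so hot paths
// never call ConcurrentHashMap.size(), which scans internal structures.
public class CountedMap<K, V> {
    private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();
    private final AtomicLong count = new AtomicLong();

    public void put(K key, V value) {
        // put() returns the previous value, so we only count new keys.
        // Caveat: map update and counter update are not one atomic step;
        // acceptable here where a momentary drift in the count is harmless.
        if (map.put(key, value) == null) count.incrementAndGet();
    }

    public V remove(K key) {
        V old = map.remove(key);
        if (old != null) count.decrementAndGet();
        return old;
    }

    /** O(1) read instead of a structure scan. */
    public long size() { return count.get(); }
}
```

This is the complexity trade the article mentions: every mutation path must now keep the counter honest, which is why size() is still the right call when it is not on a hot path.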

 

10.
10. Changing the Master's main thread away from blocking merge of result sets. The design drawing above shows that in the initial design I used single-threaded blocking merge. The reason was simple: all results are eventually merged into the "trunk", so no matter what, the merge action must be locked, i.e. serialized; it seemed better to simply use one blocking thread.

Symptom: the results processed and pre-merged by Slaves arrive at unpredictable times. As the data size after Slave analysis grew by orders of magnitude, the Master's blocking merge time also grew, and the result sets queued on the merge list piled up (the lifecycle of intermediate results lengthens, which directly increases memory consumption). What is needed is to minimize the memory accumulated by backlogged processing and improve the Master's memory usage.

The optimization process is as follows:

1. The main thread only dispatches merge work; the merging itself runs in an external thread pool. (Question: how can multiple threads merge into the trunk concurrently? If a lock is used, only one thread executes the merge at a time, so it is still a serialized operation.)

2. Adopt a two-tier merge design:

1. The Master's main thread collects the result sets to be merged (both the result sets submitted by Slaves and the pre-merged intermediate result sets mentioned below).

2. The Master's main thread dispatches merge tasks to the thread pool.

3. Before executing, a thread-pool worker tries to acquire the trunk lock.

4. If it acquires the trunk lock, it merges all its results into the trunk result set.

5. If it does not acquire the trunk lock, it merges its result sets with each other, puts the merged intermediate result set into the queue, and waits for the next merge round.

First, the merge method from point 8 works on a chosen base result set, so multi-group merging is acceptable in resource consumption (memory is consumed only during computation), and a large number of original results is replaced by a small number of intermediate results. Second, no merge sits waiting to merge with the trunk; everything can run in parallel, and merging with the trunk needs no special handling. The worker-thread logic is uniform with no special cases, which improves thread-pool utilization.
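The five steps above can be sketched as a single worker method. This is a simplified illustration, assuming flat Map<String, Long> results merged by summation; the class and field names are invented:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the two-tier merge: a worker first pre-merges its batch, then
// tries the trunk lock; winners fold into the trunk, losers queue their
// intermediate result instead of blocking.
public class TrunkMerger {
    final Map<String, Long> trunk = new HashMap<>();
    final ReentrantLock trunkLock = new ReentrantLock();
    final Queue<Map<String, Long>> intermediates = new ConcurrentLinkedQueue<>();

    void mergeBatch(List<Map<String, Long>> batch) {
        // Pre-merge the batch into one map while holding no lock at all.
        Map<String, Long> combined = new HashMap<>();
        for (Map<String, Long> m : batch) {
            m.forEach((k, v) -> combined.merge(k, v, Long::sum));
        }
        if (trunkLock.tryLock()) {          // trunk free: fold in immediately
            try {
                combined.forEach((k, v) -> trunk.merge(k, v, Long::sum));
            } finally {
                trunkLock.unlock();
            }
        } else {                            // trunk busy: queue, don't wait
            intermediates.add(combined);    // the dispatcher re-feeds these later
        }
    }
}
```

The key property is that the serialized section shrinks to folding one already-combined map into the trunk, while the expensive batch combination runs fully in parallel, which is exactly the article's point about uniform worker logic.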

Problems: the arrival time of the Slaves' original results is hard to predict, so the Master frequently performs small-scale merges and repeatedly re-merges intermediate results, wasting computing resources.

3. Improvements based on the problems raised in point 2. First, the Master gained two system parameters: the minimum batch size for merge execution and the maximum task-accumulation wait time. Merge work is submitted to the thread pool only when the minimum batch size is reached or the maximum wait time has elapsed. (Once "current result sets + merged intermediate result sets = total of all results", this condition no longer applies.) Second, intermediate results no longer participate in non-trunk merges, unless they are the final results awaiting the trunk merge. The modified flowchart is as follows:

After a preliminary configuration test online, the effect was obvious: memory usage improved, release sped up, and overall execution time shortened.

In fact, this optimization point has a general lesson behind it: when all operations must ultimately be serialized at a bottleneck point under a lock, the simplest approach is plain serial processing (simple and efficient). However, in some scenarios optimizations can still be made:

1. Save resources. If parallel pre-processing can reduce resource consumption before serialization, then with overall transaction time unchanged, resources are used more fully (and with sufficient spare resources, the system's processing capacity may even increase, indirectly shortening transaction time).

2. Reduce pre-processing computation time. Simply put: sharpening the axe does not delay the woodcutting. In an earlier article on task splitting and the event-driven model, the biggest advantage of splitting tasks is that resources consumed in different stages are released sooner, and stages that can run in parallel do. If 10 people need a phone book and there is 1 phone booth, letting each caller look up the number outside the booth and then just dial inside makes the lookups parallel on a small scale, which improves the efficiency of the serial part.

One last point that matters in parallel computing: when designing Slaves for multi-threaded task processing, the worker logic should be as undifferentiated as possible. The task is a self-contained description, and the worker is a general logic engine; this keeps task scheduling simple and easy to scale out across thread pools or processes on different machines.

Optimization looks simple, but there is real skill in how you observe, analyze, solve and summarize problems. Only when these steps are done well is optimization possible; otherwise it is empty talk.

This article is transferred from www.35java.com 
