Memory Leak Analysis and Sharing

Tags: jprofiler

The busier things get, the messier they become. I was in the middle of optimizing our memcache client code when an out-of-memory (OOM) problem suddenly showed up in SIP. OOM is the kind of problem developers dread most; we had run stress tests before, and earlier versions never showed this behavior, so the cause was most likely a change in the last release. I set up JProfiler against the system in the test environment and started a stress test to analyze where the memory was going.

 

Problem 1: Connection Pool Leakage

Seeing this problem, many people will probably say: we used a ready-made open-source connection pool, not hand-written JDBC, so how can this still happen? Let's look at the symptoms first.

 

Scenario: the test team ran a LoadRunner stress test and found that instances of many business objects kept increasing. According to the business flow, these objects should be released automatically once a request has been processed (and this was verified in the local development environment).

 

JProfiler:

 


You can see that many business objects have piled up and are holding a lot of memory. After stopping the stress test we waited a while and then triggered garbage collection through JProfiler; even after the GC records appeared in the JBoss console output, these objects were still there. In other words, they had become one of the causes of the memory leak. Yet, as I said, in local and white-box testing these objects are released after each request and are not referenced by any other map. Looking at the allocation call tree of these objects in JProfiler, the source is the servlet handling the request, so why are they not being cleared? Let's look at the next two figures.

 

 

Knowing that the objects existed and were being held, the next step was to check the state of the threads. Many of them were in the WAITING state (you can also see this from a thread dump on the server). In the figure I picked one of the waiting threads: it was blocked inside the increment() call of the iBATIS Throttle. Looking at the iBATIS source, this is actually part of the iBATIS connection pool; when the pool is full, the code waits for a connection to be released. In other words, the application had exhausted the connection pool.
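To make the waiting state concrete, here is a minimal sketch of the "block while the pool is full" pattern; this is only an illustration of the idea, not the actual iBATIS Throttle source. Threads calling increment() wait until some other thread calls decrement() to release a slot, which matches the WAITING threads seen in the thread view.

```java
// Simplified illustration (not the real iBATIS Throttle): a bounded counter
// where callers block when the limit is reached and are woken on release.
public class SimpleThrottle {
    private final int limit;
    private int count;

    public SimpleThrottle(int limit) {
        this.limit = limit;
    }

    public synchronized void increment() {
        // Threads pile up here when the pool is exhausted -- exactly the
        // WAITING state observed above.
        while (count >= limit) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        count++;
    }

    public synchronized void decrement() {
        count--;
        notifyAll();
    }
}
```

If decrement() is never reached, for example because a failed transaction never returns its connection, every new caller of increment() waits forever, which is exactly the symptom described below.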

To verify whether the database connections were really exhausted, I asked our DBA to show me the MySQL connections (the current day's log data is stored in MySQL). There were only eight connections, so the pool was not actually exhausted on the database side. The DBA told me that all eight connections were running the same query: counting the access records and traffic of a certain API. In the current business flow there are two kinds of MySQL operations:

1. The statistical query used to create access-control counters.

Memcache counters are used to control access to the open APIs. If no counter has been created yet for an API, the data in MySQL is analyzed and a counter is created. Besides inserting the access record into the database, subsequent accesses only accumulate into the counter, so access control can efficiently use the centralized counters without querying the database every time.

2. Batch asynchronous log writing.

For open API access records, each thread in the thread pool maintains an in-memory page. When the page is full or the flush interval is reached, the accumulated records are written to the database in one batch to relieve write pressure, and the batch is committed in a single transaction. (A rough sketch of this pattern is shown below.)
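As a rough illustration of point 2, here is a minimal sketch of such a batch writer; the names (BatchLogWriter, PAGE_SIZE, flushToDatabase) are made up for the example and are not the project's real classes. Each worker thread buffers records in its own page and flushes them in one batch when the page is full or the refresh interval has passed.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: per-thread buffering of log records with a size- or
// time-triggered batch flush, committed as a single transaction.
public class BatchLogWriter {
    private static final int PAGE_SIZE = 500;             // flush after this many records
    private static final long FLUSH_INTERVAL_MS = 60000L; // or after this much time

    // One buffer per worker thread, so appending needs no locking.
    private final ThreadLocal<List<String>> page = new ThreadLocal<List<String>>() {
        protected List<String> initialValue() {
            return new ArrayList<String>();
        }
    };
    private final ThreadLocal<Long> lastFlush = new ThreadLocal<Long>() {
        protected Long initialValue() {
            return Long.valueOf(System.currentTimeMillis());
        }
    };

    public void append(String record) {
        List<String> buffer = page.get();
        buffer.add(record);
        long now = System.currentTimeMillis();
        if (buffer.size() >= PAGE_SIZE || now - lastFlush.get().longValue() >= FLUSH_INTERVAL_MS) {
            flushToDatabase(buffer);   // one batched INSERT committed in one transaction
            buffer.clear();
            lastFlush.set(Long.valueOf(now));
        }
    }

    private void flushToDatabase(List<String> records) {
        // In the real system this is a batch insert through the persistence
        // layer (iBATIS here), committed as a single transaction; omitted.
    }
}
```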

 

For the first operation, MySQL was designed to keep only the current day's data, and the statistical task runs only once when the system starts, so the database pressure and SQL execution time should not be high. However, the stress test had been running since the previous afternoon, so the table already held tens of millions of rows. When we restarted and began the stress test again, the SQL used to create the counters ran very slowly. At the same time, the log batches were being committed inside transactions. I am not deeply familiar with MySQL internals, but the problem seems to lie here: because of the slow query and the batched transaction commits, a transaction may have failed and the signal to release the connection was never correctly propagated back to iBATIS, so the connection resources looked exhausted.

I deleted all the records in the database, restarted, and started the stress test again. The problem was gone and the objects were collected in time. We will keep following up on this issue. Earlier versions of iBATIS had a deadlock problem in this class that a later upgrade fixed; I have also seen quite a few people abroad report that 2.2 and 2.3 can still deadlock, but personally I think it may still be related to the database.

 

Question:

I still have a question behind this issue. Normally, if an ordinary HTTP request times out it is interrupted automatically, but in this scenario the resources were still not released after waiting for an hour. In other words, the client had long since disconnected, yet JBoss did not seem to release the requests stuck in these transactions, so the resources stayed jammed.

 

Problem 2: LinkedBlockingQueue Makes Trouble

Since JDK 1.5, the java.util.concurrent package has provided many convenient and efficient building blocks. I have used LinkedBlockingQueue in many places as the data channel between producers and consumers: consumers wait at the queue for the producers to hand over data, then process it in parallel once they get it. Here I use queue.poll(100, TimeUnit.MILLISECONDS) to fetch data in semi-blocking mode. I had actually heard before that LinkedBlockingQueue might leak memory; quite a few people on the Internet mention the problem, say it was not solved in 1.5 and was supposedly fixed in 1.6, but with no evidence it was hard to judge. After the first problem was solved we kept looking for potential bugs, and by accident noticed that instances of one class kept increasing over time. Because a single instance occupies very little memory the growth was not obvious in memory terms, but the growth in instance count was. Look at the following two figures:

 

 

 

 

 

 

The two screenshots were taken about two hours apart. You can see that the number of instances of this object has grown a lot, and so has the memory it consumes. Looking at the allocation tree of this object, it is created by the poll method, that is, by the worker threads of the thread pool periodically polling the queue. There was no traffic at all during this period, yet the count kept climbing. I tried changing poll(100, TimeUnit.MILLISECONDS) to the plain poll() and the objects kept growing, so a large part of the server-side memory leak comes from here. We did not notice it earlier because there were not many users shortly after SIP went live; this time more sub-users were involved, and the update requests in the API consume more memory, which made the problem easy to spot.
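For reference, here is a minimal sketch of the kind of consumer loop in question; the names (QueueConsumer, handle) are illustrative, but the semi-blocking queue.poll(100, TimeUnit.MILLISECONDS) call is the one discussed above.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch of a consumer thread that fetches work from a LinkedBlockingQueue
// in semi-blocking mode: it wakes up every 100 ms even when there is no data.
public class QueueConsumer implements Runnable {
    private final LinkedBlockingQueue<String> queue;
    private volatile boolean running = true;

    public QueueConsumer(LinkedBlockingQueue<String> queue) {
        this.queue = queue;
    }

    public void run() {
        while (running) {
            try {
                // On the JDKs discussed above, this timed poll is where the
                // ever-growing instances were allocated, even with no traffic.
                String task = queue.poll(100, TimeUnit.MILLISECONDS);
                if (task != null) {
                    handle(task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public void stop() {
        running = false;
    }

    private void handle(String task) {
        // Business processing goes here.
    }
}
```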

So does 1.6 solve the problem? I first tried JDK 1.6.0_01, which was already on the machine, and the problem persisted. Then I downloaded the latest 1.6.0_07 from Sun and found that the objects do get collected, but collection and growth coexist. The numbers looked roughly like this:

1. 1000 instances, 31 KB

2. 200 instances, 6 KB (partially collected)

3. 1500 instances, 46 KB (more growth than before)

4. 300 instances, 9 KB (partially collected)

5. 2000 instances, 62 KB (more growth than before)

 

In other words, collection does happen from time to time, but the overall trend is still upward. This really deserves careful testing. If you are interested you can try my test method: just create a LinkedBlockingQueue and poll it periodically with a timeout; on 1.5 the instance count definitely keeps growing. (A sketch of such a test follows.)
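Roughly, the test harness looks like the sketch below (names and constants are illustrative): an empty LinkedBlockingQueue polled periodically with a timeout, while you watch the instance counts in a profiler or via GC output.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Minimal reproduction sketch: no producer ever offers anything, so every
// poll times out, yet on the JDK versions discussed above the instance
// count still climbs over time.
public class PollLeakTest {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<Object>();
        while (true) {
            Object item = queue.poll(100, TimeUnit.MILLISECONDS);
            if (item != null) {
                System.out.println(item);   // never reached in this test
            }
        }
    }
}
```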

For this problem I can only keep verifying. If it turns out to be unavoidable for now, the only option is to consider an alternative solution.

 

That is today's share on memory leaks. I hope it is useful to other people who have run into, or will run into, similar problems. If you have the means, analyzing performance with JProfiler is absolutely worthwhile; if not, thread dumps and GC output can also locate the problem.

This morning a friend left me a message suggesting I replace the poll method with take. I had been using poll for its timeout, but in the current scenario take() works just as well. After testing this morning, take() does not show the problem, so I replaced poll with take. (The change is sketched below.)
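Concretely, the change is just the fetch inside the consumer loop sketched earlier (illustrative names): take() blocks until an element arrives, so idle worker threads no longer wake up every 100 ms.

```java
// Replacement run() loop for the QueueConsumer sketch above.
public void run() {
    while (running) {
        try {
            String task = queue.take();   // was: queue.poll(100, TimeUnit.MILLISECONDS)
            handle(task);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
    }
}
```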

Reposted from: http://blog.csdn.net/cenwenchu79/archive/2008/09/18/2949103.aspx

 
