Objective
Most developers have probably run into OutOfMemoryError at some point. Compared with common business exceptions (array index out of bounds, null pointers, and so on), this kind of problem is much harder to locate and fix.

This post walks through how a recent memory overflow in production was located and resolved, in the hope of giving anyone who runs into a similar problem some ideas and help.
The analysis and solution follow four steps: symptoms --> investigation --> locating --> solution.
Symptoms
Recently one of our production applications kept throwing OutOfMemoryError, and as business volume grew the exceptions became more and more frequent.

The program's business logic is very simple: it consumes data from Kafka and persists it in batches.

The symptom was that the more Kafka messages there were, the more frequent the exceptions became. Since there was other work to deal with at the time, all we could do was restart the application and keep monitoring heap memory and GC.

Restarting may work wonders for a while, but it does not solve the problem at all.
Investigation
So we tried to work out where the problem occurred from the memory data and GC logs collected before the restart.

It turned out that old-generation memory usage stayed high even after GC and kept climbing over time.

Combined with the jstat output, we found that even Full GC failed to reclaim the old generation; memory stayed pinned at its ceiling.

Some instances had even gone through hundreds of Full GCs, with frighteningly long accumulated pause times.

This shows that the application's memory usage was definitely problematic: a large number of stubborn objects were never being reclaimed.
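(The jstat figures above came from the production machines. Purely as an illustration of what those numbers correspond to, here is a small sketch, not part of the original troubleshooting, that reads similar statistics from inside a running JVM via the standard java.lang.management API; matching pool names on "Old" is a simplification, since the exact pool names depend on the collector in use.)

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class GcStats {
    public static void main(String[] args) {
        // Old-generation usage, roughly what jstat reports for the old space.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Old")) {
                System.out.printf("%s: used=%d max=%d%n",
                        pool.getName(), pool.getUsage().getUsed(), pool.getUsage().getMax());
            }
        }
        // Collection counts and accumulated time, roughly jstat's FGC/FGCT columns.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```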
Locating
Because our heap was configured so large, the production memory dump file was huge, reaching dozens of gigabytes, and analyzing it with MAT would have taken a long time.

So we wanted to reproduce the problem locally in order to locate it more effectively.
To reproduce the problem as quickly as possible, I set the local application's maximum heap to 150M.

Then I mocked the Kafka consumption with a while loop that keeps generating data.

When the application started, I used VisualVM to monitor memory and GC usage.
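The mock itself is not shown in the post; the sketch below is only a stand-in for the idea, with an ArrayBlockingQueue playing the role of the real consumer pipeline and purely illustrative names. Run it with a small heap (for example -Xmx150m) and attach VisualVM to watch heap and GC behaviour.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MockLoop {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the real pipeline; in the actual application this was the Disruptor.
        BlockingQueue<String> pipeline = new ArrayBlockingQueue<>(1024);

        // Drain on another thread, mimicking the downstream batch-persistence step.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    pipeline.take(); // pretend to persist the record
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // The mock produces ONE record per iteration; this turned out to be the
        // key difference from production, where each Kafka fetch returned hundreds.
        long i = 0;
        while (true) {
            pipeline.put("mock-message-" + i++);
        }
    }
}
```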
After running for 10 minutes, memory usage showed no problem. As the figure shows, each round of GC effectively reclaimed the memory that had been allocated, so this did not reproduce the problem.

A problem that cannot be reproduced is hard to locate. So we reviewed the code and found that the production logic was not quite the same as the data we mocked in the while loop.

Checking the production logs showed that every fetch from Kafka returned several hundred records, whereas our mock produced only one at a time.

To simulate production as closely as possible, I ran a producer program on the server that kept sending data to Kafka continuously.
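The producer program is not shown either; a minimal sketch of such a load generator, assuming a local broker at localhost:9092 and a placeholder topic name, could look like this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LoadProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            long i = 0;
            // Keep sending so the consumer's fetches return large batches,
            // which is exactly what the single-record mock was missing.
            while (true) {
                producer.send(new ProducerRecord<>("demo-topic", "msg-" + i++)); // placeholder topic
            }
        }
    }
}
```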
Sure enough, after little more than a minute memory hit the ceiling. Looking at the monitoring on the left, the GC frequency was very high, but the amount of memory reclaimed each time was tiny by comparison.

At the same time, the console started printing OutOfMemoryError, so the problem was reproduced.
Solution
The current symptom is that many objects in memory are held by strong references and therefore never get reclaimed.

So I wanted to see exactly which objects were occupying so much memory. VisualVM's heap dump feature can immediately dump the application's current heap.
It turned out that objects of type com.lmax.disruptor.RingBuffer occupied nearly 50% of the memory.
Seeing this package immediately brought the Disruptor ring queue to mind.

Reviewing the code again revealed that the 700 or so records fetched from Kafka in each batch were thrown directly into the Disruptor.
This also explains why the first round of mock data did not reproduce the problem.

The mock put a single object into the queue each time, while production put 700 records in each time, a 700-fold difference in data volume.

And since the Disruptor is a ring queue, the objects sitting in its slots are not released until they are overwritten.
I also ran an experiment to confirm this.

I set the queue size to 8 and wrote 10 values, 0 through 9. When value 8 is written it overwrites the earlier 0, and so on (similar to how a HashMap maps a key to a slot).
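A minimal sketch along the lines of that experiment, assuming the Disruptor 3.x DSL API (this is not the author's exact demo code; see the repository linked at the end for that):

```java
import com.lmax.disruptor.EventTranslatorOneArg;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class RingWrapDemo {
    // Each slot holds one event; in production each event carried a whole Kafka batch.
    static class LongEvent {
        long value;
    }

    public static void main(String[] args) throws InterruptedException {
        // Ring size 8; the Disruptor requires a power of two.
        Disruptor<LongEvent> disruptor =
                new Disruptor<>(LongEvent::new, 8, DaemonThreadFactory.INSTANCE);
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("consumed " + event.value + " (seq " + sequence + ")"));
        RingBuffer<LongEvent> ringBuffer = disruptor.start();

        EventTranslatorOneArg<LongEvent, Long> translator = (event, seq, v) -> event.value = v;

        // Publish values 0..9 into the 8-slot ring: sequences 8 and 9 map back to
        // slots 0 and 1. Whatever a slot's event references stays reachable until
        // that slot is written again, which is exactly why the batches piled up.
        for (long i = 0; i < 10; i++) {
            System.out.println("value " + i + " -> slot " + (i & 7));
            ringBuffer.publishEvent(translator, i);
        }

        Thread.sleep(200); // let the daemon consumer thread drain before exiting
        disruptor.shutdown();
    }
}
```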
So in production, assuming our queue size is 1024, then as the system runs all 1024 slots will eventually be filled with objects, and each slot holds 700 records!
So we checked the Disruptor RingBuffer configuration in production, and the result was: 1024*1024.

That order of magnitude is terrifying.
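A rough back-of-the-envelope estimate shows why that order of magnitude matters, taking the roughly 700 records per fetch seen in the production logs as the payload of each slot:

```java
public class RetainedEstimate {
    public static void main(String[] args) {
        long slots = 1024L * 1024L;  // production ring buffer size: 1,048,576 slots
        long recordsPerSlot = 700L;  // roughly one Kafka batch stored per slot
        long retained = slots * recordsPerSlot;
        // Once the ring has wrapped at least once, on the order of 734 million
        // records can be kept reachable by the ring buffer alone.
        System.out.println("retained records ~ " + retained);
    }
}
```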
To verify whether this was really the problem, I changed the value locally to 2, a minimal value, just to try it out.

With the same 128M of heap, and still pulling data continuously through Kafka, the monitoring showed the following:

It ran for 20 minutes and the system was completely fine; every GC reclaimed most of the memory, producing the familiar sawtooth pattern in the end.

That located the problem. The exact value to use in production still has to be determined by testing against business conditions, but the original 1024*1024 absolutely cannot be used any more.
Summary
Although in the end it came down to changing just one line of code (actually not even that; we simply modified the configuration), I think this troubleshooting process was worthwhile.

It also gives those who find the JVM an intimidating black box a more intuitive feel for it.

At the same time, it is a reminder that as good as Disruptor is, it must not be used carelessly!
The related demo code can be viewed here:
Github.com/crossoverjie/jcsprout/tree/master/src/main/java/com/crossoverjie/disruptor
Your likes and shares are the biggest support for me.