The background is that during testing of the new version of our cluster application (three machines), overall JVM memory usage started at about 1 GB but slowly climbed to 4 GB after running for a while, which is clearly abnormal.
Troubleshooting process:
1. First, follow the reproduction steps on one of the machines and try to reproduce the problem. When the problem occurs, open the JDK's jvisualvm on that machine to observe JVM memory usage. It is immediately obvious that GC is running very intensively: the GC events are packed so densely that they almost merge together, and as time goes on the heap curve keeps creeping upward. This suggests that some code is looping and churning out large numbers of garbage objects that are frequently collected.
The main JVM startup parameters are -Xms1024m -Xmx1096m -XX:MaxPermSize=512m -Xmn512m -Xss512k -XX:+UseParallelGC.
Here we can see that even with frequent GC the heap keeps rising and is getting close to the configured upper limit.
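For reference, with these parameters the full startup command would look roughly like the following (the jar name is a placeholder, not the real application):
java -Xms1024m -Xmx1096m -XX:MaxPermSize=512m -Xmn512m -Xss512k -XX:+UseParallelGC -jar cluster-app.jar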
2. Next, dump the heap and the threads:
jps (to find the PID)
jmap -dump:format=b,file=heapdump.hprof <PID>
jstack <PID> > threaddump.txt
Open the thread dump with TDA and check whether there is any deadlock or resource contention. TDA reports no deadlock, but most of the suspicious threads are waiting on a monitor.
Looking through the dump, most of the sleeping threads are blocked on a java.util.Vector, which looks like a side effect of some lower-level I/O operations:
"TheadPool:AuditLookup:Waiting" prio=6 tid=0x0000000008e8c800 nid=0x968 in Object.wait() [0x000000000e45f000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000c20d42d0> (a java.util.Vector)
        at com.***.util.BlockingQueue.remove(BlockingQueue.java:114)
        - locked <0x00000000c20d42d0> (a java.util.Vector)
        at com.***.util.ThreadPool$PooledThread.run(ThreadPool.java:184)
   Locked ownable synchronizers:
        - None
3. With the hprof file in hand, open it in the open-source Eclipse Memory Analyzer (MAT), generate the leak suspects report, and check the overview.
Here we can see that byte[] instances take up a large share of the retained memory, about 75%. Let's look at the details.
4. The report then shows the common path to the accumulation point, and more detail can be drilled into from there. We can see that the memory is held by the message service: it deserializes incoming data through an ObjectInputStream, whose internal HandleTable holds a large number of binary byte arrays (some package names have been redacted).
The message service relies on the JDK's ObjectInputStream underneath, so that is where the suspicion falls. After reading the related documentation and the internal HandleTable code of ObjectInputStream (the HandleTable keeps references to the objects read from the stream so that back-references can be resolved), here is the main snippet on our side:
HttpURLConnection huc = (HttpURLConnection) new URL(url).openConnection();
ObjectInputStream ois = new ObjectInputStream(huc.getInputStream());
ois.readUTF();
ois.readObject();
It simply opens the input stream and reads a string and an object; nothing special about it. So where is the problem? Searching further, several articles point out that ObjectInputStream carries a risk of memory leaks.
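To illustrate the kind of misuse those articles describe (this is not the application's code, and the host and port are made up): when a single ObjectOutputStream/ObjectInputStream pair stays open for a long time and the sender never calls reset(), the handle tables on both ends keep a reference to every object that has passed through the stream, so memory grows with the traffic. A minimal sketch:

import java.io.ObjectOutputStream;
import java.net.Socket;

// Illustration of the leak-prone pattern, not the application's code.
// A long-lived ObjectOutputStream that never calls reset() keeps every written
// object in its handle table, and the peer's ObjectInputStream mirrors it.
public class LeakyPublisher {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("peer-host", 9999);   // hypothetical peer
             ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
            for (int i = 0; i < 1_000_000; i++) {
                out.writeObject(new byte[8 * 1024]);   // each array stays referenced by the handle tables
                out.flush();
                // out.reset();                        // without this, references accumulate on both ends
            }
        }
    }
}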
That does not seem to be the starting point of the problem, though; ObjectInputStream only leaks like this when it is misused. So back to the business code. Since the problem shows up when three machines are connected but not with two, we keep checking the logs and start the application from the command line. The log shows the message-sending method scrolling past constantly: the two machines keep sending messages to each other. Observing the network usage of the system:
The network traffic on machine A has jumped from a low send rate (under 1.1 kbps) up to the Mbps range, a very obvious broadcast storm; the same is true for machine B, while the third server, C, shows no noticeable network fluctuation.
So there is clearly a problem in the message distribution logic. A brief introduction to that logic is needed here, otherwise the fix cannot be explained. The distribution logic was changed in this upgrade; the previous logic was as follows:
The cluster has one host (A); the other machines are extensions (B, C). When extension B sends a message, A determines the message type; for a broadcast message, A broadcasts it to all the extensions in its list (steps 2 and 3). During this process each extension checks whether it is itself the source of the message, so that messages are not sent around repeatedly. If C wants to send a message to B, it can only send it to host A first, and A forwards it to B.
This upgrade broke that logic. A new message host was added to this previously closed message group; the new host has its own extensions and forms another message distribution ring, but the sending logic does not take the new host into account, as shown in the figure below:
After B sends a message to A, host A is designed to distribute it to all of its extensions (B, C, D); because the other group's host, D, is also registered as an extension of A, B's message is propagated to D. After receiving it, D relays the message to its own extensions (E, F, ...) and also sends it back to A (step 3). A and D then keep forwarding messages whose source is B between each other, which creates and destroys a huge number of message objects, causing the memory churn and the network congestion.
After the distribution mechanism of the two message hosts, A and D, was modified according to this logic, the problem was solved: memory and GC returned to normal and the network traffic dropped back down.
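The article does not show the modified code, but a minimal sketch of loop-safe forwarding under these assumptions (class and field names are made up for illustration) could look like this: each node drops messages it has already forwarded and never sends a message back toward the node it came from or its original source.

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the application's real classes: forwarding that
// cannot loop between two hosts such as A and D.
class Message {
    final String id;          // unique per message
    final String sourceNode;  // the extension that originally sent it
    Message(String id, String sourceNode) { this.id = id; this.sourceNode = sourceNode; }
}

class MessageHost {
    private final String name;
    private final List<MessageHost> peers;                          // extensions and peer hosts to forward to
    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // message ids already forwarded

    MessageHost(String name, List<MessageHost> peers) {
        this.name = name;
        this.peers = peers;
    }

    void receive(Message msg, MessageHost from) {
        if (!seen.add(msg.id)) {
            return;                                   // already forwarded: this breaks the A <-> D loop
        }
        for (MessageHost peer : peers) {
            if (peer == from || peer.name.equals(msg.sourceNode)) {
                continue;                             // never send back to the previous hop or the source
            }
            peer.receive(msg, this);
        }
    }
}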
A Process of Solving a JVM Memory Leak