Uneven CPU load from Java applications on multi-core servers

Source: Internet
Author: User

This post does not solve the problem it describes. It is only analysis and speculation. Some of the follow-up actions may confirm a few guesses, and they may solve nothing. If you have run into the same situation, I would appreciate your advice.

Problem:

Last weekend, between an outing with colleagues and looking after my children at home, I logged in to the production environment to check the current running status of the cluster machines. I found that the load across the cores of these multi-core machines was extremely uneven.

The top output looked roughly like this:

http://www.flickr.com/photos/33194437@N03/3702676767/ (the blog cannot upload images, so I can only link to the screenshot)

 

At peak, a single CPU reaches 80% usage, which is abnormal on a multi-core server. Java developers writing multi-threaded code cannot control how threads are assigned to CPUs, because Java does not implement its own threading mechanism. Although Java is a cross-platform language, thread performance and behavior vary greatly with the operating system's implementation. Java tuning therefore sometimes requires system configuration and even kernel tuning.

Analysis:

First, we ran the same stress test several times in the test environment, using the same operating system version and a configuration similar to production, but the load was distributed very evenly.

http://www.flickr.com/photos/33194437@N03/3703485402/ (the blog cannot upload images, so I can only link to the screenshot)

I then restarted one of the affected machines and found that the load dropped and became balanced. In other words, the current traffic should not cause such high CPU consumption, and hardware or operating system configuration problems can also largely be ruled out.

When CPU usage is pegged, an endless loop is the usual suspect, especially when a single CPU is consumed. Use top in thread mode (press H, or run top -H) to see which thread has been consuming CPU for a long time.

http://www.flickr.com/photos/33194437@N03/3702676803/ (the blog cannot upload images, so I can only link to the screenshot)

 

We can see that the thread with ID 13659 is the "culprit", but we do not yet know whether it is stuck in an endless loop, or whether it is an application thread or a system thread. The usual Java approach is to run kill -3 PID against the Java process and check the thread dump in the output log.
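The application-versus-system question can also be approached from inside the JVM. The following is a minimal sketch, not something used in the original investigation: a hypothetical helper named HotThreads ranks live Java threads by accumulated CPU time through the standard java.lang.management API. As far as I know, ThreadMXBean only reports Java threads, so a busy thread that top shows but this list does not is a hint that an internal JVM thread is involved.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: ranks live Java threads by accumulated CPU time.
class HotThreads {

    // One row of the report: thread name and CPU time in nanoseconds.
    static final class Usage {
        final String name;
        final long cpuNanos;

        Usage(String name, long cpuNanos) {
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    // Returns live Java threads sorted by CPU time, busiest first.
    // Returns an empty list if per-thread CPU timing is unsupported or disabled.
    static List<Usage> byCpuTime() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Usage> rows = new ArrayList<Usage>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return rows;
        }
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id);    // -1 if the thread has died
            ThreadInfo info = mx.getThreadInfo(id);  // null if the thread has died
            if (nanos >= 0 && info != null) {
                rows.add(new Usage(info.getThreadName(), nanos));
            }
        }
        Collections.sort(rows, new Comparator<Usage>() {
            public int compare(Usage a, Usage b) {
                return Long.compare(b.cpuNanos, a.cpuNanos);
            }
        });
        return rows;
    }

    public static void main(String[] args) {
        for (Usage u : byCpuTime()) {
            System.out.println(u.cpuNanos / 1000000 + " ms  " + u.name);
        }
    }
}
```

Calling byCpuTime() twice a few seconds apart and comparing the values gives a rate rather than a lifetime total, which is closer to what top displays.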

Searching the dump log for the nid that matches the thread number shows that the thread is the VM Thread, that is, an internal thread of the virtual machine. (To search, first convert 13659 to hexadecimal: 0x355b.)

http://www.flickr.com/photos/33194437@N03/3703479942/ (the blog cannot upload images, so I can only link to the screenshot)
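The conversion and lookup described above can be sketched in a few lines of Java. The class name NidLookup and the sample dump lines are made up for illustration (the tid values are placeholders). Only the nid=0x... token format and the value 13659 come from this investigation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for matching an OS thread ID from top to a thread-dump entry.
class NidLookup {

    // Formats a decimal thread ID the way it appears in the dump, e.g. nid=0x355b.
    static String toNid(int lwpId) {
        return "nid=0x" + Integer.toHexString(lwpId);
    }

    // Returns the first dump line that carries the given nid, or null if none does.
    static String findThreadLine(List<String> dumpLines, int lwpId) {
        String nid = toNid(lwpId);
        for (String line : dumpLines) {
            // Compare whole tokens so that nid=0x355b does not match nid=0x355bc.
            for (String token : line.split("\\s+")) {
                if (token.equals(nid)) {
                    return line;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up dump header lines; the tid values are placeholders.
        List<String> dump = Arrays.asList(
                "\"main\" prio=10 tid=0x00000001 nid=0x3550 waiting on condition",
                "\"VM Thread\" prio=10 tid=0x00000002 nid=0x355b runnable");
        System.out.println(toNid(13659));
        System.out.println(findThreadLine(dump, 13659));
    }
}
```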

 

I used pstack to see what this thread was doing. The result is as follows:

Thread 2074 (Thread 1846541216 (LWP 13659)):
#0 0x0659fa65 in ObjectSynchronizer::deflate_idle_monitors ()
#1 0x065606e5 in SafepointSynchronize::begin ()
#2 0x06613e83 in VMThread::loop ()
#3 0x06613a6f in VMThread::run ()
#4 0x06506709 in java_start ()
#5 0x00aae3cc in start_thread () from /lib/tls/libpthread.so.0
#6 0x00a1896e in clone () from /lib/tls/libc.so.6

I searched for ObjectSynchronizer::deflate_idle_monitors and found an entry in Sun's bug database for JDK 1.6 in which this method causes runtime problems: http://bugs.sun.com/bugdatabase/view_bug.do;jsessionid=803cb2d95886bffffffff9a626d3b9b28573?bug_id=6781744

I then looked up the source of this class in OpenJDK to get a general idea of what it does. The code is here: http://xref.jsecurity.net/openjdk-6/langtools/db/d8b/synchronizer_8cpp-source.html
Its main job appears to be reclaiming resource objects. Together with the pstack output, we can roughly conclude that it manages thread-related resources, but I did not analyze the code any further.

Next, let us analyze our own application:

Stress testing (high intensity, long duration) has been done, and no anomalies were found.

Is there a defect in the application itself that causes the problem? Some people say that the VM Thread also takes part in GC work, so a memory leak or a long backlog of objects could affect it as well. However, the dump shows that GC has its own worker threads, and I also observed how long the GC threads were busy. CPU usage caused by busy GC can therefore basically be ruled out.

The SIP project uses the JDK thread pool (ExecutorService) and, in application code, a BlockingQueue. As mentioned in previous articles, calling the poll method of the latter in JDK 1.5 may cause a memory leak. JDK 1.6 no longer leaks memory, but the growing number of temporary lock objects increases GC frequency.
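One way to put a number on GC activity from inside the JVM is the standard GarbageCollectorMXBean API. This is a rough sketch under my own assumptions, not the method used in the original observation: the class name GcTimes is hypothetical, and the reported collection time is approximate elapsed time, so the resulting share is only an estimate.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: rough estimate of how much JVM uptime went to GC.
class GcTimes {

    // Sums the accumulated collection time in milliseconds over all collectors.
    // A collector that reports -1 (undefined) is skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long ms = gc.getCollectionTime();
            if (ms > 0) {
                total += ms;
            }
        }
        return total;
    }

    // Approximate share of JVM uptime spent collecting (0.0 if uptime is not measurable yet).
    static double gcFraction() {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMs <= 0 ? 0.0 : (double) totalGcMillis() / uptimeMs;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
        System.out.println("GC share of uptime: " + gcFraction());
    }
}
```

A small share here while one CPU stays busy would support the conclusion that GC is not what is consuming the CPU.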

Action:

This scattered analysis led me to the following actions:

1. Upgrade the JDK on one server. The current version is 1.6.0_10-b33, and we plan to upgrade to 1.6 update 14. Then compare its behavior with the other machines to see whether the JDK upgrade solves the problem.

2. Stop using LinkedBlockingQueue as the message queue. The producer hands its results directly to a consumer thread chosen by an algorithm, which avoids contention and lock overhead, and also avoids the resource consumption caused by LinkedBlockingQueue.

3. Keep running long stress tests in the test environment, and use tools such as JProfiler to analyze problems that may only appear after a long run.
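Action 2 can be sketched as follows. This is only one possible shape, under my own assumptions, and not the project's actual code: the class name DirectDispatcher, the hash-based routing rule, and the use of ConcurrentLinkedQueue as a per-consumer mailbox are all illustrative choices. Each consumer owns a private lock-free queue, so producers never contend on a single shared queue.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

// Hypothetical sketch: each consumer owns a private mailbox, and the producer
// routes every message by key instead of pushing to one shared queue.
class DirectDispatcher {

    private final Queue<Runnable>[] mailboxes;
    private final Thread[] workers;
    private volatile boolean running = true;

    @SuppressWarnings("unchecked")
    DirectDispatcher(int consumers) {
        mailboxes = new Queue[consumers];
        workers = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            final Queue<Runnable> box = new ConcurrentLinkedQueue<Runnable>();
            mailboxes[i] = box;
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    drain(box);
                }
            }, "consumer-" + i);
            workers[i].start();
        }
    }

    // Routing rule (an assumption): the same key always maps to the same consumer.
    static int indexFor(Object key, int consumers) {
        return (key.hashCode() & Integer.MAX_VALUE) % consumers;
    }

    // Called by the producer: enqueue on the chosen consumer and wake it up.
    void dispatch(Object key, Runnable task) {
        int i = indexFor(key, mailboxes.length);
        mailboxes[i].add(task);
        LockSupport.unpark(workers[i]);
    }

    // Consumer loop: run tasks from its own mailbox, park briefly when it is empty.
    private void drain(Queue<Runnable> box) {
        while (running || !box.isEmpty()) {
            Runnable task = box.poll();
            if (task != null) {
                task.run();
            } else {
                LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(1));
            }
        }
    }

    // Stops the consumers once their mailboxes are empty. Tasks dispatched
    // after this call may never run.
    void shutdown() {
        running = false;
        for (Thread t : workers) {
            LockSupport.unpark(t);
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        DirectDispatcher d = new DirectDispatcher(4);
        for (int i = 0; i < 8; i++) {
            final String key = "session-" + (i % 3);
            d.dispatch(key, new Runnable() {
                public void run() {
                    System.out.println(Thread.currentThread().getName() + " handled " + key);
                }
            });
        }
        d.shutdown();
    }
}
```

Routing by key keeps messages for the same key in order on one consumer, at the cost of uneven load if a few keys dominate. Whether this actually reduces CPU usage compared with LinkedBlockingQueue would still need to be measured.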

Afterword:

These days you really have to learn a bit of everything. It is better to rely on yourself than to ask others.

It pays to learn something about system administration, database administration, and testing. At the very least you can then do something yourself in the early stage of troubleshooting. Otherwise, when everyone else is busy, you cannot even get started. Take this stress test: getting time from the test team is not easy, and it still could not meet our need for a timely release, so we ran LoadRunner ourselves and produced an ad hoc report to look at first. Application anomalies can be caused by the application design, the development language, or the operating system, so locating such complicated problems takes patience and solid knowledge in all of these areas. Right now my knowledge is clearly still lacking. Otherwise, I could analyze the possible problem in OpenJDK myself.
