Cause analysis of an excessive number of Java threads


1. Problem description

Some time ago, because of a fault in machine room B, we switched all traffic to machine room A. After a push notification plus the natural traffic peak, every server in machine room A had its Java thread count approaching 1000 (1000 being the configured maximum). Observing again after 7 p.m., the thread count had dropped slightly but was still above 900, while TPS per server held at around 400/s, which is not a particularly heavy load. We then took one machine in room A offline and kept watching. By after 9 p.m., that machine's jetty process still had around 700-800 Java threads. Meanwhile, another machine in the same room, after a jetty restart and under the same 400+ TPS, stayed at only 300-odd threads. This was very strange: the offline machine was receiving no requests, yet still held so many Java threads. Could contention between threads be causing a pile-up? So we started to investigate.

2. Saving the scene

First I reviewed and recorded the state of the jetty process at the time: thread count, DB connections, CPU load, memory usage, TPS, nginx connections, jetty error logs, GC status, TCP connection status, and so on. Everything was normal. I then used the jstack command to export a snapshot of all threads in the JVM and the jmap command to export a heap dump, and found that apart from the Java thread count, all other metrics were normal. First, some notes on common parameters and pitfalls for jstack and jmap:

Finding the jetty process PID. Both commands need the PID of the jetty process. You can normally find it with the jps command, but my Linux account is my own personal user, and on the production server I have no jetty permissions; that is, jps only shows the jetty process when run as the jetty user. After some searching, I used "ps aux | grep jetty.xml" to find the PID of the jetty process.

jstack: saving a thread snapshot. The full jstack command I used was: "sudo -u jetty /data/java/bin/jstack -l pid >> ~/jstack.txt". The -l parameter also prints lock information. There is a fairly hidden pitfall here: our jetty process runs as the jetty user, so running "jstack -l pid" directly as root or any other user fails with an error. You therefore need to prefix the command with "sudo -u jetty" to run it under the jetty account. Since the jetty account may not have /data/java/bin on its PATH, the full path to jstack is required. The result is saved to jstack.txt in the home directory. (An ops colleague assisted with this.)

jmap: saving a heap dump. Similarly, jmap must be run as the right user.
The full command I used was: "sudo -u jetty /data/java/bin/jmap -dump:format=b,file=~/jmap.hprof pid". The hprof file is very large and holds a snapshot of every object on the heap at that moment. It cannot be read directly; a dedicated tool is needed to analyze it. The most common choices are MAT and jhat. MAT is a graphical tool with a Windows version and is more convenient, but it has a fatal flaw: when the hprof file being parsed is too large, MAT itself runs out of memory and fails to produce a result. I tried to work around this but found no effective fix, so here I used jhat instead.

jhat: analyzing the heap dump. jhat is a command-line tool bundled with the JDK. It is comparatively primitive, but for a particularly large file it seems to be the only option. After compressing the hprof file and downloading it to a VM in the development environment, analysis can begin. The full command I used was: "jhat -J-d64 -J-mx9g -port 5000 jmap.hprof". To explain the parameters: -J-d64 specifies 64-bit mode, since the jetty process runs on a 64-bit system; -J-mx9g allows the jhat process to allocate up to 9 GB of heap, which is exactly why jhat can cope with such a large file; -port 5000 is the port of the web service jhat starts once analysis is complete, through which you browse the results.

The parameters are settled, but there is still a problem with the command as written: it leaves the jhat process in the foreground. In other words, if you press Ctrl+C (or your Xshell connection times out) and the foreground process ends, the web service jhat provides ends with it, and the file you spent so long analyzing has to be re-analyzed from scratch. The solution on Linux is the nohup and & combination: "nohup jhat -J-d64 -J-mx9g -port 5000 jmap.hprof &" puts the process into the background.
Those interested can read up on nohup; I won't repeat it here. jhat takes some time to analyze; you can watch with the top command, and once the jhat process stops devouring CPU the analysis is done. At that point you can browse the results at ip:port.

3. Locating the problem

First, back to jstack.txt. Of the nearly 900 threads, more than 600 were waiting on the same object <0x0000000734afba50>, and the call stacks of these 600+ threads were identical. A quick check showed that org.eclipse.jetty.util.thread.QueuedThreadPool is jetty's worker thread pool: whenever a request arrives, jetty either creates a new thread in the QueuedThreadPool or takes an idle one from it to handle the request. Seeing "at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(" in the call stack, it felt as though these threads were waiting for tasks to process. That was only a guess, of course. To verify it, I turned to the heap snapshot that jhat had just finished analyzing. I first located the class org.eclipse.jetty.util.thread.QueuedThreadPool, then clicked through into its only instance.

Here you can see all the member fields of the QueuedThreadPool object:

Of these, two AtomicInteger fields deserve attention: _threadsStarted and _threadsIdle. _threadsStarted is the number of threads the QueuedThreadPool currently owns, while _threadsIdle is the number of those that are idle. Clicking into them shows values of 707 and 613 respectively. That means the pool currently has 707 threads open to handle user requests, 613 of which are idle. This largely confirms the guess above: those many TIMED_WAITING threads really are sitting there waiting for requests to arrive. Which raises the question: if they are idle threads, why doesn't jetty reclaim them, and why do they stay in TIMED_WAITING for so long?

4. Analyzing the jetty source code to determine the cause

To find out why the idle threads were not being reclaimed, reading the jetty source is one approach. I could only find source for the same major version of jetty, with a close minor version, but that does not stop us from working out the logic of this part. Returning to the jstack output from before: anyone familiar with blocking queues will recognize org.eclipse.jetty.util.BlockingArrayQueue.poll in the stack as fetching a task from a blocking queue. If no task is available at that moment, the thread blocks until either a task becomes available or the call times out; on timeout, poll returns null. Stepping into the idleJobPoll() method, it is easy to see that the poll timeout also uses the _maxIdleTimeMs field, i.e. 60 s. So 600+ threads end up waiting on a single condition: a task arriving in the BlockingArrayQueue. Then who calls poll? Why, given that poll has a timeout, do the threads not exit after timing out? Why are these idle threads not reclaimed promptly?
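The timeout behavior that idleJobPoll() relies on can be demonstrated with any JDK blocking queue. A minimal sketch using the standard LinkedBlockingQueue (an assumption for illustration: jetty's own BlockingArrayQueue has the same poll contract, and the 100 ms timeout here stands in for the 60 s _maxIdleTimeMs):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollTimeoutDemo {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();
        // Analogous to idleJobPoll(): block for up to the idle timeout
        // waiting for a job; if nothing arrives, poll returns null.
        Runnable job = jobs.poll(100, TimeUnit.MILLISECONDS);
        System.out.println("job after timeout: " + job); // prints "job after timeout: null"
    }
}
```

A worker that receives null from poll knows it has been idle for a full timeout period, which is exactly the point where the shrink check below kicks in.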
With these questions in mind, let us look at the QueuedThreadPool source. We go straight to the code that calls idleJobPoll(). A few fields and methods need explaining first. _maxIdleTimeMs is a QueuedThreadPool field giving the idle timeout in milliseconds; the default is 60000 (60 seconds), and you can confirm the value in the jhat results above. _lastShrink is also a QueuedThreadPool field, a thread-safe AtomicLong holding the timestamp of the last thread's exit, shared by all threads in the pool. compareAndSet is the well-known CAS (compare and set): for example, _lastShrink.compareAndSet(last, now) compares _lastShrink with last; if they are equal, it sets _lastShrink to now and returns true; otherwise it assigns nothing and returns false. When an idle thread times out inside idleJobPoll(), it receives null and re-enters the while loop. The thread count at that point, size (700+), is greater than _minThreads (configured as 200), so the highlighted if block is entered. The if block essentially performs these steps:

1. Assign the timestamp of the previous thread exit to last, and the current time to now.

2. The test "if (last == 0 || (now - last) > _maxIdleTimeMs)" then checks whether more than _maxIdleTimeMs (60000 ms, i.e. 60 s) has passed since the previous thread exit.

3. If the condition in step 2 holds, _lastShrink is set to the current time and the thread count of the QueuedThreadPool is decremented by one. compareAndSet guarantees that only one thread at a time can win the assignment. The winning thread returns, breaking out of the while loop, and ends. A thread that loses the CAS falls through to idleJobPoll() again, as in step 4.

4. If the condition in step 2 does not hold, the thread goes back into idleJobPoll(), blocking while it tries again to fetch a task from the queue.

In other words, every time an idle thread reaches this code, it checks whether more than 60 s have passed since the last thread exit. If not, the thread simply goes back into idleJobPoll() to wait for tasks again. That means that within any one minute, the QueuedThreadPool allows at most a single thread to exit. So for our 600+ idle threads to all exit would take 600 minutes, that is, 10 hours! No wonder there were so many idle threads hanging around. These idle threads can be reused and do not hurt the business, but they do consume thread resources. I don't know whether this counts as a bug, but it is certainly a trap: it is easy to be misled by the Java thread count into misjudging how busy the service really is.

5. Experimental verification

To further verify this conclusion, I deployed the same service in the development environment, clean and with no other traffic. Using ab with a concurrency of 1000, I fired 30,000 requests and quickly drove the Java thread count up to 1000 (the maximum). I then used the watch command to sample the Java thread count every 5 minutes. Here are some of the results:

As you can see, the thread count drops by 5 every 5 minutes: it really is one thread exiting per minute!

6. Conclusion

The most important takeaway from this whole process: when you find the Java thread count is very high, there is no need to worry too much at first; it may simply be that jetty has not yet reclaimed its idle threads. More important still is to master the tools and methods of analysis, so you can uncover the cause behind the phenomenon.
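For reference, the thread-pool limits discussed above can also be set in code rather than in jetty.xml. A hedged configuration sketch, assuming an embedded jetty 9.x server on the classpath (the 1000/200/60000 values mirror this incident's configuration; check the constructor signature of your own jetty version):

```java
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.util.thread.QueuedThreadPool;

public class JettyPoolConfig {
    public static void main(String[] args) throws Exception {
        // maxThreads=1000, minThreads=200, idleTimeout=60000 ms: the same
        // limits as in this incident. A shorter idleTimeout trims idle
        // workers sooner, but the pool still shrinks by at most one
        // thread per idleTimeout window.
        QueuedThreadPool pool = new QueuedThreadPool(1000, 200, 60000);
        Server server = new Server(pool);
        server.start();
        server.join();
    }
}
```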
