Batch Thread Blocking

Source: Internet
Author: User
Tags: throwable

Queue 17 ran into failures on March 6.
Queues 19 and 20 both stopped, and queues 10 and 2 stopped as well.

March 17, 2017 00:29:32
Initial analysis: the monitor in the Batchrunner class may have crashed.
Symptom: the scheduler on the batch machine is healthy and keeps firing on schedule, but the batches are not executed. The non-batch machine does have one record, produced by the core system's day-cut invocation. (There is one every other day, because of load balancing.)
The first conjecture is that Runbatch hit some error while executing queue 17 and the thread died. But Runbatch catches Throwable, so in theory every exception would be caught.

Alternatively, Runbatch hit no error at all and is simply blocked. That would also explain the symptoms.

For batch 17, the first step shows the state "running" and the remaining steps are "to be run".
Need to check whether a table is locked. (But if a table were locked, why does the other machine have successful data? Can a lock conflict occur inside an independent transaction?)

The problem is probably right here, although the reason is unclear.
In Httpclientimpl, the thread is blocked inside the if condition that compares against connection.getResponseCode(). The call never returns, so the thread stays blocked, and because the pay batches are executed sequentially, nothing after it runs. Workstep creates its pool with ExecutorService executor = Executors.newCachedThreadPool(); that method looks like this:
public static ExecutorService newCachedThreadPool() {
    return new ThreadPoolExecutor(0, Integer.MAX_VALUE,
                                  60L, TimeUnit.SECONDS,
                                  new SynchronousQueue<Runnable>());
}
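My reading was that because the pool uses a synchronous queue, only one task runs at a time and the next starts only after it finishes. So once the thread blocked, nothing was executed even though Quartz kept triggering. (http://blog.csdn.net/zbd_answer/article/details/20630719?locationNum=13) Note that a SynchronousQueue only hands each task straight to a worker thread, and with a maximum of Integer.MAX_VALUE threads the pool starts a new thread whenever no idle worker is free, so the pool itself does not serialize tasks. If the steps really do run one at a time, the caller is probably waiting for each step to finish before submitting the next; this still needs to be verified.
A blocked step thread is consistent with all of the symptoms.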
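To make the thread pool behavior concrete, here is a minimal sketch, not taken from the pay code. The class and method names are hypothetical, and a CountDownLatch stands in for the step that is stuck in getResponseCode(). It shows that a cached thread pool still runs a second task while the first is blocked, and that a caller which waits on each step before submitting the next gets stuck behind the blocked one.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo class; not part of the pay application.
class CachedPoolDemo {

    // Returns true if a second task completes while the first is still blocked.
    static boolean secondTaskRunsWhileFirstIsBlocked() {
        ExecutorService executor = Executors.newCachedThreadPool();
        CountDownLatch release = new CountDownLatch(1);
        try {
            // Task 1 stands in for the step stuck in getResponseCode().
            executor.submit(() -> { release.await(); return null; });
            // Task 2 is submitted without waiting for task 1.
            Future<String> second = executor.submit(() -> "done");
            return "done".equals(second.get(2, TimeUnit.SECONDS));
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            return false;
        } finally {
            release.countDown();
            executor.shutdownNow();
        }
    }

    // Returns true if a caller that waits for each step is held up by a blocked step.
    static boolean waitingCallerIsStuck() {
        ExecutorService executor = Executors.newCachedThreadPool();
        CountDownLatch release = new CountDownLatch(1);
        try {
            Future<Object> first = executor.submit(() -> { release.await(); return null; });
            // Sequential style: wait for step 1 before submitting step 2.
            first.get(500, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true; // a get() without a timeout would wait here indefinitely
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            release.countDown();
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("second task ran: " + secondTaskRunsWhileFirstIsBlocked());
        System.out.println("waiting caller stuck: " + waitingCallerIsStuck());
    }
}
```

If Workstep waits on each step's result without a timeout, passing a timeout to Future.get, as in the second method, is one way to keep a single hung step from stalling the whole queue.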
Symptoms:
1. On the batch machine, thread Thread-420 blocked while executing step 1701 of queue 17. After that, the logs show nothing more from Thread-420 or from other similar threads.
2. The Quartz threads have kept running the whole time.
3. Queue 10 can run on the non-batch machine. (At day-cut, core notifies pay. Core reaches pay through the load balancer, and the current strategy alternates: the batch machine one day, the non-batch machine the next.)
4. Throwable is caught in many places, so if the thread had crashed there would be a log entry. But the logs contain nothing.
5. On the batch machine, the pstack output for Thread-420 (I am not familiar with this kind of stack output, which is quite low-level) shows it in SocketInputStream_socketRead, meaning the thread is still waiting to read. Querying port 8091 with netstat (the proxy server port used to reach UnionPay) showed a connection that stayed in ESTABLISHED, yet no queue 17 transactions had taken place.
For comparison, the batch thread on the non-batch machine was checked the same way. pstack shows that thread in a timed wait, and port 8091 has no connection. After a UnionPay transaction was run, port 8091 showed a connection; queried again later, the connection was gone. So the connection normally closes by itself, which further confirms that the batch machine's thread is blocked: that is why its connection was never closed.
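The same kind of check can be done from inside the JVM. The following is a minimal sketch, assuming a hypothetical helper class BlockedThreadFinder that is not in the original application. It scans Thread.getAllStackTraces() for threads whose stack contains a given frame, which is roughly what the pstack inspection above did by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the class and method names are illustrative only.
class BlockedThreadFinder {

    // Returns the names of live threads that have at least one stack frame
    // whose "ClassName.methodName" contains the given fragment.
    static List<String> threadsWithFrame(String fragment) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                String where = frame.getClassName() + "." + frame.getMethodName();
                if (where.contains(fragment)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame is enough for this thread
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumption: a thread stuck in a blocking socket read has a frame from
        // a class named SocketInputStream. The exact frames vary by JVM version.
        System.out.println(threadsWithFrame("SocketInputStream"));
    }
}
```

Calling this periodically and logging the result would leave a trace in the application log the next time a batch thread hangs in a socket read.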

Points that were suspected earlier:
1. Is the database locked? The non-batch machine can run queue 10 batches, so a database lock is not the cause.
2. Did the Quartz thread crash? No: the batch machine's log contains entries from this thread.
3. Did the batch thread crash? That was my first thought, but on closer analysis it would be strange to have no log at all, given that Throwable is caught. In addition, Thread-420 still exists on the batch machine. I cannot be completely sure that Thread-420 is the batch thread, but the other evidence makes it very likely.
4. Batch execution runs inside a transaction; could the transactions of two threads be conflicting? This has not been verified, and the first symptom above does not support it. The other conjecture is that the thread is blocked, which I think is very likely the real cause.


Reason:
So what causes the thread to block?
I do not know. Posts online say it happens when no read timeout (readtimeout) is set, but we clearly did set this parameter.
http://www.4byte.cn/question/788682/httpurlconnection-getresponsecode-freezes-execution-doesn-t-time-out.html
This post ("HttpURLConnection.getResponseCode() freezes execution/doesn't time out") also describes a thread that freezes even though readtimeout was set.
However, nobody there gave the reason.
The specific cause is still under investigation.
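For reference, here is a minimal sketch of how a read timeout on HttpURLConnection is expected to behave. It does not use the pay code: ReadTimeoutDemo is a hypothetical class, and a local ServerSocket that accepts connections but never answers stands in for the unresponsive proxy. With a positive read timeout, getResponseCode() should give up with a SocketTimeoutException. With the default of 0, it waits indefinitely.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class; not part of the pay application.
class ReadTimeoutDemo {

    // Calls a local server that accepts the connection but never replies.
    // Returns true if getResponseCode() ended with a SocketTimeoutException.
    // Do not pass 0 here: a timeout of 0 means "wait forever".
    static boolean timesOut(int readTimeoutMillis) {
        try (ServerSocket silentServer = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + silentServer.getLocalPort() + "/").toURL();
            // NO_PROXY keeps JVM-wide proxy settings out of this local test.
            HttpURLConnection connection = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            connection.setConnectTimeout(1000);
            connection.setReadTimeout(readTimeoutMillis);
            connection.getResponseCode(); // blocks until data arrives or the timeout fires
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("timed out: " + timesOut(500));
    }
}
```

A check like this against the deployed configuration is a quick way to confirm that the timeout values the code is supposed to use are really in effect.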
Next steps for verification:
Run queue 17 locally and trace every step, watching how the database changes and whether Workstep behaves the way I think it does. (The intent was presumably to run the steps asynchronously on a thread pool, but the pool appears to queue them and run them synchronously. March 18, 2017 13:41:04: it is also possible that the designer only wanted to reuse threads to avoid thread creation overhead.)

On the production proxy server, look for the batch machine's local port. If it shows up there, the connection exists on both sides; if it does not, the connection has been lost on the proxy side.
Result: checked the proxy server, and it has no connection from the batch machine.
In the WebSphere console of the 31 batch machine, take a heap dump, a Java core (5 times, 30 seconds apart), and a system dump.

If the cause still cannot be found, add more detailed logging, plus extra log statements aimed at the current hypotheses.

Given this situation, it is necessary to add a heartbeat mechanism to the batch mechanism.
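One possible shape for such a heartbeat is sketched below. This is only an illustration under assumptions: BatchHeartbeat, its methods, and the threshold are hypothetical and do not come from the batch code. The batch thread records a timestamp each time it makes progress, and a separate monitor reports a stall when the timestamp gets too old.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical watchdog; names and thresholds are illustrative only.
class BatchHeartbeat {
    private final AtomicLong lastBeatMillis;
    private final long maxSilenceMillis;

    BatchHeartbeat(long maxSilenceMillis, long nowMillis) {
        this.maxSilenceMillis = maxSilenceMillis;
        this.lastBeatMillis = new AtomicLong(nowMillis);
    }

    // Called by the batch thread whenever it makes progress, e.g. after each step.
    void beat(long nowMillis) {
        lastBeatMillis.set(nowMillis);
    }

    // Called by a separate monitor, e.g. a scheduled job, to detect a hung batch thread.
    boolean isStalled(long nowMillis) {
        return nowMillis - lastBeatMillis.get() > maxSilenceMillis;
    }

    public static void main(String[] args) {
        BatchHeartbeat heartbeat = new BatchHeartbeat(60_000L, System.currentTimeMillis());
        heartbeat.beat(System.currentTimeMillis());
        System.out.println("stalled: " + heartbeat.isStalled(System.currentTimeMillis()));
    }
}
```

Passing the current time in as a parameter keeps the class easy to test. In production the monitor would call isStalled(System.currentTimeMillis()) on a schedule and raise an alert or write a log entry when it returns true.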

March 28, 2017

Comparing against the WAR package deployed in production showed that the ConnectionTimeout and readtimeout parameters really were missing from the UnionPay transport.xml on 31pay. Investigation showed that the packaging step overwrites the change.
