1 problem description:
The following is a graph of the number of threads after the application goes online:
Observe the number of threads: increase from 0.2k to 1.4 k, and then suddenly change to 0.2 k. All the added threads are Damon threads.
Have the following questions:
1: Why is there a thread leakage when the number of threads keeps increasing?
2: At a certain point in time, the number of threads suddenly dropped by 1.2 k, why?
1.1 thread leakage troubleshooting
Through jstack printing stack information, it is found that most threads are described as follows. (Some stack information is omitted)
"MySQL
Statement Cancellation Timer"
#2952 daemonin
Object. wait ()
Java. lang. Thread. State: TIMED_WAITING (on object monitor)
At java. lang. Object. wait (Native Method)
At
Java. util. TimerThread. run (Timer. java: 505)
The thread name indicates the mysql thread (this also shows how important it is to start a name that is easy to recognize when a custom thread is used ).
It can be seen from the information description that it is a JDK Timer thread, and the thread is in the TIMED_WAITING state.
Question:
What is the Statement Cancellation Timer thread? Why is it a Timer thread?
1.2 check for sudden decrease in the number of threads
First, observe the various resource consumption (such as cpu, memory, and jvm information) of the current system to see if some information can be obtained, the time points of YGC are exactly the same as those of the sudden decrease in the number of threads. As shown in the following figure (GC graph)
Question:
Does YGC give the TIMED_WAITING state thread to GC? A very challenging understanding of GC.
2 Basic Knowledge
2.1 heartbeat detection of database connection pool
1: The data connection pool manages and maintains connections to ensure connection availability. To ensure the availability of the connection, you need to heartbeat the connection.
2: The heartbeat is generally divided into two layers: tcp-layer heartbeat and application-layer heartbeat. The tcp layer uses keepalive for heartbeat, while the application layer uses the application layer protocol for heartbeat.
3: Some colleagues asked: TCP has a keepalive heartbeat mechanism. Why does it still need the heartbeat of upper-layer applications. (We recommend that you use the heartbeat at the application layer to keep the job alive)
Restrictions of keepalive:
Ensure TCP-level persistence. If the application fails but the port is still normal, the connection is still valid.
For IO applications, if the TCP layer fails to survive, the socket will be closed, but the upper-layer applications cannot perceive it, and the socket will still be considered valid.
Weak handling.
Advantages of application layer persistence:
Flexible and highly operable.
You can set a heartbeat policy for business features.
The connection is valid.
4: The database connection pool uses the application layer heartbeat to maintain the activity. There are three main heartbeat methods.
Statement
The heartbeat status of the database is queried. The connection pool provides an SQL configuration for heartbeat check, for example, select 1. Statement.exe cute ("select 1") will be executed ").
Ping
Mysql driver provides a ping method. Mysql provides a ping command (similar to the select command) for heartbeat detection.
IsValid
Jdbc provides the isValid interface, which is implemented by the mysql driver using the native ping command of mysql.
Comparison
Statement ping interface isValid interface
The select 1 parameter has no timeout parameter.
Low performance, high performance
Interface provides JDBC native mysql driver private JDBC4 native
Return value boolean void boolean
Heartbeat failure, SQLException, SQLException, and false
Summary:
Note: The ping protocol refers to the mysql protocol. The ping interface is used internally by the driver to send the ping protocol.
1: mysql native ping protocol has nearly twice the performance of statement (I have tested it). We recommend using the native ping protocol.
2: The ping interface is not a JDBC interface, and the ping protocol is enabled internally. If an error occurs, the system throws the error.
3: The internal implementation of isValid is also the ping method. It only encapsulates try catch. If try passes, true is returned, false is returned in catch, and the current Connection status is set to false (this will be explained later ).
2.2 Service timeout handling scheme
Generally, when you access a third-party interface or an interface that may be executed for a long time, a timeout value is usually set for fault tolerance.
For example:
1: Access the http interface and set the timeout time. If no data is returned at the specified time, a timeout exception is reported.
2: run the database command and set the timeout (call the setQueryTimeout interface of statement). If no data is returned at the specified time, a timeout exception is reported.
2.2.1 timeout control process:
1: before accessing the interface, encapsulate the current status as a scheduled task and set the start time of the task through the scheduled framework.
2: After receiving the returned data, the scheduled task is removed.
3: If data is not returned within the specified time, the scheduled task starts. Processing policy: (1) business processing (2) throwing a timeout exception.
4: business processing is generally: logging, sending the cancel command to the server (mysql driver is the SQL statement that sends the kill query connectionid to the database end to cancel execution ).
2.2.2 timing framework
The timing framework JDK has two main options: Timer and ScheduledExecutorService.
(1) Timer
There are two main methods: schedule and scheduleAtFixedRate. The parameters are: (TimerTask task, long delay, long period ).
Difference: for example, if the period is 5 s, schedule is the next execution after the current task is completed in 5 s. ScheduleAtFixedRate is a fixed 5 s task interval, including the task execution time.
It consists of three components:
TimerThread: the thread used to run tasks.
TaskQueue: stores all tasks. This queue ensures that the tasks are stored in sequence according to the execution time.
TimerTask: a running task is a Runnable that mainly controls the task status and execution time.
The figure above only roughly describes the Timer structure, and the logic is not very rigorous.
1: Call the schedule method to schedule tasks. After a task is added to a queue, the queue is rearranged to ensure that the first task is at the beginning of the queue.
2: When cancel is performed on a task, the status of the task is set to cancel. When the thread reads the queue, if the status of the task is cancel, it is removed from the queue. (Many cancel operations are designed in this way)
3: timerThread asynchronously reads the queue (tasks are stored in the order of execution). If the queue is empty, wait. Otherwise, the task is executed.
4: When the Timer cancel is called, a tag bit is set. The TimerThread execution is complete (that is, the exit of the TimerThread thread ). If an exception occurs during task execution, the TimerThread thread also exits (you must pay attention to the exception handling of the task when using Timer)
5:? .
(2) ScheduledExecutorService
The principle of ScheduledExecutorService is not described in this article (in fact, the principle of timing implementation is similar to that of Timer ).
The following describes the differences between the two:
1: thread model: Timer is a single thread. One thread polls and executes all tasks. ScheduledExecutorService is a multi-thread, which is actually an Executor in essence.
2: exception check: Timer does not perform an exception check on the running task. If the task encounters an exception, the Timer thread stops running. ScheduledExecutorService will capture and handle exceptions
To set a scheduled task, we recommend ScheduledExecutorService.
2.3 Caelus connection pool
The distributed database connection pool developed by Vipshop platform architecture. It mainly solves the following problems:
1: High Performance: design models based on lockless connection pool to improve connection pool performance;
2: Reduce the number of threads in scenarios with a large number of database shards. If there are 128 sub-databases, the existing connection pool model requires 128 independent connection pools. Each connection pool requires threads (1-4, different connection pools) to process tasks. A total of 128 to 128*4 threads need to be maintained, resulting in huge overhead. The Caelus connection pool greatly reduces the number of threads.
3: connection multiplexing. For a mysql instance, there are multiple schema scenarios. Connections of different schemas in the existing connection pool cannot be reused. Caelus can reuse connections of different schemas to improve performance.
4: Too many transaction commands. If it is a transaction statement, you need to enable the transaction (set autocommit = 0) after obtaining the connection from the connection pool, and set it again when returning the transaction (set autocommit = 1 ). Each time a connection is used, two additional transaction commands are executed. Caelus can effectively reduce transaction commands.
The details of Caelus are not described here.
3. Thread leakage parsing
The reasons for the increasing number of threads will be resolved below.
3.1 problem description
1: When the JDBC interface is used, if queryTimeout is set, the driver starts a Timer thread for timeout control. The Timer thread is the added thread that has been crawled through jstack: MySQL Statement Cancellation Timer.
2: It can be inferred that the application must have set queryTimeout. By viewing the user configuration, queryTimeout is set for each query.
3: The Timer thread is connected to the dimension. Each time a connection is created, if timeout control is performed, a new thread is created. When the connection is physically closed, the thread is closed. (Generally, the system connection timeout control is set globally. You only need to enable a Timer thread to run the timeout task ). Mysql driver enables a Timer thread for each connection, which is less efficient and has frequent operations to enable and disable threads. If the number of connections is too large, it will also cause the problem of excessive threads. Optimization can be considered here.
4: The method for physically closing the Connection is to call the Connection. close method. This method will first determine whether the current Connection is not in the close state, not the close state, and then cancel off the Timer thread.
Since the Timer thread is sent to cancel when the connection is closed, why does the number of Timer threads keep increasing and all threads are waiting. Next, we will analyze the operations that will close the connection.
3.2 scenario of disconnection
1: minIdle is usually set in the connection pool. If the number of idle connections exceeds minIdle, the connection will be closed. (For the configuration of the connection pool, refer to: http://blog.csdn.net/hetaohappy/article/details/51861015)
2: Set minEvictableIdleTimeMillis in the connection pool, that is, if the connection remains idle for the specified time, the connection will be closed.
3: If an I/O exception occurs during access, the connection will be closed. Mysql sets wait_timeout, that is, the idle timeout time for mysql connection. If the idle time-out occurs, mysql closes the connection. Because the access end is an I/O mechanism and cannot perceive socket shutdown, I/O exceptions will be reported when the socket is accessed.
The previous two scenarios do not reproduce bugs. The third scenario can reproduce bugs. The recurring process is;
1: create a connection, set queryTimeout, and then access the database.
2: kill the corresponding connection to the database (similar to idle timeout)
3: the client uses the connection for further access. A Timer thread is found.
3.3 cause analysis
Before executing the business SQL statement, the heartbeat check is performed first. The isValid interface is used for heartbeat. The isValid method performs heartbeat first. If the heartbeat fails, the Connection status is set to close.
If the Connection pool finds a heartbeat failure, it will call the close interface of Connection because the Connection status has been set to close. If the connection status is close, the system returns the result directly. This causes no cancel to drop the Timer thread.
Existing connection pool Troubleshooting:
Whether thread leakage exists in the connection pool
DBCP has
Druid none
HicariCP has
Caelus has
The connection pool with thread leakage uses isValid for heartbeat check. Druid adopts the ping method of mysql driver, so there is no problem. This can be changed to the same mechanism as Druid (ping method) for heartbeat checks.
4. Analysis of sudden decrease in the number of threads
During YGC, the number of threads suddenly dropped a lot. Theoretically, GC has nothing to do with the thread. It should be the condition triggered by YGC, and the condition will shut down the thread.
4.1 Timer thread shutdown scenario
1: Call the cancel method of Timer;
2: task running exception (Timer does not encapsulate the exception ).
Check whether YGC and the preceding two scenarios are disabled.
4.2 cause analysis
First, analyze the main interaction scenarios of GC and applications:
1: weak reference: after GC, the weak reference will be recycled;
2: finalize: The finalize method of the GC object is called.
By analyzing the Timer source code, we did not find any weak reference, but found that Timer has a threadReaper attribute to implement the finalize method. This method is to end the Timer thread (that is, to close it ). It is clear why the number of threads decreases in YGC. (It can be seen that the JDK source code has many considerations for various exceptions. Studying the design idea of the JDK source code can bring us a lot of reference for writing code)
The process of thread shutdown caused by YGC:
YGC --> reclaim the Connection object-> reclaim the Timer object --> reclaim the threadReaper object-> Execute The finalize method of threadReaper-> set the thread end state, and run notify-> Timer.
5. Summary
The problem occurs in two places:
1: The isValid method of the mysql driver has a bug. As a result, the Timer thread is not closed.
2: The isValid method is selected for heartbeat processing in the connection pool. You can use the ping method of mysql driver.
This problem can also be avoided through configuration.