Analysis of a "Too Many Open Files" Problem Triggered by Tomcat's Class Loading Mechanism

Source: Internet
Author: User
Tags: redis, socket, java, web


Song Shun is a technical expert in Ctrip's Framework R&D department. He joined Ctrip at the beginning of 2016 and is mainly responsible for middleware product development. A graduate of the Software Engineering department of Fudan University, he previously worked at Dianping as the technical lead of a back-end system.



The "Too many open files" error is presumably familiar to everyone. On Linux, it is thrown when a program opens more file handles (including network connections, local files, and so on) than the system limit allows.
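As an aside, a JVM process can observe its own descriptor usage against the limit through the `com.sun.management` extension. The sketch below is only an illustration: the cast is Unix-specific and fails on Windows JVMs.

```java
import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdUsage {
    // Returns {open, max} file descriptor counts for the current JVM process.
    // Unix-specific: on Windows the OperatingSystemMXBean does not implement
    // UnixOperatingSystemMXBean and this cast throws ClassCastException.
    static long[] fdCounts() {
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        return new long[] { os.getOpenFileDescriptorCount(), os.getMaxFileDescriptorCount() };
    }

    public static void main(String[] args) {
        long[] c = fdCounts();
        System.out.println("open file descriptors: " + c[0] + " / limit: " + c[1]);
    }
}
```

Polling a metric like this is one way to notice handle exhaustion before the limit is actually hit.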



Recently, however, we discovered that Tomcat's class loading mechanism can also trigger this problem in some cases. Today I'd like to share the troubleshooting process, the cause of the problem, and some follow-up optimization measures.



Before diving in, a brief introduction to the background.



Apollo Configuration Center is a configuration management platform developed by Ctrip's Framework R&D department (the author's department). It provides centralized configuration management, configuration modification, real-time push, and other features.



A Java web application connects to the Apollo Configuration Center, so after a user modifies a configuration in the configuration center, the latest configuration is pushed to the application in real time.



1. Failure symptoms



One day, a developer modified an application's configuration in the production environment through the configuration center, as usual, and found that the application immediately began throwing a large number of errors.



Checking the logs showed that Redis connections could not be obtained, causing a large number of interface errors.


Caused by: redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
    at redis.clients.util.Pool.getResource(Pool.java:50)
    at redis.clients.jedis.JedisPool.getResource(JedisPool.java:99)
    at credis.java.client.entity.RedisServer.getJedisClient(RedisServer.java:219)
    at credis.java.client.util.ServerHelper.execute(ServerHelper.java:34)
    ... more
Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketException: Too many open files
    at redis.clients.jedis.Connection.connect(Connection.java:164)
    at redis.clients.jedis.BinaryClient.connect(BinaryClient.java:82)
    at redis.clients.jedis.BinaryJedis.connect(BinaryJedis.java:1641)
    at redis.clients.jedis.JedisFactory.makeObject(JedisFactory.java:85)
    at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:868)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:435)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:363)
    at redis.clients.util.Pool.getResource(Pool.java:48)
    ... more
Caused by: java.net.SocketException: Too many open files
    at java.net.Socket.createImpl(Socket.java:447)
    at java.net.Socket.getImpl(Socket.java:510)
    at java.net.Socket.setReuseAddress(Socket.java:1396)
    at redis.clients.jedis.Connection.connect(Connection.java:148)
    ... more
Because this application is a basic service that many upstream applications depend on, a chain reaction followed. In the heat of the moment, the only option was to restart Tomcat on all the machines, after which the failure recovered.


2. Preliminary analysis



Since the timing of the failure coincided with the configuration change, we (the team that maintains the configuration center) were contacted immediately to help troubleshoot. By the time we were notified, however, the failure had already recovered and the application had been restarted, so the scene was gone. We had to rely on logs and CAT (a real-time application monitoring platform) to look for clues.



From the CAT monitoring data, the application cluster had 20 machines in total. When the configuration client fetched the latest configuration and prepared to notify the application of the change details, only 5 notifications succeeded and 15 failed.



On the 15 machines where notification failed, the JVM seemed to have a problem: it reported a class loading failure (NoClassDefFoundError), and the error message was caught and recorded in CAT.



The information from the 5 successful machines is as follows:






The information from the 15 failed machines is as follows:




The error details are as follows:



java.lang.NoClassDefFoundError: com/ctrip/framework/apollo/model/ConfigChange
...
Caused by: java.lang.ClassNotFoundException: com.ctrip.framework.apollo.model.ConfigChange
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1718)
    at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1569)
    ... more
When a configuration is updated, the configuration client computes the changes and notifies the application. Configuration changes are represented by the ConfigChange object.



Since this was the first configuration change after the application started, it was also the first time the ConfigChange class was used; under the JVM's lazy loading mechanism, this triggered a class loading operation.
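A minimal sketch of this lazy loading behavior (the class and field names below are invented for illustration): a class's static initializer does not run until its first active use, which is exactly when ConfigChange was loaded here.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyLoadDemo {
    static final List<String> events = new ArrayList<>();

    // Stand-in for a class like ConfigChange that nothing touches until the
    // first configuration change arrives.
    static class Helper {
        static { events.add("Helper initialized"); }
        static int answer() { return 42; }
    }

    public static void main(String[] args) {
        events.add("main started");
        // Helper's static initializer has NOT run yet: the JVM defers class
        // loading and initialization until the first active use below.
        int x = Helper.answer();
        events.add("answer=" + x);
        System.out.println(events);
    }
}
```

Running this prints the events in the order `main started`, `Helper initialized`, `answer=42`, showing that the class is only loaded at the moment of first use.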



This raises a question: why couldn't the JVM load the class? The class com.ctrip.framework.apollo.model.ConfigChange is packaged in the same jar as the other configuration client classes, so a NoClassDefFoundError should not have occurred.



Coincidentally, the machines that later reported Redis connection errors were exactly the 15 machines that reported NoClassDefFoundError.



Recalling the earlier Too many open files error: could an insufficient number of file handles also prevent the JVM from reading jar packages from the file system, resulting in NoClassDefFoundError?



3. Root cause of the failure



All indications were that the application's problems during that period stemmed from an insufficient number of process file handles: inability to load local files, inability to establish Redis connections, inability to make network requests, and so on.



We had run into this problem with one of our applications before: the max open files setting on the old machines was 65536, but on a new batch of machines it was mistakenly set to 4096.



Although the operations team later fixed this setting, it only takes effect after a restart, so a considerable number of machines in production still had max open files set to 4096.



So we logged on to one of the faulty machines to check. The problematic application had already been restarted, so its original state could not be obtained; however, another application was deployed on the same machine, with PID 16112. Looking at the /proc/16112/limits file showed that its max open files was indeed 4096.




So there is reason to believe that when the problem occurred, the application's max open files was also 4096; once the number of handles reached 4096, all subsequent IO would fail.



The cause of the failure was thus found: because max open files was set too small, once the process had 4096 file handles open, all subsequent requests (file IO, network IO) failed.



Because the application depends on Redis, the inability to connect to Redis caused a large number of request timeouts, requests piled up, and a vicious circle set in. (Fortunately, the SOA framework has circuit-breaking and rate-limiting mechanisms, so the impact lasted only a few minutes.)



4. The puzzle



With the cause found, everyone seemed satisfied with the answer. Yet one question still lingered: why did the failure occur so coincidentally, right after a user published a configuration through the configuration center?



Why was the number of open file handles below 4096 before the configuration was published, but above it afterwards?



Does the configuration client open a large number of file handles after a configuration is published?



4.1 Code analysis



Code analysis of the configuration client shows that after the configuration center pushes a configuration, the client does the following:



1. The previously disconnected HTTP long-polling connection is re-established



2. An asynchronous task fetches the latest configuration from the server



3. The latest configuration is written to a local file



4. After the local file is written successfully, the diff is computed and the application is notified



As the steps above show, step 1 re-establishes a previously disconnected connection, so it is not a net new handle; steps 2 and 3 each briefly open a new file handle (one network request and one local file IO), but both are released after execution.
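The four steps can be sketched roughly as below. This is a hypothetical outline, not the real Apollo client code; all names, payloads, and file names are invented for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of the client's reaction to a configuration push.
public class ConfigPushSketch {
    static String current = "timeout=100";   // config known to the client so far

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Step 1: re-establish the HTTP long-polling connection (stubbed here).
        System.out.println("long poll reconnected");

        // Step 2: fetch the latest configuration asynchronously (stubbed payload).
        Future<String> latest = pool.submit(() -> "timeout=200");
        String fetched = latest.get();

        // Step 3: write it to a local cache file -- one short-lived file handle.
        Path cache = Files.createTempFile("config-cache", ".properties");
        Files.write(cache, fetched.getBytes());

        // Step 4: compute the diff and notify the application.
        if (!fetched.equals(current)) {
            System.out.println("changed: " + current + " -> " + fetched);
            current = fetched;
        }

        Files.delete(cache);
        pool.shutdown();
    }
}
```

Note that in this sketch, as in the analysis above, every handle (the cache file, the fetch) is short-lived and released promptly, which is why the client itself could not explain a handle spike.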



4.2 Trying to reproduce



After reading through the code several times without finding the problem, I tried to reproduce it. I started a demo application (command-line, non-web), triggered a configuration release, and recorded the open file information in real time with a bash script for later analysis:



for i in {1..1000}
do
    lsof -p 91742 > /tmp/20161101/$i.log
    sleep 0.01
done
However, many local tests showed no increase in the number of file handles. Although this cleared the configuration client, the answer still seemed out of reach: how could the observed phenomena be explained?



5. The truth comes to light



Unable to reproduce the problem on my own, only one move remained: reproduce it directly with the affected application.



To avoid affecting the application, I deployed its war package, together with the Tomcat it uses, separately in the test environment. Unexpectedly, the cause of the problem quickly surfaced.



It turns out that Tomcat uses its own WebappClassLoader for each web app. During startup it opens the jars the application depends on in order to load class information, but it then closes all the open jars after a period of time to release resources.



However, if a new class needs to be loaded later, all the previously closed jars are reopened so that the corresponding class can be found and loaded. After loading, they are released again after a period of time.



So the more jars an application depends on, the more file handles are opened at such moments.
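The pattern can be sketched as below. This is an illustrative simplification, not Tomcat's actual code: the real WebappClassLoader keeps the opened JarFile handles cached and a background thread closes them later, whereas this sketch closes each jar immediately.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

public class JarScanSketch {

    // To find a class that is not cached, scan every jar (think WEB-INF/lib)
    // for the class's resource entry. Each open jar costs one file handle.
    static byte[] findResource(File[] jars, String resource) throws IOException {
        for (File f : jars) {
            try (JarFile jar = new JarFile(f)) {     // one file handle per jar
                JarEntry entry = jar.getJarEntry(resource);
                if (entry == null) continue;         // not here, try the next jar
                try (InputStream in = jar.getInputStream(entry)) {
                    return in.readAllBytes();
                }
            }
        }
        return null;   // would surface to the caller as ClassNotFoundException
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway jar containing one fake "class" entry.
        Path jarPath = Files.createTempFile("demo", ".jar");
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jarPath))) {
            out.putNextEntry(new JarEntry("com/example/Demo.class"));
            out.write(new byte[] { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE });
            out.closeEntry();
        }
        byte[] bytes = findResource(new File[] { jarPath.toFile() }, "com/example/Demo.class");
        System.out.println("found " + bytes.length + " bytes");
        Files.delete(jarPath);
    }
}
```

With hundreds of jars under WEB-INF/lib, one such lookup means hundreds of simultaneously open handles, which is exactly the spike observed below.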



We later confirmed this logic in Tomcat's source code.



The biggest problem with the earlier reproduction attempt was that it did not fully replicate the failure scenario; if the test had been run directly against Tomcat, the cause could have been found earlier.



5.1 Reproduction environment analysis



5.1.1 Right after Tomcat startup



The number of handles opened by the process right after startup is 443.



lsof -p 31188 | wc -l
5.1.2 A few minutes after Tomcat startup



Checking again a few minutes after startup, only 192 handles remained. A careful comparison of the differences showed that the handles for the jars under WEB-INF/lib had all been released.



lsof -p 31188 | wc -l
lsof -p 31188 | grep "WEB-INF/lib" | wc -l
5.1.3 Two seconds after a configuration release



Then a configuration release was made through the configuration center. Checking again, the count had suddenly risen to 422, and the difference is exactly the number of jar handles under WEB-INF/lib. The following commands show that the extra 228 handles are for jar files under WEB-INF/lib.



lsof -p 31188 | wc -l
lsof -p 31188 | grep "WEB-INF/lib" | wc -l
5.1.4 Thirty seconds after the configuration release



After about 30 seconds, the jar handles under WEB-INF/lib had all been released again.



lsof -p 31188 | wc -l
lsof -p 31188 | grep "WEB-INF/lib" | wc -l
5.2 Tomcat WebappClassLoader logic



Reading Tomcat's source code (version 7.0.72) confirms our experimental results.



5.2.1 Class loading logic



When Tomcat loads a class, it first opens all the jar files and then loops through the jars to find and load the class:






5.2.2 Jar file closing logic



A background thread also periodically closes the jar files to release resources:




5.3 Failure scenario analysis



For the application that failed, the first configuration change after startup used a previously unreferenced class, com.ctrip.framework.apollo.model.ConfigChange. This triggered Tomcat's class loading, which eventually opened all the dependent jars, causing the process's handle count to surge in a very short time. (For this application, the number measured earlier was 228.)



Although there is no way to know the total number of file handles before the application failed, TCP connections alone (established and TIME_WAIT) accounted for 3200+; adding JDK-loaded class libraries (which Tomcat does not release) and other files, the total should not have been far from the 4096 limit. So when Tomcat opened 228 local files at once, it easily caused Too many open files.




6. Summary



6.1 Problem Causes



With the analysis above, the whole chain of events is clear:



1. The application's max open files limit was set to 4096



2. The application's own file handle count was already high, close to 4096



3. A user performed a configuration release in the configuration center; because of Tomcat's class loading mechanism, more than 200 local files were opened instantly, quickly hitting the 4096 limit



4. Jedis needed to re-establish connections to Redis at runtime, but the connections failed because the file handle count had exceeded the limit



5. The services the application provides externally timed out because Redis was unreachable; client requests piled up, and a vicious circle set in



6.2 Follow-up optimization measures



Through this problem, we gained a deeper understanding not only of Too many open files but also of Tomcat's class loading mechanism, which we normally pay little attention to. A brief summary of what can be optimized going forward:



1. Operating system configuration



As this case shows, we should care not only about the application's internals but also about certain system settings, such as the max open files configuration. For an ordinary application with modest traffic, 4096 may be fine; but for an application serving external traffic, 4096 is too small.
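As an illustration, on most Linux distributions the per-process limit can be raised persistently via /etc/security/limits.conf (the values below are illustrative and should be sized to the application's connection and file usage; a re-login or process restart is needed for them to take effect, as noted above):

```
# /etc/security/limits.conf
#<domain>  <type>  <item>   <value>
*          soft    nofile   65536
*          hard    nofile   65536
```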



2. Application monitoring and alerting



Application monitoring and alerting must keep up. For example, a metric for the application's open handle count could be added later, with an alarm once it exceeds a certain threshold, so that sudden system problems do not catch us off guard.



3. Initialize middleware clients early



Given Tomcat's class loading mechanism, middleware clients should perform their initialization when the program starts, loading all needed classes at that time, so as to avoid strange problems caused by class loading later at runtime.
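A hedged sketch of this idea: force-load a list of classes at startup (in a webapp this could run from a ServletContextListener at deploy time). The class names below are stand-ins, not the actual middleware client classes.

```java
import java.util.ArrayList;
import java.util.List;

public class EagerInit {

    // Try to load and initialize every named class now; return the names that
    // could not be found, so startup can fail fast instead of failing later.
    static List<String> preload(String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class.forName(name);   // load AND initialize now, not on first use
            } catch (ClassNotFoundException e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-ins for the middleware client classes an application depends on.
        List<String> missing = preload(
                "java.util.concurrent.ConcurrentHashMap",
                "java.nio.channels.SocketChannel");
        System.out.println(missing.isEmpty() ? "all classes preloaded"
                                             : "missing: " + missing);
    }
}
```

Preloading this way moves the jar-reopening cost to startup, when a handle spike is harmless, instead of paying it during the first configuration push in production.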



4. When failure strikes, don't panic, and preserve the scene



When a failure occurs in production, don't panic; if the problem cannot be resolved immediately, resolve it by restarting. However, keep at least one problematic machine untouched: it will provide valuable clues for the subsequent troubleshooting.

