A case analysis of abnormal Redis access causing an OOM

Source: Internet
Author: User
Tags: gc overhead limit exceeded


----------------------------------------------------------------------------------------------------

"The premise of inference is based on facts. 」

Over the past two days I ran into an online system whose heap memory would occasionally spike without warning. The issue itself was not a big one, but some of the reasoning used in troubleshooting it may be worth borrowing, so I am summarizing it here.

Phenomenon

The memory behavior showed up on the monitoring chart: the heap spiked a few times a day, and each spike usually lasted several minutes.

When it happened, we checked the error log and found the following two OOM errors:

    • java.lang.OutOfMemoryError: GC overhead limit exceeded
    • java.lang.OutOfMemoryError: Java heap space

The error log also contained an accompanying exception thrown when accessing Redis:

    • JedisConnectionException: java.net.SocketException: Broken pipe

That was about all we could observe. After watching it for two days, there seemed to be no regularity to when the problem occurred; it never lasted long, and after a while the application recovered on its own.

Diagnosis

Based on these observations, the colleagues responsible for developing and maintaining the system suspected network instability. The java.net.SocketException: Broken pipe exception looks like a long-lived connection to Redis being interrupted, and the application with this problem happened to be newly deployed in a new IDC, from which it has to access the Redis instance deployed in the old IDC. The same application deployed in the old IDC showed no such symptoms.

Although the two IDCs are joined by high-bandwidth fiber into what is effectively one LAN, cross-IDC access is still slower than access within the same IDC, and with the application throwing network exceptions it was easy to conclude that a difference in network stability explained the difference in application behavior. But what is the connection between interrupted long-lived Redis connections and the application throwing OOM? I could not see a definite one. Moreover, the network monitoring colleagues confirmed that the network between the two IDCs was stable during the anomalies, with no packet loss and plenty of bandwidth. The "network instability" explanation therefore did not really explain anything and was hard to accept.

Also, an application that throws OOM but then recovers by itself is not suffering from a memory leak; it is a transient memory overflow. Most likely the application briefly requested more memory than the JVM heap could supply, which triggered the OOM, and the two OOM types listed above are consistent with that, especially GC overhead limit exceeded, which hints that the code itself may be at fault. To find out which code is responsible, the first step is to analyze a heap dump taken at the moment of the OOM, so the application was started with the following parameters to capture the OOM scene:

    • -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=mem.dump
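
For concreteness, a startup command using these flags might look like the sketch below; the heap sizes and jar name are placeholders of my own, only the two -XX options come from the article:

    # heap sizes and jar name are illustrative; only the -XX flags are from the article
    java -Xms4g -Xmx4g \
         -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=mem.dump \
         -jar my-service.jar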

With the heap dump file in hand, analysis with jhat or MAT quickly turned up a thread that, at the time of the crash, was trying to allocate about 1.6 GB of memory. Following that thread's stack led to the calling method, and one look at the source made everything clear: the method backs an externally exposed interface, its parameters come from external input, and it performs no sanity check on them. Instead it builds a huge array (more than 20 million integers) directly from the input, which immediately triggers the OOM and then keeps the JVM in full GC for a while, which is exactly the shape of the memory curve we saw.
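
The article does not reproduce the source, but the problematic pattern was presumably something of this shape; the class, method, and parameter names below are invented for illustration:

    // Hypothetical sketch of the vulnerable code path: an externally exposed
    // service method that sizes an array directly from unvalidated input.
    public class ScoreQueryService {

        // startId and endId come straight from the caller; nothing caps the range.
        public int[] buildIdList(int startId, int endId) {
            int count = endId - startId + 1;   // can easily be tens of millions
            int[] ids = new int[count];        // one huge allocation, no limit check
            for (int i = 0; i < count; i++) {
                ids[i] = startId + i;
            }
            return ids;
        }
    }

Twenty million int values are only about 80 MB as a primitive array, so the 1.6 GB allocation seen in the dump presumably came from converting them into command arguments and buffering the resulting Redis request.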

Thinking further

But what about the network exception on the Redis connection? Going back to the code, that array of more than 20 million integers was being passed as the arguments of a Redis command (hmget). At that point it clicked: anyone who has done server-side network programming knows that a request whose length exceeds any reasonable estimate will be treated during protocol parsing as coming from a malicious or buggy client, and the server will reject it, usually by closing the connection.
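
In Jedis terms the call was probably shaped like the sketch below; the key name and surrounding class are invented, only the use of hmget comes from the article:

    import redis.clients.jedis.Jedis;
    import java.util.List;

    // Hypothetical sketch: every id becomes one field of a single HMGET,
    // so 20+ million ids turn into one gigantic command sent to Redis.
    public class ScoreCache {
        public List<String> fetchAll(Jedis jedis, int[] ids) {
            String[] fields = new String[ids.length];
            for (int i = 0; i < ids.length; i++) {
                fields[i] = String.valueOf(ids[i]);
            }
            // The serialized command for tens of millions of fields far exceeds
            // what any sane server will accept from a single client request.
            return jedis.hmget("user:score", fields);
        }
    }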

I then went back to the Redis documentation and indeed found this description:

Query buffer hard limit: Every client is also subject to a query buffer limit. This is a non-configurable hard limit that will close the connection when the client query buffer (that is the buffer we use to accumulate commands from the client) reaches 1 GB, and is actually only an extreme limit to avoid a server crash in case of client or server software bugs.

In other words, the maximum command length Redis will accept is hard-coded at 1 GB, and anything larger is naturally rejected. See the official Redis documentation for more details.
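
The fix, accordingly, is to validate the size of external input and to split any legitimately large lookup into bounded batches. A minimal sketch of that idea, with the limits, key name, and class name chosen by me rather than taken from the article:

    import redis.clients.jedis.Jedis;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    // Hypothetical fix sketch: cap external input and batch the HMGET calls
    // so no single command comes anywhere near the 1 GB query buffer limit.
    public class BatchedScoreCache {
        private static final int MAX_FIELDS = 100_000;  // reject absurd requests outright
        private static final int BATCH_SIZE = 1_000;    // fields per HMGET

        public List<String> fetchAll(Jedis jedis, String[] fields) {
            if (fields.length > MAX_FIELDS) {
                throw new IllegalArgumentException("too many fields: " + fields.length);
            }
            List<String> result = new ArrayList<>(fields.length);
            for (int from = 0; from < fields.length; from += BATCH_SIZE) {
                int to = Math.min(from + BATCH_SIZE, fields.length);
                result.addAll(jedis.hmget("user:score", Arrays.copyOfRange(fields, from, to)));
            }
            return result;
        }
    }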

Summary

I took away two lessons from this case:

    1. Symptoms are not that reliable; you cannot generalize from fragments of what you observe.
    2. Start by suspecting your own code first.

The first point should be common sense; the analogy of a doctor making a diagnosis illustrates it well. As for the second point, the reason to suspect your own code first is simply that an application's business code is usually the least tested and least validated code involved. Everything the application depends on, whether hardware (hosts, network, switches) or software (operating system, JVM, third-party libraries), has generally been tested and validated far more widely than the business code. So unless there is very clear evidence pointing elsewhere, start by doubting yourself; in my experience this is the shortest path to the root cause most of the time.
