A fun experience with debugging Java memory leaks

Everyone makes mistakes, but some mistakes are so absurd that I cannot imagine how anyone could make them. Even more unexpectedly, this kind of thing happened to us. Of course, such things only come to light in hindsight. What follows is a series of mistakes we recently made in one of our applications. The most interesting part is that the first symptoms pointed to a completely different problem from the one that actually existed.

In the dead of night

Just after midnight, I was woken by an alarm from our monitoring system. Adventory, our PPC (pay-per-click) ad system, an application responsible for indexing ads, had clearly restarted several times in a row. In a cloud environment, instance restarts are normal and do not trigger alarms, but this time the number of restarts exceeded the threshold within a short period. I opened my laptop and dove into the application logs.

Must be a network problem.

I saw that the service had timed out several times while connecting to ZooKeeper. We use ZooKeeper (ZK) to coordinate indexing between multiple instances and rely on it for robustness. Clearly, a ZooKeeper failure would prevent indexing from continuing, but it should not cause the whole system to die. Besides, such failures are very rare (this was the first time I had ever run into a ZK problem in production), and I figured it might not be easy to fix on our side. So I woke up the people on call for ZooKeeper and asked them to look into what had happened.

Meanwhile, I checked our configuration and found that the ZooKeeper connection timeout was a matter of seconds. Obviously, if ZooKeeper really was down, it would be a serious problem, because many other services use it as well. I sent messages to a few other teams, who apparently knew nothing about any issue.

A colleague from the ZooKeeper team got back to me; in his view the system was running normally. Since other users did not seem to be affected, I slowly realized it was not a ZooKeeper problem. The logs clearly showed network timeouts, so I woke up the colleague responsible for the network.

The network team checked their monitoring and found no anomalies. Since a single network segment, or even a single node, can become disconnected from the rest, they also checked the machines our instances were running on and found nothing wrong. Meanwhile I tried a few other ideas, none of which worked, and I had reached the limits of my own ingenuity. It was getting very late (or rather very early) and, independently of anything I tried, the restarts were becoming less frequent. Since this service only refreshes the data and does not affect its availability, we decided to put the problem off until morning.

Must be a GC problem.

Sometimes it is a good idea to set a problem aside, sleep on it, and come back with a clear head. Nobody knew what was going on, and the service was behaving very strangely. Then it hit me. What is the main cause of strange behavior in a Java service? Garbage collection, of course.

To be prepared for exactly this kind of situation, we had always been printing GC logs. I downloaded the GC log right away and opened Censum to analyze it. Before looking very hard, I saw a horrifying picture: a full GC was happening every 15 minutes, and each one paused the service for up to 20 seconds. No wonder the ZooKeeper connections timed out, even though ZooKeeper and the network were perfectly fine.

These pauses also explained why the whole service appeared dead instead of merely logging a timeout error. Our service runs on Marathon, which periodically checks the health of each instance, and if an endpoint does not respond within a certain time, Marathon restarts that service.

Knowing the cause is half the battle, so I was confident the problem would be solved quickly. To explain the rest of the reasoning, I first need to explain how Adventory works, because it is not your standard microservice.

Adventory indexes our ads into ElasticSearch (ES). This is a two-step process. The first step is acquiring the data: the service receives events sent from several other systems via Hermes and stores the data in a MongoDB cluster. The traffic is at most a few hundred requests per second, and each operation is so lightweight that even the occasional GC it triggers does not cost much. The second step is indexing the data. It runs on a schedule (roughly every two minutes): the data stored in MongoDB is pulled through RxJava into a stream, combined into denormalized records, and sent to ElasticSearch. This part of the work looks more like an offline batch job than a service.
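
Since the article does not show the actual code, here is a minimal sketch of what such an indexing step might look like, written with RxJava 1 (which the service used at the time). All names (Ad, Record, adsFromMongo, bulkIndexInEs, denormalize) are hypothetical stand-ins, not the real Adventory code.

import java.util.List;

import rx.Observable;

public class IndexingPipelineSketch {

    static class Ad { }       // a raw ad document from MongoDB
    static class Record { }   // a denormalized record ready for ElasticSearch

    // Pretend source: a reactive stream of ads read from MongoDB.
    static Observable<Ad> adsFromMongo() {
        return Observable.empty();
    }

    // Pretend sink: sends one bulk of records to ElasticSearch and emits the bulk size.
    static Observable<Integer> bulkIndexInEs(List<Record> bulk) {
        return Observable.just(bulk.size());
    }

    static Record denormalize(Ad ad) {
        return new Record();
    }

    public static void main(String[] args) {
        adsFromMongo()
                .map(IndexingPipelineSketch::denormalize)       // combine into denormalized records
                .buffer(1000)                                   // group records into bulk requests
                .flatMap(IndexingPipelineSketch::bulkIndexInEs)
                .toBlocking()                                   // the scheduled job runs to completion
                .forEach(count -> { /* e.g. update metrics */ });
    }
}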

Because the data often undergoes large numbers of changes, maintaining the index incrementally is not worthwhile, so each scheduled run rebuilds the entire index from scratch. This means a whole chunk of data passes through the system each time, causing a lot of garbage collection. Despite the streaming approach, we had been forced to keep growing the heap until it reached a dozen or so gigabytes. With the heap that large (and G1 fully supported by then), G1 was our garbage collector of choice.

In services I had worked on before, the application also churned through large numbers of short-lived objects. Drawing on that experience, I increased the default values of -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent so that the young generation would be larger and young GCs could process more data before objects got promoted to the old generation. Censum also showed a lot of premature tenuring, which was consistent with the full GCs appearing after a while. Unfortunately, these settings did not help at all.

Next I thought that perhaps the producer was generating data too fast for the consumer to keep up, causing records to pile up in memory before they could be processed. I tried to reduce the number of threads producing data, lowering the rate at which data was generated while keeping the size of the batches the consumer sent to ES unchanged. This was a primitive attempt at backpressure, but it did not help either.

Must be a memory leak.

At this point, a colleague whose head was still clearer than mine suggested we do what we should have done in the first place: look at what data was actually in the heap. We prepared a development instance with roughly the same amount of data and roughly the same heap size as a production instance. By attaching jvisualvm to it and running the memory sampler, we could see the approximate counts and sizes of the objects in the heap. A quick look showed that the number of our domain Ad objects was abnormally high and kept growing over the course of the indexing run, up to the order of the number of ads we were processing. But... that should not happen. After all, we were streaming this data through RX precisely so that it would not all be loaded into memory.

With growing suspicion, I inspected that part of the code. It had been written two years earlier and never carefully revisited. Sure enough, we really were loading all of the data into memory. It was not intentional, of course. Not understanding RxJava well enough at the time, we had wanted a particular piece of the code to run in parallel in a special way. To peel that work off the main RX flow, we decided to run it on a separate executor using CompletableFuture. But then we needed to wait for all of those CompletableFutures to finish... so we stored references to them and called join() at the end. As a result, until the index finished building, the references to every future, and to the data each future referred to, were all kept alive, which prevented the garbage collector from cleaning them up in time.
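
Here is a minimal sketch of the anti-pattern described above; the names (Ad, Record, denormalizeAndIndex) are hypothetical, not the real Adventory code. Every future is stored in a list and only joined at the very end, so each future, and the data it holds, stays reachable for the entire rebuild.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class KeepAllFuturesSketch {

    static class Ad { }
    static class Record { }   // the denormalized data each future holds on to

    static Record denormalizeAndIndex(Ad ad) {
        return new Record();  // pretend this builds a record and sends it to ES
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Ad> allAds = new ArrayList<>();   // imagine millions of ads here

        // Work is peeled off the main RX flow onto a separate executor...
        List<CompletableFuture<Record>> futures = new ArrayList<>();
        for (Ad ad : allAds) {
            futures.add(CompletableFuture.supplyAsync(() -> denormalizeAndIndex(ad), executor));
        }

        // ...but a reference to every future is kept until this point, so no Record
        // can become garbage before the whole index has been rebuilt.
        futures.forEach(CompletableFuture::join);

        executor.shutdown();
    }
}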

Is it so bad?

Of course it was a stupid mistake, and it is painful that we discovered it so late. I even remembered a brief discussion, long ago, about whether this application really needed such a large heap. A heap that size did seem rather big. On the other hand, the code had been running for almost two years and nothing bad had happened. We could fix it relatively easily now; two years ago the fix would probably have cost us much more time, and compared with saving a few gigabytes of memory, we had far more important work back then.

So, while from a purely technical point of view it is a shame the problem went unsolved for so long, from a strategic point of view tolerating the wasted memory may have been the more pragmatic choice. Another consideration, of course, is the impact of the problem if it ever did strike. In the end users were barely affected, but it could have turned out much worse. Software engineering is about weighing trade-offs, and deciding the priority of different tasks is no exception.

Or not?

With more experience using RX, we could solve the CompletableFuture problem quite simply: rewrite the code to use RX only, upgrade to RxJava 2 in the process, and truly stream the data instead of collecting it in memory. After code review, the changes were deployed to the development environment for testing. To our surprise, the memory required by the application did not drop at all. Memory sampling did show that the number of Ad objects in memory had decreased compared to before, and the count no longer grew monotonically but sometimes fell, so the objects were no longer all being accumulated in memory. Yet the old problem remained: apparently the data was still not genuinely being streamed.

So what's going on now?

The relevant keyword has already come up: backpressure. When data is streamed, it is normal for the producer and the consumer to work at different speeds. If the producer is faster than the consumer and cannot slow down, it keeps producing more and more data that the consumer cannot process at the same rate. The visible symptom is a growing buffer of unprocessed data, and that is exactly what was happening in our application. Backpressure is a set of mechanisms that let a slower consumer tell a faster producer to slow down.

Our indexing pipeline had no notion of backpressure, which had never been a problem before: we were keeping the whole index in memory anyway. Once we fixed the earlier problem and started to really stream the data, the lack of backpressure became obvious.
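
To illustrate the failure mode (a toy example, not the Adventory code): RxJava 2's Observable has no backpressure, so a fast producer simply pushes everything into an unbounded buffer while a slow consumer lags behind, and memory use grows with the amount of unprocessed data.

import io.reactivex.Observable;
import io.reactivex.schedulers.Schedulers;

public class MissingBackpressureSketch {

    public static void main(String[] args) {
        Observable.range(1, 10_000_000)                 // fast producer, emits as fast as it can
                .observeOn(Schedulers.computation())    // hand-off through an unbounded queue
                .doOnNext(i -> Thread.sleep(1))         // slow consumer
                .blockingSubscribe();                   // wait for the stream to finish
    }
}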

I have seen this pattern many times while solving performance problems: fixing one problem surfaces another that you had never even heard of, because the first problem was hiding it. If your house floods regularly, you will not notice that it is also a fire hazard.

Fix the problem caused by the fix

In RxJava 2, the original Observable class was split into an Observable that does not support backpressure and a Flowable that does. Fortunately, there are simple ways to turn a non-backpressure-aware source into a backpressure-aware Flowable out of the box, such as creating a Flowable from a non-reactive source like an Iterable. Combining such Flowables yields Flowables that are themselves backpressure-aware, so as soon as that one point was fixed, the whole pipeline had backpressure support.
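
Here is a minimal sketch (RxJava 2, with made-up data) of the two conversions mentioned above: building a Flowable from a plain Iterable, and converting a legacy Observable by choosing an explicit backpressure strategy. Operators applied to Flowables preserve backpressure, so the downstream consumer requests data at its own pace.

import java.util.Arrays;
import java.util.List;

import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import io.reactivex.Observable;

public class BackpressureFixSketch {

    public static void main(String[] args) {
        List<Integer> idsFromMongo = Arrays.asList(1, 2, 3);   // pretend MongoDB query result

        // 1. A Flowable built from a non-reactive Iterable supports backpressure out of the box.
        Flowable<Integer> ids = Flowable.fromIterable(idsFromMongo);

        // 2. A backpressure-unaware Observable can be converted with an explicit strategy.
        Observable<Integer> legacy = Observable.range(100, 1_000);
        Flowable<Integer> converted = legacy.toFlowable(BackpressureStrategy.BUFFER);

        ids.mergeWith(converted)
                .buffer(100)                  // group items into bulks
                .blockingSubscribe(bulk -> {
                    // send the bulk to ElasticSearch; the next one is only requested when we are ready
                });
    }
}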

With this change we reduced the heap from a dozen or so gigabytes down to 3 GB while keeping the system running at the same speed as before. We still got a full GC pausing the application for up to 2 seconds every few hours, but that was far better than the 20-second pauses (and the crashes) we had seen before.

Optimize GC again

But the story was not over yet. Looking at the GC logs, we still noticed a lot of premature tenuring, as much as 70%. Even though performance was acceptable, we tried to fix it, hoping it might also get rid of the full GCs.

If an object has a short lifetime but is promoted to the old generation anyway, we call this premature tenuring (or premature promotion). The old generation is usually much larger and is handled by different GC algorithms than the young generation, and prematurely promoted objects take up old-generation space, so they hurt GC performance. That is why we wanted to avoid premature tenuring as much as possible.

Our application produces a large number of short-lived objects during indexing, so some premature tenuring is normal, but it should not have been so severe. The first thing that comes to mind for an application producing many short-lived objects is simply to enlarge the young generation. By default, the G1 GC sizes the generations automatically, allowing the young generation to use between 5% and 60% of the heap. I had noticed that the ratio of young to old generation varied over a wide range in the running application, but I still went ahead and set two parameters, -XX:G1NewSizePercent=40 and -XX:G1MaxNewSizePercent=90, to see what would happen. It did not help, and it even made things worse: a full GC was triggered right after the application started. I tried other ratios as well, but the best result came from increasing only G1MaxNewSizePercent while leaving the minimum unchanged. That worked about as well as the defaults, which is to say it did not improve anything.

After trying many things without success, I gave up and emailed Kirk Pepperdine, a well-known Java performance expert whom I had happened to meet at Allegro's training session at the Devoxx conference. After looking at the GC logs and exchanging a few emails, Kirk recommended trying -XX:G1MixedGCLiveThresholdPercent=100. This setting should force G1's mixed GCs to clean all old-generation regions regardless of how full they are, and therefore also to clean up objects that had been promoted prematurely from the young generation, which should keep the old generation from filling up and causing a full GC. However, after a while we were surprised to see a full GC again. Kirk reasoned that he had seen this behavior in other applications and that it was a bug in G1: the mixed GCs apparently did not clean up all the garbage, letting it pile up until a full GC occurred. He said he had reported the problem to Oracle, but they insisted that the behavior we observed was not a bug and was working as intended.

Conclusion

The last thing we did was simply make the application's heap a little bigger (from 3 GB to 4 GB), and the full GCs disappeared. We still observe a lot of premature tenuring, but since performance is fine, we do not care. One option we could try is switching to the CMS (Concurrent Mark Sweep) collector, but since it has been deprecated, we would rather not use it.

So what is the moral of the story? First of all, performance problems can easily lead you astray. At first it looked like a ZooKeeper or network problem, and in the end it turned out to be a problem in our own code. Even after I realized this, the first steps I took were not well thought out: to prevent full GCs, I started tuning the GC before checking what was actually going on. This is a common trap, so remember: even if you have a gut instinct about what to do, check what is actually happening first, and then check again, so you do not waste time working on the wrong problem.

Second, performance problems are hard to solve. Our code had good test coverage and ran very well, yet it failed to meet performance requirements that had never been clearly defined at the start. Performance problems did not surface until long after deployment. Since it is usually difficult to realistically reproduce your production environment, you are often forced to test performance in production, even though that sounds like a terrible idea.

Third, solving one problem can cause another latent problem to surface, forcing you to dig much deeper than you expected. The fact that we had no backpressure was enough to break the system, but it did not show itself until we had fixed the memory leak.

I hope this amusing experience of ours helps you with the performance problems you run into.

Original article: Allegro.tech. Translation: ImportNew.com (yizhe).

Link: http://www.importnew.com/29591.html
