Record the troubleshooting and resolution of a deadlock, and record the troubleshooting

Source: Internet
Author: User

Record the troubleshooting and resolution of a deadlock, and record the troubleshooting
This is a sad reminder. I remember last week's busy schedule. The goal is to modify some bugs in the code left by another team. This project is used for OLAP analysis, the Logic Server is mainly responsible for metadata operations, such as information displayed on the page, and the Analysis Server is responsible for executing query statements, because they used mondrian as an OLAP analysis engine, the input query is an MDX statement and the result is a two-dimensional data. This is the basic background of the project. Of course, it is found that mondrian is really slow. Mondrian has another problem. It does implement some caching internally. The Caching seems to be cell-based, but all of its caching is stored in the process, as a result, each Analysis Server is stateful and cannot be expanded to multiple ones. Otherwise, these caches cannot be used. In addition, because we need to support a large number of data sources (each product may have one or more data sources), each data source may define multiple reports, and each report corresponds to an MDX query statement, this results in a large amount of cached data, which can easily cause OOM. Therefore, our next task is to remove the cache and put it in a third-party cache system. Back to the topic, just busy preparing to go online on Friday. After the launch, I rushed into the user group and yelled at it without much verification, suddenly, the boss came up and said, why is it so slow to open a report? I checked it and found it really slow. Why didn't I find it in the test environment? After multiple verifications, I began to doubt what I might have corrected, and immediately rolled back to the previous version. Then I had to find out what was wrong. Fortunately, it is easy to reproduce this phenomenon by switching the online environment to the test environment. The main reason is to open a report and load it for a period of time, after opening the report, it will be much faster, because the subsequent operations are obtained from the cache, however, when I click "Clear cache" on the page (this operation clears the cache of the entire report and the internal cache of mondrian in real time), it will wait a long time before returning, then this operation is asynchronous, And I can perform other operations on the page. However, when I click "Clear cache" for other reports again, the Operation will slow down, and it may take some time to open other reports. The problem can be easily reproduced. I have no experience in troubleshooting java, but I also know tools such as jstack and jmap, therefore, jstack was immediately used to capture the entire process's stack (it was a pity that jstack was not executed before rollback), and it was found that there was indeed a problem:

Found one Java-level deadlock:============================="mondrian.rolap.RolapResultShepherd$executor_160":  waiting to lock monitor 0x0000000043b2bf70 (object 0x0000000702080db0, a mondrian.rolap.MemberCacheHelper),  which is held by "mondrian.rolap.RolapResultShepherd$executor_152""mondrian.rolap.RolapResultShepherd$executor_152":  waiting to lock monitor 0x00007f4e6c0751c8 (object 0x0000000702081270, a mondrian.rolap.MemberCacheHelper),  which is held by "http-8182-11""http-8182-11":  waiting to lock monitor 0x0000000043b2bf70 (object 0x0000000702080db0, a mondrian.rolap.MemberCacheHelper),  which is held by "mondrian.rolap.RolapResultShepherd$executor_152"

This means that there is a deadlock in the program. Three threads are involved here, but both of them hold a lock and want to lock the lock held by the other side, the third thread is waiting for the lock held by a thread in the first two threads. With this stack, it is easy to troubleshoot the problem, the stack information shows that many threads are waiting for the locks held by these two threads, but because these two threads are already in a deadlock status, other threads can only wait for synchronization, in this way, operations on these reports on the front end will consume the threads in tomcat sooner or later. Find the corresponding code based on the stack. The Code executes the cache cleaning operation, but the cache is created for hierarchy under each cube. Therefore, the call information in the stack is as follows:
at mondrian.rolap.SmartMemberReader.flushCacheSimple(SmartMemberReader.java:577)        - waiting to lock <0x00000007020a8990> (a mondrian.rolap.MemberCacheHelper)        at mondrian.rolap.RolapCubeHierarchy$CacheRolapCubeHierarchyMemberReader.flushCacheSimple(RolapCubeHierarchy.java:883)        at mondrian.rolap.RolapCubeHierarchy.flushCacheSimple(RolapCubeHierarchy.java:458)        at mondrian.rolap.MemberCacheHelper.flushCache(MemberCacheHelper.java:166)        - locked <0x00000007020a8e50> (a mondrian.rolap.MemberCacheHelper)        at mondrian.rolap.RolapCubeHierarchy$CacheRolapCubeHierarchyMemberReader.flushCache(RolapCubeHierarchy.java:878)        at mondrian.rolap.RolapCubeHierarchy.flushCache(RolapCubeHierarchy.java:451)

The first flushCache function is hierarchy-level cache cleaning. It actually calls the member variable reader object's member ache method. This reader is used to read members under hierarchy, it can be read directly from the data source (Relational Database Service), and the members cache is maintained. Therefore, the method of calling the mongoache method of reader is the method of calling its cache object. The cache Object Name Is rolapCubeCacheHelper, the type is MemberCacheHelper, but the specific operations performed by the sort ache method in reader are as follows:
         @Override        public void flushCache(){            super.flushCache();            rolapCubeCacheHelper.flushCache();        }

First, call the flushCache method of the parent class. What is the ghost of the parent class? Open the flushCache method of the parent class and find something strange:
     public void flushCache(){      synchronized( cacheHelper){            cacheHelper .flushCache();     }    }

This is the flushCache method of the parent class. It actually locks the cacheHelper object of the member variable, and then uses the flushCache method of cacheHelper to open the cacheHelper object to find that it is another MemberCacheHelper object, why does the parent class and subclass store a MemberCacheHelper object? In fact, the MemberCacheHelper object is a cache structure. The parent class has some public cache data, and the subclass has its own cache information. In this way, it can also be said to continue to the MemberCacheHelper class flushCache method:
     // Must sync here because we want the three maps to be modified together.    public synchronized void flushCache() {        mapMemberToChildren.clear();        mapKeyToMember.clear();        mapLevelToMembers.clear();                if (rolapHierarchy instanceof RolapCubeHierarchy){             ((RolapCubeHierarchy)rolapHierarchy ).flushCacheSimple();        }        // We also need to clear the approxRowCount of each level.        for (Level level : rolapHierarchy.getLevels()) {            ((RolapLevel)level ).setApproxRowCount(Integer. MIN_VALUE);        }    }

Here, we will clear every map in the cache and then execute the flushCacheSimple Method on this hierarchy. How can we get it back, isn't this hierarchy object the hierarchy object that we access and enter flushCache? After the flushCacheSimple method, it is found that it finally calls the flushCacheSimple method of reader. The operation executed by this function is similar to flushCache:
     public void flushCacheSimple(){           super.flushCacheSimple();           rolapCubeCacheHelper.flushCacheSimple();        }

Okay. Continue to the flushCacheSimple method of MemberCacheHelper:
    public void flushCacheSimple(){     synchronized(cacheHelper){           cacheHelper.flushCacheSimple();     }    }

I tried to clean it and lock it again. Didn't I have used it before? Of course, this lock should be reentrant. here it will not lead to deadlocks, but the following rolapCubeCacheHelper object is also a MemberCacheHelper object! Finally, enter the flushCacheSimple method, which is completely messy:
    public synchronized void flushCacheSimple() {        mapMemberToChildren.clear();        mapKeyToMember.clear();        mapLevelToMembers.clear();        // We also need to clear the approxRowCount of each level.        for (Level level : rolapHierarchy.getLevels()) {            ((RolapLevel)level).setApproxRowCount(Integer.MIN_VALUE);        }    }

Is the operation executed here different from the flushCache method ?! What is this? Of course, after so many reasons, we also found the root cause of the deadlock, that is, the flushCache method executed by reader, the flushCache of the cacheHelper object of the parent class and the current class is called, but this method also calls the flushCacheSimple method. This method calls the flushCacheSimple method of reader again, here we call the flushCacheSimple method of the parent class and the cacheHelper object of the current class again, and each call requires A lock, which leads to the following deadlock: thread A executes the flushCache method, it has completed super. the flushCache method then executes the flushCache method of the current reader object. First, you need to hold the lock of the helper Object in time, and then apply for the helper Object lock of the parent class when executing the flushCacheSimple method. The line B may be executing super. when flushCache enters this function, it means that it needs to hold the helper of the parent class, but when it executes flushCacheSimple, it needs to apply for the lock of the helper Object of the current class, which causes a deadlock.
I didn't know how the deadlock was caused before I got to the problem. So I thought that it would be possible to avoid deadlocks by executing the flushCache method sequentially in all threads (no concurrency ), however, I tried to find that this is not the case, because other threads may still call the flushCache method, which is not controlled by me, so I can only know exactly what the function is executed, I found that the flushCache and flushCacheSimple methods are actually repeated. I don't know what the people who wrote the code thought, so I removed all calls to the flushCacheSimple method, in this way, you will no longer apply for A block when you hold the lock.
The problem is solved. In the end, the hotfix version is launched, and a suspended mind is put down. I also learned a lot in this process: 1. Learn to use the analysis tools provided by java, such as jstack, jstat, jmap, and open-source MAT. 2. Don't be afraid when you encounter problems. Don't blame me for this problem because of me. I don't know what's going on and so on. Calm down and think about the entire process, using previous theoretical knowledge and experience, we can solve the problem by accident. If an error occurs, it must be caused by a code problem. 3. Testing is very important, especially stress testing. Our project is currently in short supply, and QA is not dedicated. Therefore, we basically develop and test functions in the development environment, I have not done any performance tests. I think the tests should cover all possible online situations as much as possible. 4. Roll Back before going online. Otherwise, you will be very embarrassed. Fortunately, I will back up the data before each operation. 5. When coding, especially when an operation involves multiple synchronized operations, pay special attention to the several methods to avoid deadlocks, locking in order is often the best solution. 6. Find out what a method wants to do? What is input, what is output, and what is the impact? after writing a method, simulate in the mind whether the execution process of the entire function is as expected. 7. If such a requirement is met: both the parent class and Child class hold an object of the type and allow them to operate independently. After the operation of the parent class object is completed, the Child class object is operated, instead of calling them.
Mondrian will be started in the next period of time. I hope I can learn something from this OLAP execution engine, but I have almost no knowledge about the compilation principles. I need to reinforce this, what I should focus on in mondrian should be: 1. How to parse MDX (similar to how to parse SQL ), 2. How to dynamically translate MDX into a series of SQL statements (similar to how to generate an execution plan), 3. How to Implement the cache, and 4. How to Use the cache when executing MDX or SQL statements, 5. Optimize an aggregate table. Hope it works well ~



Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.