Android system ANR followed by an SWT restart


I. Symptoms

1. The user first sees a system ANR.

2. After the ANR, the system restarts.

Test method:

On the recording page, swipe the volume progress bar continuously. At the same time, call the test device from a landline and leave the call unanswered. The page freezes, an ANR is displayed, and the system restarts.

Platform: MT6732

Android: 4.4.4 (KitKat)

BuildType: user

System software version: SWA3A + UMA0

System RAM: 1 GB

Problem Probability: ≈ 2%

Reference machine behavior:

1. Low-probability problem; no reference machine behavior is available.

II. Solutions

Preliminary and in-depth analysis (the analysis process, key code, and logs are covered in the sections below) identifies the cause of the problem:

1. Methods of AudioService in the SystemServer process, such as setindex and getindex, run on multiple threads. They are all declared with the synchronized keyword, so each call synchronizes on the lock of the instance it is invoked on.

2. These synchronized methods also call synchronized methods on other instances of the same class.

3. Declaring an entire method synchronized is a relatively lazy practice. It makes the lock granularity coarse instead of narrowing the critical section, so under heavy multithreaded concurrency it reduces throughput and raises the chance of deadlock when instances call into each other.

With the current code there is a small probability, depending on exactly how the threads are scheduled, of the following sequence. Thread1 is scheduled first and enters a synchronized method of object A. The scheduler then switches to Thread2, which enters a synchronized method of object B. Inside that method Thread2 needs to call a synchronized method of object A, but Thread1 holds A's lock, so Thread2 blocks. Thread1 is scheduled again, continues inside A's synchronized method, and calls a synchronized method of object B. Thread2 already holds B's lock, so Thread1 blocks as well. Thread1 and Thread2 now each wait indefinitely for the other to release its lock, neither code path can make progress, and the deadlock is complete.

Because the logic in AudioService is complicated, the fix must carry minimal risk, so the solution given here does not change the code significantly. The problematic code is plainly the set of methods declared with the synchronized keyword, and AOSP still has room for optimization in this area.

Finally, we propose the following solution to the root cause described above:

1. Change the synchronization lock type

In the critical sections that need synchronization, replace the per-instance locks with a single class-level (global) lock, so that threads whose calls interleave across instances can no longer deadlock.
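As a minimal sketch of this lock change, assume a simplified per-stream state class. StreamStateModel, its alias list, and StreamStateDemo are hypothetical names for illustration, not AudioService code. Every method takes the single class lock instead of its own instance lock. Because Java monitors are reentrant, the nested call on another instance re-enters the lock the thread already holds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a per-stream volume state (not AOSP code).
class StreamStateModel {
    private final List<StreamStateModel> aliases = new ArrayList<>();
    private int index;

    // Call only during setup, before any worker thread starts.
    void addAlias(StreamStateModel other) {
        aliases.add(other);
    }

    // Previously this would be "synchronized void setIndex(int)": one lock per instance.
    void setIndex(int newIndex) {
        synchronized (StreamStateModel.class) {    // single lock shared by all instances
            index = newIndex;
            for (StreamStateModel alias : aliases) {
                alias.applyAliasedIndex(newIndex); // re-enters the same class lock
            }
        }
    }

    private void applyAliasedIndex(int newIndex) {
        synchronized (StreamStateModel.class) {
            index = newIndex;
        }
    }

    int getIndex() {
        synchronized (StreamStateModel.class) {
            return index;
        }
    }
}

class StreamStateDemo {
    // Updates two mutually aliased streams from two threads at once.
    // Returns true if both threads finish and the two indices agree.
    static boolean concurrentUpdatesFinish() {
        final StreamStateModel music = new StreamStateModel();
        final StreamStateModel ring = new StreamStateModel();
        music.addAlias(ring);
        ring.addAlias(music);
        Thread a = new Thread(() -> { for (int i = 0; i < 1000; i++) music.setIndex(i); });
        Thread b = new Thread(() -> { for (int i = 0; i < 1000; i++) ring.setIndex(1000 + i); });
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        try {
            a.join(5000);
            b.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !a.isAlive() && !b.isAlive() && music.getIndex() == ring.getIndex();
    }
}
```

With per-instance synchronized methods, the same two threads could each hold one stream's lock while waiting for the other. With the class lock, only one thread at a time is inside any of these methods.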

2. Code and backtrace related to the solution

The figure shows the backtrace at the time of the deadlock and the code that takes each lock. The red line marks the key call relationships and the state when the problem occurs.

3. Code changes for the final solution

III. Preliminary Analysis

Take a typical backtrace and log from ALTO4.5TMO as an example. The main thread of SystemServer is blocked in an AudioService function, which causes the ANR and the SWT restart. The backtrace is as follows:

Why does it block? The corresponding code shows that this method is synchronized and that, when certain conditions are met, it iterates over other instances of the same class and calls their synchronized methods. For the thread to block, one condition must hold: a synchronized method on one of those other instances cannot be entered because another thread is already inside a synchronized method of that instance.

Following this clue, we checked the call stacks of the other AudioService-related threads in SystemServer and found the thread Binder_2. Its backtrace is as follows:

The backtrace and the corresponding code show that Binder_2 is also blocked in a synchronized AudioService function, and that this function, too, calls synchronized methods on other instances of the same class when certain conditions are met.

IV. In-Depth Analysis

The preliminary analysis found the first symptom: two different threads are blocked at the same time in synchronized methods of the same class. The in-depth analysis below looks for the root cause.

1. Why are two threads simultaneously blocked?

Further study of the code shows that when two threads execute these synchronized methods and scheduling makes each depend on a lock the other holds, both block at once and a deadlock results. A backtrace shows only the call relationships, not the runtime state of each object instance, so we simulated the state of the two SystemServer threads based on the backtraces, and the result matches the problem exactly. The simulation code is as follows:

First, define a Thread subclass that receives two TestSync instances. Its run method calls a synchronized method of instance 1 and passes instance 2 to it.

Then define a TestSync class with two synchronized member functions. Each function sleeps for 10 ms at its start so that a thread switch can occur.

Finally, run the test in the activity's onResume method. The activity under test gets an ANR. Why?

As with the SystemServer ANR and SWT restart above, the activity's main UI thread and the newly created thread ct1 are deadlocked.
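The classes described above can be sketched in plain Java as follows. This is a reconstruction under assumptions, not the original test code: the method names outer and inner, the configurable sleep duration, and DeadlockDemo are hypothetical. A second daemon thread stands in for the activity's main UI thread so that the program can report the result and exit instead of hanging.

```java
// Hypothetical stand-in for the TestSync class described above.
class TestSync {
    private final long sleepMillis;

    TestSync(long sleepMillis) {
        this.sleepMillis = sleepMillis;
    }

    // Holds this instance's lock, sleeps, then needs the other instance's lock.
    synchronized void outer(TestSync other) {
        pause();
        other.inner();
    }

    // Holds this instance's lock while sleeping.
    synchronized void inner() {
        pause();
    }

    private void pause() {
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

// Hypothetical stand-in for the CThread class described above.
class CThread extends Thread {
    private final TestSync first;
    private final TestSync second;

    CThread(String name, TestSync first, TestSync second) {
        super(name);
        this.first = first;
        this.second = second;
        setDaemon(true); // lets the JVM exit even if this thread stays blocked
    }

    @Override
    public void run() {
        first.outer(second);
    }
}

class DeadlockDemo {
    // Returns true when both threads are still BLOCKED after the wait.
    static boolean runOnce(long sleepMillis) {
        TestSync t1 = new TestSync(sleepMillis);
        TestSync t2 = new TestSync(sleepMillis);
        Thread ct1 = new CThread("ct1", t1, t2);
        Thread ui = new CThread("ui-stand-in", t2, t1); // plays the main UI thread
        ct1.start();
        ui.start();
        try {
            ct1.join(1000);
            ui.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ct1.getState() == Thread.State.BLOCKED
                && ui.getState() == Thread.State.BLOCKED;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked: " + runOnce(10));
    }
}
```

Whether a given run deadlocks depends on scheduling. It does so whenever each thread acquires its first lock before the other thread finishes sleeping, and a longer sleep makes that more likely.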

The code above executes roughly as follows:

1. Create two TestSync instances, t1 and t2, and a CThread instance, ct1, and pass t1 and t2 to ct1.

2. Start the thread ct1.

3. Either ct1 is scheduled first or the main UI thread continues to run; whichever runs enters the synchronized method of t1 or t2, respectively.

4. Assume ct1 is scheduled immediately after start. It enters the synchronized method of t1 and sleeps for 10 ms, at which point a thread switch occurs.

5. The main UI thread is scheduled again, enters the synchronized method of t2, and sleeps for 10 ms. The scheduler switches to other threads.

6. The 10 ms sleep of ct1 ends first and ct1 is scheduled again. It then calls the synchronized method of t2 and blocks, because the main UI thread is already inside a synchronized method of t2 and therefore holds the lock of instance t2. The scheduler switches to other threads.

7. The 10 ms sleep of the main UI thread ends and it is scheduled again. It then calls the synchronized method of t1 and also blocks, because ct1 is already inside a synchronized method of t1 and holds the lock of instance t1.

8. At this point ct1 and the main UI thread each depend on the other and are deadlocked.

Now change the synchronized keyword in the code above to a class-level (global) lock. With this change the problem is solved and the activity no longer gets an ANR. The specific change is as follows:
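A minimal sketch of that change, under the same assumptions as the description of the simulation: TestSyncClassLock and ClassLockDemo are hypothetical names, and a second thread stands in for the main UI thread. Both methods synchronize on the class object, so there is only one lock and no thread can hold one lock while waiting for another.

```java
// Hypothetical variant of TestSync that locks on the class, not on the instance.
class TestSyncClassLock {
    private final long sleepMillis;

    TestSyncClassLock(long sleepMillis) {
        this.sleepMillis = sleepMillis;
    }

    void outer(TestSyncClassLock other) {
        synchronized (TestSyncClassLock.class) { // one lock shared by all instances
            pause();
            other.inner(); // re-enters the lock this thread already holds
        }
    }

    void inner() {
        synchronized (TestSyncClassLock.class) {
            pause();
        }
    }

    private void pause() {
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

class ClassLockDemo {
    // Returns true when both threads finish within the wait.
    static boolean runOnce(long sleepMillis) {
        final TestSyncClassLock t1 = new TestSyncClassLock(sleepMillis);
        final TestSyncClassLock t2 = new TestSyncClassLock(sleepMillis);
        Thread ct1 = new Thread(() -> t1.outer(t2), "ct1");
        Thread ui = new Thread(() -> t2.outer(t1), "ui-stand-in"); // plays the main UI thread
        ct1.setDaemon(true);
        ui.setDaemon(true);
        ct1.start();
        ui.start();
        try {
            ct1.join(5000);
            ui.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !ct1.isAlive() && !ui.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both finished: " + runOnce(10));
    }
}
```

The two calls now run one after the other instead of interleaving, which is the throughput cost of a single class-wide lock.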

V. Potential Impact of the Solution

Because a class-level lock is used and the critical sections are not subdivided, throughput may drop slightly under high concurrency. For AudioService methods in SystemServer such as setindex and getindex this impact is negligible, because these methods are lightweight and their concurrency never gets high.


Analyzed by vincent.song from the SWD2 Framework team.

Vincent.song@tcl.com

201506241646


Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
