SWT restart during Android stability testing
I. Symptom
1. The system first reports an ANR.
2. After the ANR, the system restarts.
Test method:
Stability test.
Platform: MT6732
Android: 4.4.4KK
BuildType: user
System software version: D17 + ZX
System RAM: 1 GB
Problem probability: no statistics have been collected. So far the problem has been observed only once.
Reference machine behavior:
1. Low-probability problem; no reference machine behavior is available.
II. Solution
Preliminary and in-depth analysis (the detailed process, key code, and logs are given below) identified the cause of the problem:
1. The main thread of the SystemServer process blocks while making a cross-process call to the media audio interface.
2. That call is ultimately served by the mediaserver process.
3. A deadlock occurs in the mediaserver process.
4. SystemServer first reports an ANR and is then restarted by SWT.
With the current code the deadlock has a very small probability: the sp pointer obtained inside the function must turn out to be the last strong reference to its object. When that happens, a function guarded by a Mutex::Autolock on a mutex of the default NORMAL type is called recursively, and the thread deadlocks.
Because the logic in mediaserver is complex, the fix must carry minimal risk, so the solution given here makes no large changes.
For the root cause above, we propose the following solution:
1. Extend the lifetime of the sp pointer so that it outlives the Mutex::Autolock
In the function that takes the Mutex::Autolock and may be called recursively, a vector container holds the sp pointers so that they are destructed only after the Autolock has gone out of scope and released the mutex. A recursive call triggered at that point no longer deadlocks.
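The idea can be sketched in standard C++ as follows. This is only an illustration under stated assumptions, not the actual mediaserver patch: std::shared_ptr and std::weak_ptr stand in for sp and wp, std::mutex with std::lock_guard stands in for Mutex and Mutex::Autolock, and the names Looper, Roster, unregisterStale and simulateRelease are hypothetical. The simulateRelease hook imitates another thread dropping its strong reference right after the promote.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <vector>

struct Roster;

// Hypothetical stand-in for the object whose destructor re-enters the roster.
struct Looper {
    explicit Looper(Roster* r) : roster(r) {}
    ~Looper();
    Roster* roster;
};

struct Roster {
    std::mutex lock;                                    // stand-in for mLock
    std::map<int, std::weak_ptr<Looper>> handlers;
    std::shared_ptr<Looper>* simulateRelease = nullptr; // test hook (assumption)
    int calls = 0;

    void unregisterStale() {
        ++calls;
        // Declared outside the locked block, so it is destroyed after the
        // lock guard has released the mutex.
        std::vector<std::shared_ptr<Looper>> active;
        {
            std::lock_guard<std::mutex> guard(lock);
            for (auto it = handlers.begin(); it != handlers.end();) {
                std::shared_ptr<Looper> looper = it->second.lock();
                // Imitate the other owner letting go at the worst moment.
                if (looper && simulateRelease) simulateRelease->reset();
                if (!looper) {
                    it = handlers.erase(it);
                } else {
                    active.push_back(looper);  // keep it alive past the lock
                    ++it;
                }
            }
        }  // mutex released here
    }      // active destroyed here; ~Looper may re-enter without deadlock
};

Looper::~Looper() { roster->unregisterStale(); }

// Runs the scenario and returns how often unregisterStale was entered.
int runFixedScenario() {
    Roster roster;
    std::shared_ptr<Looper> owner = std::make_shared<Looper>(&roster);
    roster.handlers[1] = owner;
    roster.simulateRelease = &owner;
    roster.unregisterStale();
    assert(roster.handlers.empty());
    return roster.calls;
}
```

Calling runFixedScenario() enters unregisterStale twice: once directly and once from the Looper destructor, which runs only after the mutex has been released.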
2. Code and backtrace related to the solution
The figure shows the backtrace and the corresponding code for the lock at the moment of the deadlock. The red lines mark the key call relationships and state when the problem occurs.
3. Code modification of the final solution
III. Preliminary analysis
Take a typical backtrace and log from an ALTO4.5TF failure as an example. The main thread of SystemServer is blocked in a native function of AudioSystem, which causes the ANR and the SWT restart. The backtrace is as follows:
Why does it block? The corresponding code shows that setParameter uses the audioflinger proxy object to call, across processes, the setParameter method of audioflinger in mediaserver. We therefore need to check whether the mediaserver process is working properly.
Following this clue, we check the call stacks of the threads in mediaserver and find that almost all Binder threads are blocked. This explains the ANR and the SWT restart: the SystemServer call is never processed. Most of the blocked Binder threads show the following backtrace:
Because mediaserver is native C/C++ code, addr2line is needed to map the PC addresses in the backtrace to source locations using the symbols files. That process is not repeated here; it is covered in other articles. The resulting source code is as follows:
The code shows that the first thing getPlayerType does is call sLock.lock(); if the lock is already held, the caller waits. So we look for the thread that holds the lock. The backtraces of the other threads show that the Binder_5 thread holds sLock but is itself blocked on another mutex. Its backtrace and code are as follows:
Continuing along this clue, we look at the remaining threads in mediaserver to see who holds the ALooperRoster mLock. The rtsp thread took mLock when it called ALooperRoster::unregisterStaleHandlers; it then indirectly called unregisterStaleHandlers again and is now blocked on that same mLock. Its backtrace and code are as follows:
IV. In-depth analysis
The preliminary analysis locates the immediate problem: the rtsp thread is blocked on an mLock that it has already locked. It also raises two questions, which the in-depth analysis below answers in order to find the root cause.
1. Why is ALooperRoster::unregisterStaleHandlers called indirectly and recursively?
2. Why does the rtsp thread block itself on mLock?
Further analysis of the code shows why the rtsp thread calls ALooperRoster::unregisterStaleHandlers indirectly and recursively. In the statement sp<ALooper> looper = info.mLooper.promote(); the strong pointer looper is scoped to the body of a for loop. In the failing case it becomes the last sp holding the ALooper object, so the object is released when looper is destructed at the end of that scope. This runs the ALooper destructor, which calls ALooperRoster::unregisterStaleHandlers again and produces the indirect recursion.
A brief introduction to sp and wp, the native-layer C++ smart pointers in Android: the constructors and destructors of sp and wp objects increment and decrement the reference counts of the object they point to. The pointed-to object must inherit from class RefBase, and it is released, according to its lifecycle flag, when its strong or weak reference count reaches 0:
The default lifecycle is STRONG: the object is released when its strong reference count reaches 0. The key code is as follows:
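To make the "last strong reference in a loop scope" case concrete, here is a small sketch in standard C++. It is an analogy under assumptions, not Android code: std::shared_ptr and std::weak_ptr stand in for sp and wp (RefBase keeps its counts inside the object, which std::shared_ptr does not), weak.lock() plays the role of promote(), and the names Tracked and promoteAndDrop are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Records the order of events so the destruction point is visible.
static std::vector<std::string> g_events;

struct Tracked {
    ~Tracked() { g_events.push_back("destroyed"); }
};

// Promotes a weak pointer inside a loop body while the only other owner
// lets go, then returns the recorded order of events.
std::vector<std::string> promoteAndDrop() {
    g_events.clear();
    std::shared_ptr<Tracked> owner = std::make_shared<Tracked>();
    std::weak_ptr<Tracked> weak = owner;  // does not keep the object alive

    for (int i = 0; i < 1; ++i) {
        std::shared_ptr<Tracked> strong = weak.lock();  // like promote()
        owner.reset();  // 'strong' is now the last strong reference
        g_events.push_back("end of loop body");
    }  // 'strong' leaves scope: strong count reaches 0, destructor runs here
    g_events.push_back("after loop");
    assert(weak.expired());
    return g_events;
}
```

The destructor runs at the closing brace of the loop body, not at the end of the function, which is why any lock still held around the loop is also held while the destructor executes.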
That answers the first question. Before turning to the second, a few words about the Autolock class, which is defined in system/core/include/utils/Mutex.h. The following code defines the Mutex class in that header:
The principle is simple: the class locks in its constructor and unlocks in its destructor, relying on C++ scoping rules. The mutex stays locked for as long as the Autolock object is in scope and is unlocked automatically when the object goes out of scope. The Mutex class wraps pthread_mutex_t, which is of the NORMAL type by default, that is, it is not reentrant and cannot be locked recursively:
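A minimal sketch of both points, written directly against pthreads. ScopedLock is a hypothetical guard in the spirit of Autolock, not the Android class, and relockResult is a hypothetical helper. It uses pthread_mutex_trylock for the second attempt so that the sketch reports the outcome instead of hanging, which is what a plain pthread_mutex_lock on a NORMAL mutex would do.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical RAII guard: locks in the constructor, unlocks in the destructor.
class ScopedLock {
public:
    explicit ScopedLock(pthread_mutex_t& m) : mutex_(m) {
        pthread_mutex_lock(&mutex_);
    }
    ~ScopedLock() { pthread_mutex_unlock(&mutex_); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;

private:
    pthread_mutex_t& mutex_;
};

// Returns what pthread_mutex_trylock reports when the calling thread
// already holds a mutex of the given type.
int relockResult(int type) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, type);
    pthread_mutex_t mutex;
    pthread_mutex_init(&mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    int result;
    {
        ScopedLock guard(mutex);                 // locked by the constructor
        result = pthread_mutex_trylock(&mutex);  // second attempt, same thread
        if (result == 0) {
            pthread_mutex_unlock(&mutex);        // undo the recursive lock
        }
    }                                            // unlocked by the destructor
    pthread_mutex_destroy(&mutex);
    return result;
}
```

With PTHREAD_MUTEX_NORMAL the second attempt is refused with EBUSY, while PTHREAD_MUTEX_RECURSIVE lets the owning thread lock again.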
ALooperRoster::unregisterStaleHandlers synchronizes with Mutex::Autolock autoLock(mLock), and the recursive call happens before the Autolock has gone out of scope. At that point mLock is still locked, and because the Mutex is non-recursive by default, the second lock attempt deadlocks. This explains why the rtsp thread blocks itself on mLock.
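The failing sequence can be reproduced in miniature. The sketch below is an assumption-laden model, not the mediaserver code: Looper, Roster, unregisterStale and simulateRelease are hypothetical names, std::shared_ptr and std::weak_ptr stand in for sp and wp, and the blocking lock is replaced by pthread_mutex_trylock so that the re-entry is reported rather than hanging the program.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <memory>
#include <pthread.h>

struct Roster;

// Hypothetical object whose destructor calls back into the roster.
struct Looper {
    explicit Looper(Roster* r) : roster(r) {}
    ~Looper();
    Roster* roster;
};

struct Roster {
    pthread_mutex_t lock;                               // stand-in for mLock
    std::map<int, std::weak_ptr<Looper>> handlers;
    std::shared_ptr<Looper>* simulateRelease = nullptr; // test hook (assumption)
    bool wouldDeadlock = false;

    Roster() { pthread_mutex_init(&lock, nullptr); }
    ~Roster() { pthread_mutex_destroy(&lock); }

    void unregisterStale() {
        // A blocking lock would wait forever on re-entry; trylock is used
        // only so that this sketch can report the situation.
        if (pthread_mutex_trylock(&lock) == EBUSY) {
            wouldDeadlock = true;
            return;
        }
        for (auto it = handlers.begin(); it != handlers.end();) {
            std::shared_ptr<Looper> looper = it->second.lock();
            // Imitate another thread dropping its strong reference.
            if (looper && simulateRelease) simulateRelease->reset();
            if (!looper) {
                it = handlers.erase(it);
            } else {
                ++it;
            }
        }  // 'looper' dies at the end of each iteration, lock still held
        pthread_mutex_unlock(&lock);
    }
};

Looper::~Looper() { roster->unregisterStale(); }

// Returns true when the destructor re-entered while the lock was held.
bool runBrokenScenario() {
    Roster roster;
    std::shared_ptr<Looper> owner = std::make_shared<Looper>(&roster);
    roster.handlers[1] = owner;
    roster.simulateRelease = &owner;
    roster.unregisterStale();
    return roster.wouldDeadlock;
}
```

runBrokenScenario() returns true: the promoted pointer becomes the last owner, its destructor runs at the end of the loop iteration, and the nested call finds the mutex already held by its own thread.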
V. Potential impact of the solution
The lifetime of the sp pointers is extended only slightly; they are released as soon as the function scope ends. The current solution therefore has no other known impact. Finally, here is the modified code with comments:
Analyzed by vincent.song from the SWD2 Framework team.
Vincent.song@tcl.com
201506242052