Software watchdog in Android

Source: Internet
Author: User

Because SystemServer in Android hosts many important services, the process contains a software watchdog that monitors whether those services are working properly. If a check does not complete within a certain time (30 seconds by default), the watchdog dumps the current state for later analysis. If the check still has not completed at the timeout (60 seconds by default), the watchdog restarts SystemServer to keep the system available. When this happens, logcat prints a message similar to the following:

W Watchdog: *** WATCHDOG KILLING SYSTEM PROCESS: Blocked in monitor com.android.server.am.ActivityManagerService on foreground thread (android.fg), Blocked in handler on ActivityManager (ActivityManager), Blocked in handler on WindowManager thread (WindowManager)


The main implementation is in /frameworks/base/services/core/java/com/android/server/Watchdog.java and /frameworks/base/core/jni/android_server_Watchdog.cpp. The overall design is simple. Watchdog is a separate thread in SystemServer that dispatches a check operation to each monitored thread at a fixed interval. The check operation calls the registered monitor objects. If a monitor object is deadlocked, or a critical thread is stuck, the check cannot finish on time, and the watchdog detects it.


Let's look at the overall class structure. Because there is only one instance, Watchdog is implemented as a singleton. It maintains a list of HandlerChecker objects, one for each thread to be checked. Each HandlerChecker in turn holds a list of monitors, one for each monitor object to be checked. An object that wants to be checked must implement the Monitor interface.
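The relationships described above can be sketched in plain Java. This is a simplified illustration and not the AOSP source: the names WatchdogSketch and CheckerSketch, the counting helpers, and the use of a thread name in place of a Handler are all assumptions made so the example runs outside Android.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the structure described above; not the AOSP code. */
final class WatchdogSketch {
    /** Objects that want to be checked implement this interface. */
    interface Monitor {
        void monitor();
    }

    /** One checker per monitored thread; it owns the monitors run on that thread. */
    static final class CheckerSketch {
        final String threadName; // hypothetical: stands in for the thread's Handler
        final List<Monitor> monitors = new ArrayList<>();

        CheckerSketch(String threadName) {
            this.threadName = threadName;
        }
    }

    // Singleton: exactly one watchdog per process.
    private static final WatchdogSketch INSTANCE = new WatchdogSketch();

    private final List<CheckerSketch> checkers = new ArrayList<>();
    private final CheckerSketch monitorChecker;

    private WatchdogSketch() {
        // The first checker is the one that runs the registered monitors.
        monitorChecker = new CheckerSketch("foreground thread");
        checkers.add(monitorChecker);
    }

    static WatchdogSketch getInstance() {
        return INSTANCE;
    }

    /** Registers a monitor object on the first checker. */
    synchronized void addMonitor(Monitor monitor) {
        monitorChecker.monitors.add(monitor);
    }

    /** Registers another thread to be checked. */
    synchronized void addThread(String threadName) {
        checkers.add(new CheckerSketch(threadName));
    }

    // Hypothetical helpers, only here so the structure can be inspected.
    synchronized int checkerCount() {
        return checkers.size();
    }

    synchronized int monitorCount() {
        return monitorChecker.monitors.size();
    }
}
```

A caller would register itself with `WatchdogSketch.getInstance().addMonitor(...)`, mirroring the registration pattern the article describes for real services.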



Initialization starts in SystemServer's startOtherServices(). The general flow is as follows:


First, in SystemServer.java, the watchdog is created and started.

    Slog.i(TAG, "Init Watchdog");
    final Watchdog watchdog = Watchdog.getInstance();
    watchdog.init(context, mActivityManagerService);
    ...
    Watchdog.getInstance().start();

In the Watchdog constructor, a HandlerChecker is created for each thread to be checked and added to the mHandlerCheckers list. The first is for FgThread. FgThread inherits from ServiceThread, is a singleton, and handles regular foreground operations that should not be blocked by background work. In Watchdog.java:

    mMonitorChecker = new HandlerChecker(FgThread.getHandler(),
            "foreground thread", DEFAULT_TIMEOUT);
    mHandlerCheckers.add(mMonitorChecker);

Next, the same is done for the main thread of system server, the UI thread, the IO thread, and the display thread. The UI, IO, and display threads, like FgThread, inherit from ServiceThread. In the init() function, registerReceiver() is then called to register a BroadcastReceiver for reboot requests. When the reboot broadcast is received, RebootRequestReceiver's onReceive() function runs and calls rebootSystem() to restart the system. This allows other modules, such as CTS, to restart the system by sending a broadcast.


Then, each service that needs to be monitored by the watchdog registers itself. These services all implement the Watchdog.Monitor interface, which consists mainly of the monitor() function. For example, in ActivityManagerService:

    Watchdog.getInstance().addMonitor(this);
    Watchdog.getInstance().addThread(mHandler);

Here addMonitor() adds the service to the monitor list of the foreground thread's HandlerChecker, and addThread() creates a HandlerChecker for the thread of the given handler and adds it to the mHandlerCheckers list. The implementation of monitor() is usually simple: it just acquires a lock and releases it. If there is a deadlock, it blocks and never returns.

    public void monitor() {
        synchronized (this) { }
    }
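To see why an empty synchronized block is enough to expose a stuck lock, here is a minimal plain-Java sketch. It is not the watchdog code: the Service class, the monitorReturnsWithin() helper, and the millisecond timeouts are hypothetical and chosen only to keep the example short.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class MonitorSketch {
    /** Hypothetical service whose monitor() only acquires and releases its own lock. */
    static final class Service {
        void monitor() {
            synchronized (this) { }
        }
    }

    /**
     * Runs service.monitor() on a helper thread and reports whether it
     * returned within timeoutMillis.
     */
    static boolean monitorReturnsWithin(Service service, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread checker = new Thread(() -> {
            service.monitor();
            done.countDown();
        }, "checker");
        checker.setDaemon(true); // a stuck checker must not keep the JVM alive
        checker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Service service = new Service();
        System.out.println("lock free: " + monitorReturnsWithin(service, 2000));
        synchronized (service) {
            // While this thread holds the lock, monitor() cannot return.
            System.out.println("lock held: " + monitorReturnsWithin(service, 200));
        }
    }
}
```

When the lock is free the check returns almost immediately. When another thread holds the lock for longer than the timeout, the check does not return in time, which is the condition the watchdog is looking for.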

Back in SystemServer, the Watchdog thread is started by Watchdog's start() method, and Watchdog.run() is executed.


The body of Watchdog.run() is a loop. In each iteration, scheduleCheckLocked() is called on every HandlerChecker. This posts the HandlerChecker object to the Looper of the monitored thread. HandlerChecker is itself a Runnable, so when the monitored thread takes it off the Looper, its run() function is called. Watchdog.run() then waits for up to 30 seconds and calls evaluateCheckerCompletionLocked() to check the result state of each HandlerChecker. There are four result states: COMPLETED (0), WAITING (1), WAITED_HALF (2), and OVERDUE (3). They mean, respectively, that the target Looper has finished processing the check, that the delay is less than 30 seconds, that the delay is between 30 and 60 seconds, and that the delay is more than 60 seconds. The overall state is the maximum of the individual states (that is, the worst case). If the overall state is COMPLETED, WAITING, or WAITED_HALF, the loop moves on to the next iteration. Note that on WAITED_HALF, that is, after waiting more than 30 seconds, dumpStackTraces() in ActivityManagerService (AMS) is called to dump the stacks, and the next iteration then waits for up to another 30 seconds.
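The state evaluation described above can be sketched as follows. The four state values are the ones given in the text; the class name, the method signatures, and the exact handling of the 30 and 60 second boundaries are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class CheckStateSketch {
    static final int COMPLETED = 0;
    static final int WAITING = 1;
    static final int WAITED_HALF = 2;
    static final int OVERDUE = 3;

    /** Assumed default: 60 s full timeout, so the half-way mark is 30 s. */
    static final long DEFAULT_TIMEOUT_MS = 60_000;

    /** State of one checker, given whether it finished and how long it has been pending. */
    static int stateOf(boolean completed, long pendingMillis, long timeoutMillis) {
        if (completed) {
            return COMPLETED;
        }
        if (pendingMillis < timeoutMillis / 2) {
            return WAITING;
        }
        if (pendingMillis < timeoutMillis) {
            return WAITED_HALF;
        }
        return OVERDUE;
    }

    /** The overall state is the worst (largest) individual state. */
    static int overallState(List<Integer> states) {
        int worst = COMPLETED;
        for (int state : states) {
            worst = Math.max(worst, state);
        }
        return worst;
    }

    public static void main(String[] args) {
        List<Integer> states = Arrays.asList(
                stateOf(true, 0, DEFAULT_TIMEOUT_MS),        // finished
                stateOf(false, 45_000, DEFAULT_TIMEOUT_MS),  // pending 45 s
                stateOf(false, 10_000, DEFAULT_TIMEOUT_MS)); // pending 10 s
        System.out.println("overall state = " + overallState(states));
    }
}
```

Because the states are ordered from best to worst, taking the maximum means a single slow checker is enough to move the whole system to WAITED_HALF or OVERDUE.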

Now suppose a thread is blocked. The HandlerChecker of the blocked thread is delayed by more than 60 seconds, so the overall state becomes OVERDUE. The watchdog then calls getBlockedCheckersLocked() and describeCheckersLocked() to report which handlers are blocked. After this information is written to the event log, the current PID and the PIDs of related processes are added to the list of processes to dump. AMS's dumpStackTraces() is then called to dump the stacks as before, and the watchdog waits 2 seconds for the stack traces to finish writing. If necessary, dumpKernelStackTraces() is called to dump the kernel part of the stack traces. It essentially reads the threads under /proc/[pid]/task and the corresponding /proc/[tid]/stack files. Next, doSysRq() is called to tell the kernel to print information and backtraces for the blocked threads (by writing to /proc/sysrq-trigger). A dedicated thread is then created to write the information to DropBox by calling AMS's addErrorToDropBox(). DropBox is a logging facility in Android that records system error information and retains it for some time so that it is not overwritten. Crashes, WTFs, low-memory events, and watchdog triggers are all recorded through DropBox.
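As a rough illustration of the /proc traversal mentioned above, the following plain-Java sketch lists the thread IDs of a process and builds the stack file path for each, using the path layout given in the text. The class and method names and the procRoot parameter are hypothetical. On a real device, reading the stack files typically needs elevated privileges.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class KernelStackSketch {
    /** Returns the numeric entries of procRoot/pid/task, sorted; empty if absent. */
    static List<Integer> listThreadIds(Path procRoot, int pid) {
        Path taskDir = procRoot.resolve(Integer.toString(pid)).resolve("task");
        List<Integer> tids = new ArrayList<>();
        if (!Files.isDirectory(taskDir)) {
            return tids;
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(taskDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (name.matches("\\d+")) {
                    tids.add(Integer.parseInt(name));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(tids);
        return tids;
    }

    /** Path of the kernel stack file for one thread, following the layout in the text. */
    static Path stackFileFor(Path procRoot, int tid) {
        return procRoot.resolve(Integer.toString(tid)).resolve("stack");
    }

    public static void main(String[] args) {
        Path proc = Paths.get("/proc");
        int pid = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        for (int tid : listThreadIds(proc, pid)) {
            System.out.println(tid + " -> " + stackFileFor(proc, tid));
        }
    }
}
```

Passing the proc root as a parameter is a design choice for this sketch only: it lets the traversal be exercised against an ordinary directory tree.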

    Thread dropboxThread = new Thread("watchdogWriteToDropbox") {
            public void run() {
                mActivity.addErrorToDropBox(
                        "watchdog", null, "system_server", null, null,
                        subject, null, stack, null);
            }
        };
    dropboxThread.start();
    try {
        dropboxThread.join(2000);  // wait up to 2 seconds for it to return.
    } catch (InterruptedException ignored) {}

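The pattern in this snippet, starting a helper thread and joining it with a timeout, keeps a slow or stuck write from delaying the watchdog indefinitely. A minimal plain-Java sketch of the same idea follows. The class name and the runWithTimeout() helper are hypothetical, and the daemon flag is an addition for this example rather than something taken from the watchdog code.

```java
final class BoundedJoinSketch {
    /**
     * Runs task on a named helper thread and waits at most timeoutMillis
     * (which must be greater than 0, since join(0) waits forever).
     * Returns true if the task finished in time.
     */
    static boolean runWithTimeout(Runnable task, String name, long timeoutMillis) {
        Thread worker = new Thread(task, name);
        worker.setDaemon(true); // assumption: a stuck writer should not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        boolean fast = runWithTimeout(
                () -> System.out.println("writing report"), "writer", 2000);
        System.out.println("finished in time: " + fast);
    }
}
```

The caller continues after the timeout whether or not the write finished, which matches the watchdog's priority of getting to the restart.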
DropBoxManagerService is added in SystemServer's startOtherServices(), and by default it stores its data under /data/system/dropbox. DropBoxManagerService implements the IDropBoxManagerService interface, and clients access the service through DropBoxManager. When the watchdog calls AMS's addErrorToDropBox(), that function starts a worker thread (because I/O is involved), which writes the previously collected stack information to DropBox together with the most recent logcat output. Finally, the entry is stored through the DropBox service's addText() interface.

Next, if an ActivityController is set, its systemNotResponding() interface is called (IActivityController is an interface used in testing and development to monitor behavior in AMS). The watchdog then checks whether a debugger is attached and whether restarting is allowed. If no debugger is attached and restarting is allowed, the kill proceeds.

    Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + subject);
    ...
    Slog.w(TAG, "*** GOODBYE!");
    Process.killProcess(Process.myPid());
    System.exit(10);

Because the watchdog runs in the same process as SystemServer, the watchdog killing itself means killing SystemServer. Because SystemServer is a core process, init restarts it after it is killed.

That is the general flow of the watchdog. Now let's look back at some details of dumpStackTraces() in AMS. The pids parameter contains the current process (where the blocked threads live), the phone process, and so on. NATIVE_STACKS_OF_INTEREST contains the following three key processes.

    public static final String[] NATIVE_STACKS_OF_INTEREST = new String[] {
        "/system/bin/mediaserver",
        "/system/bin/sdcard",
        "/system/bin/surfaceflinger"
    };

Note that although these processes are not part of SystemServer, services in SystemServer call their methods synchronously through Binder, so blocking in these processes can also cause blocking in SystemServer.

In the dumpStackTraces() implementation, the trace path is first read from the dalvik.vm.stack-trace-file system property, which defaults to /data/anr/traces.txt. The function then creates the file (if necessary), sets its permissions, and finally calls the overloaded dumpStackTraces() function to do the real dump work. The dump first uses a FileObserver (built on the inotify mechanism) to detect when the trace file has finished being written. The observer runs on its own thread, ObserverThread. SIGQUIT is then sent to each process previously added to the dump list. For virtual machine processes, SignalCatcher::HandleSigQuit() in ART (in /art/runtime/signal_catcher.cc) is called to dump the information (Dalvik is similar). For the native core services listed earlier, Debug.dumpNativeBacktraceToFile() is called to output their backtraces.

To summarize, the dumpStackTraces() process is as follows:

It collects three main kinds of information: first, the stack traces of the key processes (the PIDs collected above); second, the stack traces of a few key native services; and third, CPU usage. The first is obtained by sending SIGQUIT to the target process, because the Java virtual machine runtime catches SIGQUIT and prints stack information. The second works by sending a request to the debuggerd daemon, which uses ptrace to print the stack trace of the target process and sends it back over a local socket. Part of the implementation is in android_os_Debug.cpp and /system/core/libcutils/debugger.c. The code that issues the request and receives the data is in the following function:

    int dump_backtrace_to_file_timeout(pid_t tid, int fd, int timeout_secs) {
      ...
      sock_fd = make_dump_request(DEBUGGER_ACTION_DUMP_BACKTRACE, tid, timeout_secs);
      ...
      /* Write the data read from the socket to the fd. */
      ...
      while ((n = TEMP_FAILURE_RETRY(read(sock_fd, buffer, sizeof(buffer)))) > 0) {
        if (TEMP_FAILURE_RETRY(write(fd, buffer, n)) != n) {
          result = -1;
          break;
        }
      }
      ...

Finally, the ProcessCpuTracker class is used to measure CPU usage, mainly by reading the system's /proc/[pid]/stat files, which give the time each process has run (in user mode and kernel mode). After sampling for half a second, the stack traces of the top few CPU-consuming processes are output in sorted order, which helps identify the likely culprit.
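To make the /proc/[pid]/stat reading concrete, here is a small plain-Java sketch that extracts the user-mode and kernel-mode times from one stat line. It is not ProcessCpuTracker itself. The class and method names are hypothetical, and the field positions (utime is field 14, stime is field 15) follow the Linux proc(5) manual page.

```java
final class ProcStatSketch {
    /** Returns {utime, stime} in clock ticks from one line of /proc/[pid]/stat. */
    static long[] parseCpuTicks(String statLine) {
        // The command name (field 2) is wrapped in parentheses and may itself
        // contain spaces, so only the part after the last ')' is split.
        int close = statLine.lastIndexOf(')');
        if (close < 0) {
            throw new IllegalArgumentException("no command field in stat line");
        }
        String[] rest = statLine.substring(close + 1).trim().split("\\s+");
        // rest[0] is field 3 (state), so field 14 is rest[11] and field 15 is rest[12].
        long utime = Long.parseLong(rest[11]);
        long stime = Long.parseLong(rest[12]);
        return new long[] { utime, stime };
    }

    public static void main(String[] args) {
        // Made-up sample line for illustration only.
        String sample = "1234 (my proc) S 1 1234 1234 0 -1 4194560 100 0 2 0 17 5 0 0 20 0 1 0 100";
        long[] ticks = parseCpuTicks(sample);
        System.out.println("utime=" + ticks[0] + " stime=" + ticks[1]);
    }
}
```

Taking two such readings half a second apart and subtracting them gives the CPU time each process used during the sampling window, which is the quantity used for ranking.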

In short, the watchdog is a software mechanism that detects deadlocks or hangs inside the SystemServer process and recovers from them. Besides the watchdog, Android has other self-check and fault-tolerance mechanisms as well as mechanisms for collecting error information. The former include ANR, the OOM killer, and the restart mechanism in init. The latter include DropBox, debuggerd, the event log, bugreport, and so on. There are also many commands for viewing information and debugging, such as dumpsys, dumpstate, showslab, procrank, procmem, latencytop, librank, schedtop, svc, am, wm, atrace, proc, pm, service, getprop/setprop, logwrapper, input, and getevent/sendevent. Making full use of these tools can greatly improve the efficiency of problem analysis.
