Watchdog of the system process


written by: Li Wendong/rayleeya

http://rayleeya.iteye.com/blog/1963408

3.1 Watchdog introduction

As a pure software programmer who has never worked with hardware, I was honestly confused the first time I came across this term, though I thought the name was interesting. Some research showed that the watchdog mechanism originated in hardware. In a computer system, a microcontroller is susceptible to interference from external electromagnetic fields and can fall into an infinite loop, leaving the system unable to continue working. To solve this problem, a dedicated chip was created to monitor the running state of the microcontroller's program, commonly known as a "watchdog".

The "watchdog" itself is a timer circuit whose internal timer (or counter) runs continuously. The computer system is connected to the watchdog through two pins. During normal operation the system sends a signal to the watchdog at regular intervals through one of the pins; when the watchdog receives the signal, it resets its timer to zero and starts timing again. If the system runs into a problem, such as an infinite loop or any blocked state, and cannot send the signal in time to reset the timer, then when the time runs out the watchdog sends a "reset signal" to the system through the other pin and makes the system restart.

In this picture, the signal the system sends to the watchdog is "feeding the dog", and the timer is the watchdog's stomach. When the time runs out the dog is hungry, so it kills the system with one bite and lets it be reborn.
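The feed-or-reset behaviour described above can be modelled in a few lines of Java. This is only a toy sketch: the class ToyWatchdogTimer and its methods feed and tick are hypothetical names invented for illustration, and time is passed in explicitly as a number of milliseconds so that the behaviour is deterministic.

```java
/** Toy model of the hardware watchdog timer described above; all names are hypothetical. */
class ToyWatchdogTimer {
    private final long timeoutMillis;
    private long lastFedAt;          // time of the most recent "feed" signal
    private boolean resetFired;      // becomes true once the reset signal has been sent

    ToyWatchdogTimer(long timeoutMillis, long now) {
        this.timeoutMillis = timeoutMillis;
        this.lastFedAt = now;
    }

    /** "Feeding the dog": clears the timer so that counting starts again. */
    void feed(long now) {
        lastFedAt = now;
    }

    /** Advances the timer to the given time; returns true once the reset signal has fired. */
    boolean tick(long now) {
        if (now - lastFedAt >= timeoutMillis) {
            resetFired = true;       // the dog is hungry: reset the system
        }
        return resetFired;
    }

    public static void main(String[] args) {
        ToyWatchdogTimer wd = new ToyWatchdogTimer(100, 0);
        wd.feed(50);                             // a healthy system feeds in time
        System.out.println(wd.tick(120));        // false: only 70 ms since the last feed
        System.out.println(wd.tick(150));        // true: 100 ms passed with no feed
    }
}
```

A real hardware watchdog counts on its own clock; the explicit time argument here is just a simplification that keeps the example easy to follow.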

Watchdog technology has carried over into software with much the same idea; Linux, for example, comes with a watchdog. Let's take a look at the watchdog of the Android system process.

3.2 Watchdog of the system process

The Watchdog (hereinafter WD) in the Android system process is used to monitor the system process. How does it differ from a hardware WD? The system process maintains a large number of service objects, some of which are very important, such as ActivityManagerService and WindowManagerService. Whether these service objects can be accessed normally is critical to the operation of the system; let us call them key objects for now. Key objects may be used by multiple threads at the same time, so they must be protected with synchronization locks wherever they are manipulated, to keep the object state consistent. However, if a thread locks a key object and then does not release the lock for a long time, other threads cannot use the object to complete their subsequent tasks. The system is then stuck and cannot run, and it must be restarted to return to a normal state. Detecting whether these key service objects are locked up, and restarting the system if so, is the job of WD.

How does WD accomplish this sacred mission? Let's first look at how WD is created and started, and then dissect its structure and workflow.

3.2.1 Watchdog initialization and startup

The WD object is a singleton that is created in the ServerThread thread during system startup.

ServerThread.java→run()

    Slog.i(TAG, "Init Watchdog");
    Watchdog.getInstance().init(context, battery, power, alarm,
            ActivityManagerService.self());

Note that the WD constructor creates an object of type HeartbeatHandler, which is described in detail later. What is certain for now is that this handler instance is bound to the Looper of the ServerThread thread.

Watchdog.java→Watchdog()

    private Watchdog() {
        super("watchdog");
        mHandler = new HeartbeatHandler();
    }

Now let's look at what the init method actually does.

Watchdog.java→init()

    public void init(Context context, BatteryService battery,
            PowerManagerService power, AlarmManagerService alarm,
            ActivityManagerService activity) {
        // Save a few service objects for later use
        mResolver = context.getContentResolver();
        mBattery = battery;
        mPower = power;
        mAlarm = alarm;
        mActivity = activity;

        // Register two BroadcastReceivers to receive reboot messages
        context.registerReceiver(new RebootReceiver(),
                new IntentFilter(REBOOT_ACTION));
        mRebootIntent = PendingIntent.getBroadcast(context,
                0, new Intent(REBOOT_ACTION), 0);

        context.registerReceiver(new RebootRequestReceiver(),
                new IntentFilter(Intent.ACTION_REBOOT),
                android.Manifest.permission.REBOOT, null);

        mBootTime = System.currentTimeMillis(); // Record the start time
    }

The init method is simple; the pieces it touches are described later.

WD does not start running as soon as initialization completes. The WD class declaration shows that it is a subclass of Thread, so WD implements its "timer" function in a thread of its own. This is reasonable: to do the timing work that independent hardware would do, nothing suits better than an independent thread that monitors the process. The WD thread is started in the final phase of the ServerThread thread.

    ActivityManagerService.self().systemReady(new Runnable() {
        public void run() {
            ... ...
            Watchdog.getInstance().start();
            ... ...
        }
    });

3.2.2 Watchdog structure analysis

For ease of description, here is the WD class diagram first.



The structure of WD is relatively simple, and we will analyze it in several parts below.

1. Watchdog

WD's "timer" function runs in a separate thread, so WD itself extends Thread. As stated earlier, it is started from ActivityManagerService's systemReady method.

2. HeartbeatHandler

The WD of the Android system process follows the same idea as the hardware "watchdog", but it is implemented differently. While timing, the WD thread does not passively wait for the system to send a "feed the dog" signal. Instead, at the start of each timing round it sends a detection message to the ServerThread (hereinafter ST) thread. When ST receives the message, it traverses the collection of monitor objects and tries to acquire each object's lock. This message handling is implemented in HeartbeatHandler (hereinafter HH). Note that HH is bound to the ST thread, so it is ST, acting as the main thread of the system process, that performs the detection.

3. Monitor and monitored service objects

WD monitors a number of key service objects in the system process. These objects share an abstract definition, the Monitor interface, which has only one method, monitor. WD monitors objects that implement this interface and holds a collection of Monitor objects internally; any object that implements the Monitor interface and is registered in the collection through WD's addMonitor method can be monitored.

Before Gingerbread, the only monitored service objects were ActivityManagerService, WindowManagerService and PowerManagerService. Since then four more have been added: NetworkManagementService, MountService, NativeDaemonConnector and InputManager.

Implementing the Monitor interface is simple. Take the AMS implementation as an example:

ActivityManagerService.java→monitor()

    public void monitor() {
        synchronized (this) { }
    }

As you can see, so-called "monitoring a service object" is, put plainly, deadlock detection on that object. If the lock of the monitored object can be acquired, the system is considered to be running normally; if it cannot be acquired for a long time, the system is considered stalled and action must be taken.
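To see why an empty synchronized block works as a probe, here is a small plain-Java sketch. It is not the AOSP code: ToyService, MonitorProbeDemo, probeCompletesWithin and holdLock are hypothetical names, and the Monitor interface below merely mirrors the one-method interface described above.

```java
import java.util.concurrent.CountDownLatch;

/** Mirrors the one-method interface described above (a sketch, not the AOSP source). */
interface Monitor {
    void monitor();
}

/** Hypothetical stand-in for a service that is guarded by its own intrinsic lock. */
class ToyService implements Monitor {
    @Override
    public void monitor() {
        // Acquire and immediately release our own lock; blocks while another thread holds it.
        synchronized (this) { }
    }
}

class MonitorProbeDemo {
    /** Runs m.monitor() on a helper thread; returns true if it finished within the timeout. */
    static boolean probeCompletesWithin(Monitor m, long timeoutMillis) {
        Thread probe = new Thread(m::monitor, "probe");
        probe.setDaemon(true);
        probe.start();
        try {
            probe.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !probe.isAlive();
    }

    /** Holds the object's lock on another thread until the returned Runnable is run. */
    static Runnable holdLock(Object service) {
        CountDownLatch acquired = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (service) {
                acquired.countDown();
                try { release.await(); } catch (InterruptedException ignored) { }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();
        try {
            acquired.await();        // return only once the lock is really held
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return release::countDown;
    }

    public static void main(String[] args) {
        ToyService service = new ToyService();
        System.out.println("lock free: " + probeCompletesWithin(service, 5000));
        Runnable release = holdLock(service);
        System.out.println("lock held: " + probeCompletesWithin(service, 200));
        release.run();
    }
}
```

While the holder thread owns the lock, the probe cannot return, which is exactly the condition WD interprets as a stalled service.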

4. RebootReceiver and RebootRequestReceiver



3.2.3 Watchdog workflow

WD's workflow is mainly the interaction between the WD thread and HH, which runs on the ST thread. We start with WD's run method.

Watchdog.java→run()

    public void run() {
        boolean waitedHalf = false; // Records whether half of the total wait has already passed
        while (true) {
            mCompleted = false; // Boolean flag marking whether this round of deadlock detection has completed

            // Send the heartbeat to HH; it is handled on the ST thread
            mHandler.sendEmptyMessage(MONITOR);

            synchronized (this) {
                // After sending the heartbeat, wait for the detection to complete;
                // in normal operation the wait time is 30 seconds.
                long timeout = TIME_TO_WAIT;
                long start = SystemClock.uptimeMillis();
                while (timeout > 0 && !mForceKillSystem) {
                    try {
                        wait(timeout);
                        ... ...
                    }
                    ... ...

The code above shows that the monitoring process is an infinite loop, and each iteration performs one round of deadlock detection. There are two points to note, described below:

(1) WD's "timer" uses a 30-second timeout for each round of detection, but when a round times out WD does not restart the system immediately. Instead it sets waitedHalf to true, treating the elapsed time as only half of the total wait. In other words, WD gives the locked object one more chance and runs two rounds of detection, about 60 seconds in total; if the second round also times out, it is not too late to kill then. WD is quite humane in this respect. You will see later where waitedHalf is set to true.

(2) Each round of detection is not performed by the WD thread itself. Instead WD sends a message to HH, and the check is carried out by the ST thread to which HH is bound. How is it done? Let's go into HH to find out.

    final class HeartbeatHandler extends Handler {
        ... ...
        case MONITOR: {
            ... ...
            final int size = mMonitors.size();
            for (int i = 0; i < size; i++) {
                // It is important to record the monitor object currently being checked
                mCurrentMonitor = mMonitors.get(i);
                mCurrentMonitor.monitor(); // Check for a deadlock
            }

            synchronized (Watchdog.this) {
                mCompleted = true; // Mark detection as complete; the WD thread uses this flag to decide whether the round finished
                mCurrentMonitor = null;
                ... ...

The logic is simple and needs little explanation. Just note that HH does not wake the WD thread with notifyAll after the deadlock detection completes, so the WD thread normally starts the next round only after the timeout expires. HH is rather lazy.
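One round of this handshake can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not the Android implementation: a single-thread executor stands in for the ST thread and its Looper, each Runnable in the list plays the role of a Monitor.monitor() call, and the names HeartbeatRoundSketch, runRound, checkAll and stuckIndex are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class HeartbeatRoundSketch {
    private final List<Runnable> monitors;   // each entry plays the role of Monitor.monitor()
    // A single daemon worker thread stands in for the ST thread that HH is bound to.
    private final ExecutorService serverThread = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "toy-server-thread");
        t.setDaemon(true);
        return t;
    });
    private boolean completed;               // guarded by "this"
    private int currentIndex = -1;           // guarded by "this"; -1 means no check in progress

    HeartbeatRoundSketch(List<Runnable> monitors) {
        this.monitors = monitors;
    }

    /** Runs one detection round; returns true if every check finished before the timeout. */
    boolean runRound(long timeoutMillis) {
        synchronized (this) {
            completed = false;
        }
        serverThread.execute(this::checkAll);            // the "heartbeat" message
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (this) {
            long remaining = timeoutMillis;
            while (remaining > 0) {
                try {
                    wait(remaining);                     // nobody notifies, so the full timeout elapses
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
                remaining = (deadline - System.nanoTime()) / 1_000_000L;
            }
            return completed;
        }
    }

    /** Executed on the worker thread: run every check in order, then mark the round complete. */
    private void checkAll() {
        for (int i = 0; i < monitors.size(); i++) {
            synchronized (this) { currentIndex = i; }    // remember which check is running
            monitors.get(i).run();                       // blocks here if that check is stuck
        }
        synchronized (this) {
            completed = true;
            currentIndex = -1;
        }
    }

    synchronized int stuckIndex() {
        return currentIndex;
    }

    public static void main(String[] args) {
        Runnable healthy = () -> { };
        Runnable stuck = () -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) { }
        };
        HeartbeatRoundSketch ok = new HeartbeatRoundSketch(Arrays.asList(healthy, healthy));
        System.out.println(ok.runRound(200) + " " + ok.stuckIndex());
        HeartbeatRoundSketch bad = new HeartbeatRoundSketch(Arrays.asList(healthy, stuck));
        System.out.println(bad.runRound(200) + " " + bad.stuckIndex());
    }
}
```

As in the description above, the checker never notifies the waiting thread, so even a healthy round takes the full timeout; the recorded index is what lets the caller report which check was stuck.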

Next go back to the WD thread.

    ... // The WD thread has finished waiting

    if (mCompleted && !mForceKillSystem) {
        // Detection succeeded: reset the waitedHalf flag and go on to the next round
        waitedHalf = false;
        continue;
    }

    if (!waitedHalf) {
        // Reaching this point means the detection was blocked and did not complete,
        // and waitedHalf is false, so this is the first failed round of deadlock
        // detection. Use AMS to write the call stacks of every thread in the system
        // process to /data/anr/traces.txt, together with the backtraces of several
        // important native processes. This gives more information for locating the
        // problem, because a blockage in the Java layer is quite likely caused by a
        // blockage in the native layer.
        ArrayList<Integer> pids = new ArrayList<Integer>();
        pids.add(Process.myPid());
        ActivityManagerService.dumpStackTraces(true, pids, null, null,
                NATIVE_STACKS_OF_INTEREST);
        waitedHalf = true; // Set to true to indicate that the WD thread has already waited one round
        continue; // Run one more round to give it another chance
    }

    // If the second round of deadlock detection also fails, the code above is
    // skipped and execution continues below.
    // From here on, various kinds of log information are output to make it easier
    // to analyze the cause of the deadlock.
    final String name = (mCurrentMonitor != null) ?
            mCurrentMonitor.getClass().getName() : "null";

    // Record the object that was performing deadlock detection
    EventLog.writeEvent(EventLogTags.WATCHDOG, name);

    // Output the call stacks of the system process again
    ArrayList<Integer> pids = new ArrayList<Integer>();
    pids.add(Process.myPid());
    // Also output the call stacks of the com.android.phone process, because
    // telephony is the most important module on a phone and deserves special attention
    if (mPhonePid > 0) pids.add(mPhonePid);
    final File stack = ActivityManagerService.dumpStackTraces(
            !waitedHalf, pids, null, null, NATIVE_STACKS_OF_INTEREST);
    ... ...
    if (RECORD_KERNEL_THREADS) {
        dumpKernelStackTraces(); // Output some kernel information to help locate the problem
    }
    ... ...
    // Write to DropBox in a child thread
    mActivity.addErrorToDropBox(
            "watchdog", null, "system_server", null, null,
            name, null, stack, null);
    ... ...
    // If no debugger is attached, exit the process directly
    if (!Debug.isDebuggerConnected()) {
        Slog.w(TAG, "*** WATCHDOG KILLING SYSTEM PROCESS: " + name);
        Process.killProcess(Process.myPid());
        System.exit(10);
    } else { // When debugging, you can continue debugging at a breakpoint
        ... ...

There are three key points to be aware of:

(1) The object that was performing deadlock detection is recorded. If name is "null", it effectively means that the ServerThread thread never got to execute HH's handleMessage method because it is blocked somewhere else. Note also that if analysis of the trace information shows the blockage was not caused by the monitored key service object, you need to look at the call stack of the ServerThread thread to determine the real cause.

(2) Two rounds of deadlock detection run before the system restarts. The first round creates a new traces.txt file, while the second round appends to the existing file, so the trace information contains the call stacks of the system process's threads twice.

(3) The output is written to DropBox in a child thread, so if the deadlock information saved in traces.txt is overwritten before you can retrieve it, you can find a backup of the log in the /data/system/dropbox directory.
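The two-round policy from point (2) can be written down as a small state machine. The sketch below is only a model of the decision logic walked through above; the class TwoRoundPolicy, the Action enum and the method onRoundFinished are hypothetical names, and the actual dumping and killing are left out.

```java
/** Models the waitedHalf decision logic described above; all names are hypothetical. */
class TwoRoundPolicy {
    enum Action { CONTINUE, DUMP_AND_RETRY, KILL }

    private boolean waitedHalf = false;  // true after one failed round

    /** Decides what to do once a detection round has ended. */
    Action onRoundFinished(boolean completed) {
        if (completed) {
            waitedHalf = false;          // a healthy round clears the earlier strike
            return Action.CONTINUE;
        }
        if (!waitedHalf) {
            waitedHalf = true;           // first failure: dump stacks, give one more chance
            return Action.DUMP_AND_RETRY;
        }
        return Action.KILL;              // second consecutive failure: kill the process
    }

    public static void main(String[] args) {
        TwoRoundPolicy policy = new TwoRoundPolicy();
        System.out.println(policy.onRoundFinished(false)); // DUMP_AND_RETRY
        System.out.println(policy.onRoundFinished(true));  // CONTINUE
        System.out.println(policy.onRoundFinished(false)); // DUMP_AND_RETRY
        System.out.println(policy.onRoundFinished(false)); // KILL
    }
}
```

Only two consecutive failed rounds lead to a kill; a successful round in between resets the flag, just as the run method resets waitedHalf.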

In essence, the watchdog implementation builds on a message loop in one thread, with the threads communicating through messages and member variables. This is the same in nature as the Handler mechanism.

That completes the walkthrough of WD's workflow. It is relatively simple, and a WD flowchart is attached at the end of this chapter. Next we introduce some methods for analyzing restarts that WD triggers after detecting a deadlock; for Android system engineers, dealing with such problems is certainly routine.

3.3 Analyzing watchdog-induced restarts

3.3.1 Monitored object deadlock

To be continued

3.3.2 ServerThread thread blocking

To be continued

Flowchart of watchdog

