"Problem phenomenon"
When the "God Horse Search" (Shenma Search, com.uc.searchbox) app is launched, the system restarts with high probability.
"Problem analysis"
In the main log, apart from the app's native exception (NE) log and the zygote restart log, there are no other obvious exceptions:
11-05 15:14:51.824 11179 11179 I DEBUG   : pid: 23631, tid: 23693, name: ucsdk_setup  >>> com.uc.searchbox <<<
11-05 15:14:51.824 11179 11179 I DEBUG   : signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
11-05 15:14:51.914 11179 11179 I DEBUG   :     r0 00000000  r1 00005c8d  r2 00000006  r3 00000000
11-05 15:14:51.914 11179 11179 I DEBUG   :     r4 00000006  r5 00000016  r6 00005c8d  r7 0000010c
11-05 15:14:51.914 11179 11179 I DEBUG   :     r8 7b6aa841  r9 00000000  sl 4173fc38  fp 71be6c90
11-05 15:14:51.914 11179 11179 I DEBUG   :     ip 7c27cf8c  sp 7a0da770  lr 400b5169  pc 400c410c  cpsr 000f001
...
11-05 15:14:54.224 24116 24116 D AndroidRuntime:
11-05 15:14:54.224 24116 24116 D AndroidRuntime: >>>>>> AndroidRuntime START com.android.internal.os.ZygoteInit <<<<<<
11-05 15:14:55.054   232   232 I ServiceManager: service 'simphonebook.0' died
11-05 15:14:55.054   232   232 I ServiceManager: service 'simphonebook' died
11-05 15:14:55.054   232   232 I ServiceManager: service 'iphonesubinfo.0' died
There is no exception in the kernel log.
Trace zygote with strace:
# ps | grep zygote
root      18987 1     863628 607   ffffffff 400a9854 S zygote
# strace -ctttip 18987
15:14:53.566312 [400c2648] madvise(0x71aac0XX, 16384, 0xc /* MADV_??? */) = -1 EINVAL (Invalid argument) <0.000037>
15:14:53.566502 [400c25a8] mmap2(NULL, 1040384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x71ab0000 <0.000041>
15:14:53.566696 [400c2648] madvise(0x71ab0000, 1040384, 0xc /* MADV_??? */) = -1 EINVAL (Invalid argument) <0.000037>
15:14:53.566942 [400c2628] mprotect(0x71ab0000, 4096, PROT_NONE) = 0 <0.000045>
15:14:53.567160 [400c39d8] clone(child_stack=0x71baddc8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM) = 24057 <0.000065>
15:14:53.567381 [400c3a84] futex(0x71baddd0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000083>
15:14:53.567671 [400c3a84] futex(0x4174612c, FUTEX_WAKE_PRIVATE, 2147483647) = 1 <0.000052>
15:14:53.567882 [400c3a84] futex(0x41746128, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000081>
15:14:53.568118 [400c3a84] futex(0x6fccbdf0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000063>
15:14:53.568346 [400c211c] getpgid(0x4b30) = 18987 <0.000039>
15:14:53.568542 [400c231c] setpgid(24049, 18987) = 0 <0.000039>
15:14:53.568768 [400c3538] sendmsg(, {msg_name(0)=NULL, msg_iov(1)=[{"\0\0"?", 4}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 4 <0.000061>
15:14:53.569092 [400c3538] sendmsg(, {msg_name(0)=NULL, msg_iov(1)=[{"+", 1}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1 <0.000053>
15:14:53.569426 [400c2854] select(51, [10], NULL, NULL, NULL) = ? ERESTARTNOHAND (To be restarted) <0.328044>
15:14:53.905148 [????????] +++ killed by SIGHUP +++
Zygote turned out to have been killed by SIGHUP!
Zygote runs as root, and an ordinary app cannot send a signal from user space to kill a root process, so the scope of the problem narrows to kernel code.
Searching the kernel code for SIGHUP turns up the following suspicious function:
@kernel/kernel/exit.c
static void
kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
{
	struct pid *pgrp = task_pgrp(tsk);
	struct task_struct *ignored_task = tsk;

	if (!parent)
		parent = tsk->real_parent;
	else
		ignored_task = NULL;

	if (task_pgrp(parent) != pgrp &&
	    task_session(parent) == task_session(tsk) &&
	    will_become_orphaned_pgrp(pgrp, ignored_task) &&
	    has_stopped_jobs(pgrp)) {
		__kill_pgrp_info(SIGHUP, SEND_SIG_PRIV, pgrp);
		__kill_pgrp_info(SIGCONT, SEND_SIG_PRIV, pgrp);
	}
}
Enable the kernel's signal trace events:
# echo 1 > /d/tracing/events/signal/enable
# cat /d/tracing/trace_pipe
<...>-9083 [003] d..3 88396.530478: signal_generate: sig=1 errno=0 code=128 comm=chmod pid=9083 grp=1 res=0
<...>-9083 [003] d..3 88396.530567: signal_generate: sig=1 errno=0 code=128 comm=ipayGphone:push pid=9024 grp=1 res=0
<...>-9083 [003] d..3 88396.530728: signal_generate: sig=1 errno=0 code=128 comm=m.taobao.taobao pid=8889 grp=1 res=0
<...>-9083 [003] d..3 88396.530891: signal_generate: sig=1 errno=0 code=128 comm=id.AlipayGphone pid=8824 grp=1 res=0
<...>-9083 [003] d..3 88396.531072: signal_generate: sig=1 errno=0 code=128 comm=.searchbox:push pid=8752 grp=1 res=0
...
<...>-9083 [003] dN.3 88396.535071: signal_generate: sig=1 errno=0 code=128 comm=zygote pid=11158 grp=1 res=0
<...>-9083 [003] dN.3 88396.535073: signal_generate: sig=18 errno=0 code=128 comm=chmod pid=9083 grp=1 res=1
<...>-9083 [003] dN.3 88396.535075: signal_generate: sig=18 errno=0 code=128 comm=ipayGphone:push pid=9024 grp=1 res=1
<...>-9083 [003] dN.3 88396.535077: signal_generate: sig=18 errno=0 code=128 comm=m.taobao.taobao pid=8889 grp=1 res=1
<...>-9083 [003] dN.3 88396.535079: signal_generate: sig=18 errno=0 code=128 comm=id.AlipayGphone pid=8824 grp=1 res=1
<...>-9083 [003] dN.3 88396.535081: signal_generate: sig=18 errno=0 code=128 comm=.searchbox:push pid=8752 grp=1 res=1
...
<...>-9083 [003] dN.3 88396.535363: signal_generate: sig=18 errno=0 code=128 comm=com.miui.core pid=11542 grp=1 res=1
<...>-9083 [003] dN.3 88396.535365: signal_generate: sig=18 errno=0 code=128 comm=ndroid.systemui pid=11522 grp=1 res=1
<...>-9083 [003] dN.3 88396.535368: signal_generate: sig=18 errno=0 code=128 comm=system_server pid=11430 grp=1 res=1
<...>-9083 [003] dN.3 88396.535370: signal_generate: sig=18 errno=0 code=128 comm=zygote pid=11158 grp=1 res=1
The log shows clearly that task 9083 sends SIGHUP (1) and SIGCONT (18) to all of these processes, including zygote.
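When the trace is long, the sender and targets can be pulled out mechanically. Below is a minimal C sketch, assuming the signal_generate payload has the field order seen in the trace (sig, errno, code, comm, pid); `struct sig_event` and `parse_signal_generate` are hypothetical names, not part of any tool. Note that code=128 is SI_KERNEL (0x80): the signal was generated inside the kernel rather than by a kill() call from user space, which fits the suspicion about kill_orphaned_pgrp().

```c
#include <stdio.h>
#include <string.h>

/* The fields of one signal_generate event that matter for this analysis. */
struct sig_event {
    int sig;        /* signal number, e.g. 1 = SIGHUP, 18 = SIGCONT */
    int code;       /* si_code, 128 = SI_KERNEL */
    char comm[32];  /* name of the target task */
    int pid;        /* pid of the target task */
};

/* Parse one ftrace line. Returns 1 on success, 0 if the line is not a
 * signal_generate event or does not have the assumed layout.
 * Limitation of this sketch: a comm containing spaces is not handled. */
int parse_signal_generate(const char *line, struct sig_event *ev)
{
    const char *p = strstr(line, "signal_generate:");
    int err;

    if (!p)
        return 0;
    p += strlen("signal_generate:");
    if (sscanf(p, " sig=%d errno=%d code=%d comm=%31s pid=%d",
               &ev->sig, &err, &ev->code, ev->comm, &ev->pid) != 5)
        return 0;
    return 1;
}
```

Feeding every line of trace_pipe through `parse_signal_generate` and keeping the events with `sig == 1` gives the list of SIGHUP targets.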
This further suggests that the culprit is kill_orphaned_pgrp().
The kernel calls this function while a process is exiting, to check whether the exit leaves a process group orphaned.
The next step is to monitor process events such as fork, exec, and exit.
Enable the fork, exec, and exit trace events and capture the trace log:
# echo 1 > /d/tracing/events/sched/sched_process_fork/enable
# echo 1 > /d/tracing/events/sched/sched_process_exec/enable
# echo 1 > /d/tracing/events/sched/sched_process_exit/enable
# cat /d/tracing/trace_pipe
<...>-9024 [003] ...1 88396.515526: sched_process_fork: comm=ipayGphone:push pid=9024 child_comm=ipayGphone:push child_pid=9080
<...>-9080 [001] ...1 88396.525226: sched_process_exec: filename=/system/bin/sh pid=9080 old_pid=9080
<...>-9080 [001] ...1 88396.527560: sched_process_fork: comm=sh pid=9080 child_comm=sh child_pid=9083
<...>-9083 [003] ...1 88396.528224: sched_process_exec: filename=/system/bin/chmod pid=9083 old_pid=9083
<...>-9080 [001] ...1 88396.528496: sched_process_exit: comm=sh pid=9080 prio=120
<...>-9083 [003] ...1 88396.530442: sched_process_exit: comm=chmod pid=9083 prio=120
<...>-9083 [003] d..3 88396.530478: signal_generate: sig=1 errno=0 code=128 comm=chmod pid=9083 grp=1 res=0
<...>-9083 [003] d..3 88396.530567: signal_generate: sig=1 errno=0 code=128 comm=ipayGphone:push pid=9024 grp=1 res=0
<...>-9083 [003] d..3 88396.530728: signal_generate: sig=1 errno=0 code=128 comm=m.taobao.taobao pid=8889 grp=1 res=0
<...>-9083 [003] d..3 88396.530891: signal_generate: sig=1 errno=0 code=128 comm=id.AlipayGphone pid=8824 grp=1 res=0
<...>-9083 [003] d..3 88396.531072: signal_generate: sig=1 errno=0 code=128 comm=.searchbox:push pid=8752 grp=1 res=0
<...>-9083 [003] d..3 88396.531088: signal_generate: sig=1 errno=0 code=128 comm=om.uc.searchbox pid=8596 grp=1 res=0
<...>-9083 [003] d..3 88396.531146: signal_generate: sig=1 errno=0 code=128 comm=encent.mobileqq pid=7818 grp=1 res=0
<...>-9083 [003] d..3 88396.531201: signal_generate: sig=1 errno=0 code=128 comm=viders.calendar pid=7797 grp=1 res=0
<...>-9083 [003] d..3 88396.531258: signal_generate: sig=1 errno=0 code=128 comm=ndroid.calendar pid=7772 grp=1 res=0
<...>-9083 [003] d..3 88396.531261: signal_generate: sig=1 errno=0 code=128 comm=com.miui.player pid=18654 grp=1 res=0
<...>-9083 [003] d..3 88396.531340: signal_generate: sig=1 errno=0 code=128 comm=id.thememanager pid=18636 grp=1 res=0
<...>-9083 [003] d..3 88396.531343: signal_generate: sig=1 errno=0 code=128 comm=ugreport:remote pid=15473 grp=1 res=0
From the log we can see:
1. 9083 is the chmod process; it sends the signals to all the other processes as it exits.
2. The parent of chmod (9083) is sh (9080); sh (9080) exits first, right after forking chmod (9083).
3. sh (9080) was created by the ipayGphone:push (9024) process of com.eg.android.AlipayGphone, an Alibaba-family app.
This log matches the earlier inference exactly, and the same sequence showed up in repeated tests.
Let's look at the conditions under which kill_orphaned_pgrp() sends SIGHUP:
1. task_pgrp(parent) != pgrp
Because sh (9080), the parent of chmod (9083), has already exited, chmod has been reparented to init (1), so its parent is init.
init's pgrp is init itself.
zygote (like every other native service forked by init) calls setpgid() right after init forks it, which makes it the leader of its own process group:
@system/core/init.cpp
void service_start(struct service *svc, const char *dynamic_args)
{
    ...
    pid_t pid = fork();
    if (pid == 0) {
        ...
        setpgid(0, getpid());
Descendants of zygote, such as chmod, never call setpgid() explicitly, so their pgrp is zygote's.
So this condition holds, but note that chmod's parent, sh, must exit before chmod does; otherwise the condition is not met.
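The inheritance rule behind condition 1 is easy to check in user space. The following is a minimal C sketch using only standard POSIX calls; the function name `child_shares_pgrp` is hypothetical. A forked child stays in its parent's process group unless it calls setpgid() itself, as the service_start() child does.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child and report whether it ends up in the caller's process group.
 * If own_group is non-zero, the child first calls setpgid(0, getpid()),
 * the same call init's service_start() makes for each service.
 * Returns 1 (same group), 0 (different group), or -1 on error. */
int child_shares_pgrp(int own_group)
{
    pid_t parent_pgrp = getpgrp();
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (own_group && setpgid(0, getpid()) != 0)
            _exit(2);
        /* Exit status 0: still in the parent's group; 1: in another group. */
        _exit(getpgrp() == parent_pgrp ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;
    case 1:
        return 0;
    default:
        return -1;
    }
}
```

`child_shares_pgrp(0)` corresponds to an app or chmod forked below zygote, and `child_shares_pgrp(1)` to zygote itself being started by init.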
2. task_session(parent) == task_session(tsk)
On Android, the session of nearly every process is init's, so this condition holds as well.
3. will_become_orphaned_pgrp(pgrp, ignored_task)
static int will_become_orphaned_pgrp(struct pid *pgrp,
				     struct task_struct *ignored_task)
{
	struct task_struct *p;

	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
		if ((p == ignored_task) ||
		    (p->exit_state && thread_group_empty(p)) ||
		    is_global_init(p->real_parent)) // this excludes zygote, because zygote's parent is init
			continue;

		if (task_pgrp(p->real_parent) != pgrp &&
		    task_session(p->real_parent) == task_session(p))
			return 0;
	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);

	return 1;
}
This function walks every process in the given process group (skipping zygote) and returns 0 if it finds one whose parent is in a different process group but in the same session; otherwise it returns 1.
As already mentioned, every descendant of zygote has zygote's pgrp, so the function must return 1 and the condition holds.
4. has_stopped_jobs(pgrp)
This condition checks whether the process group contains a stopped process. When the problem occurred, the foreground app, "God Horse Search", was in the middle of a native exception (NE).
When an NE occurs, the debuggerd process stops the target process and then writes its tombstone.
So this condition is also likely to hold.
It seems that all four conditions can be met: conditions 2 and 3 always hold, while conditions 1 and 4 hold only with some probability.
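The four checks can be summarised as a toy user-space model. This is only a sketch of the logic described above, not kernel code: `struct task`, its fields, and the three functions are hypothetical simplifications, and "is init" is modelled as pid 1.

```c
#include <stddef.h>

/* Toy task: only the fields the four checks look at. */
struct task {
    int pid;
    int pgrp;                  /* process group id */
    int session;               /* session id */
    int stopped;               /* 1 if the task is stopped */
    int exited;                /* 1 if the task has already exited */
    const struct task *parent; /* stands in for real_parent */
};

/* Condition 3: no live member (other than `ignored`) has a parent that is
 * outside the group but inside the session. Children of init are skipped. */
int will_become_orphaned(const struct task *tasks, size_t n, int pgrp,
                         const struct task *ignored)
{
    for (size_t i = 0; i < n; i++) {
        const struct task *p = &tasks[i];

        if (p->pgrp != pgrp || p->parent == NULL)
            continue;
        if (p == ignored || p->exited || p->parent->pid == 1)
            continue;
        if (p->parent->pgrp != pgrp && p->parent->session == p->session)
            return 0;
    }
    return 1;
}

/* Condition 4: at least one member of the group is stopped. */
int has_stopped_jobs(const struct task *tasks, size_t n, int pgrp)
{
    for (size_t i = 0; i < n; i++)
        if (tasks[i].pgrp == pgrp && tasks[i].stopped)
            return 1;
    return 0;
}

/* Returns 1 if, in this model, the exit of `tsk` sends SIGHUP to its group.
 * `tsk` must have a parent. */
int exit_sends_sighup(const struct task *tasks, size_t n,
                      const struct task *tsk)
{
    const struct task *parent = tsk->parent;

    return parent->pgrp != tsk->pgrp &&                     /* condition 1 */
           parent->session == tsk->session &&               /* condition 2 */
           will_become_orphaned(tasks, n, tsk->pgrp, tsk) && /* condition 3 */
           has_stopped_jobs(tasks, n, tsk->pgrp);           /* condition 4 */
}
```

With a table holding init, zygote, one stopped app forked by zygote, and a chmod in zygote's group whose parent is init, `exit_sends_sighup` returns 1 for chmod. Clearing the app's `stopped` flag, or pointing chmod's parent at a task that is still in zygote's group, makes it return 0.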
To raise the reproduction rate, I wrote an app that makes conditions 1 and 4 always hold:
@HelloAndroid.java
public class HelloAndroid extends Activity {
    private static final String LOG_TAG = "PARK";

    static {
        try {
            System.loadLibrary("helloandroid");
        } catch (UnsatisfiedLinkError e) {
            e.printStackTrace();
        }
    }

    public static native void native_function();

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main);
        native_function();
        while (true) {
            try {
                Thread.sleep(1000);
            } catch (Exception e) {
            }
        }
    }
}

@com_xl_helloandroid.cpp
static void native_function()
{
    int child0 = fork();
    if (child0 > 0) {
        int stop_proc = fork();
        if (stop_proc > 0) {
            kill(stop_proc, 19); // create a stopped process to satisfy condition 4
        } else {
            while (true) {
                LOGD("stop_proc loop");
                sleep(1);
            }
        }
    } else {
        int child1 = fork();
        if (child1 > 0) {
            LOGD("parent.. exit"); // the parent exits, so the child is reparented to init, satisfying condition 1
            exit(-1);
        } else {
            LOGD("child");
            sleep(5);
            LOGD("child exit..");
            exit(-1);
        }
    }
}
With this APK built and installed, the problem reproduces on a Nexus 4 running Android 4.4 and on a Nexus 5 running 5.1.1 and 6.0.
However, this APK does not reproduce the problem on 64-bit devices. It turns out that a 64-bit device has two zygotes.
The parent of a 32-bit app is zygote32, and the parent of a 64-bit app is zygote64.
But whether an app is 32-bit or 64-bit, its pgrp is zygote64.
This is because, after zygote receives a socket request from ActivityManagerService and forks the child process, it changes the child's pgrp:
@framework/base/core/java/com/android/internal/os/ZygoteConnection.java
private boolean handleParentProc(int pid, FileDescriptor[] descriptors,
        FileDescriptor pipeFd, Arguments parsedArgs) {
    if (pid > 0) {
        setChildPgid(pid);
    }

    if (descriptors != null) {
        for (FileDescriptor fd : descriptors) {
            IoUtils.closeQuietly(fd);
        }
    }
    ...
}

private void setChildPgid(int pid) {
    // Try to move the new child into the peer's process group.
    try {
        Os.setpgid(pid, Os.getpgid(peer.getPid()));
    } catch (ErrnoException ex) {
        // This exception is expected in the case where
        // the peer is not in our session
        // TODO get rid of this log message in the case where
        // getsid(0) != getsid(peer.getPid())
        Log.i(TAG, "Zygote: setpgid failed. This is "
                + "normal if peer is not in our session");
    }
}
Here peer.getPid() is the PID of system_server.
system_server's pgid is zygote64, so the pgrp of every app is set to zygote64.
Let's go back and look at condition 3:
static int will_become_orphaned_pgrp(struct pid *pgrp,
				     struct task_struct *ignored_task)
{
	struct task_struct *p;

	do_each_pid_task(pgrp, PIDTYPE_PGID, p) {
		if ((p == ignored_task) ||
		    (p->exit_state && thread_group_empty(p)) ||
		    is_global_init(p->real_parent))
			continue;

		if (task_pgrp(p->real_parent) != pgrp &&
		    task_session(p->real_parent) == task_session(p))
			return 0;
	} while_each_pid_task(pgrp, PIDTYPE_PGID, p);

	return 1;
}
When traversing the zygote64 process group, there is bound to be a process for which the following check is true:
task_pgrp(p->real_parent) != pgrp
This is because the parent of a 32-bit app is zygote32, and zygote32's pgrp is itself, not zygote64.
The function therefore returns 0, so on a 64-bit device condition 3 is never satisfied and the system does not restart.
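The 64-bit case can be sketched as a small C model of condition 3 alone. This is an illustration under stated assumptions, not kernel code: `struct member`, the `PGRP_*` values, and `group_will_become_orphaned` are hypothetical, and every task is assumed to share init's session.

```c
#include <stddef.h>

/* Made-up process group ids for the model. */
enum { PGRP_INIT = 1, PGRP_ZYGOTE64 = 100, PGRP_ZYGOTE32 = 101 };

/* One member of the zygote64 process group, reduced to what condition 3
 * reads. All tasks are assumed to be in the same session. */
struct member {
    const char *name;
    int parent_pgrp;     /* pgrp of real_parent */
    int parent_is_init;  /* 1 if real_parent is init */
};

/* Returns 0 as soon as one member has a parent outside the group,
 * 1 if no such member exists (the group would become orphaned). */
int group_will_become_orphaned(const struct member *m, size_t n, int pgrp)
{
    for (size_t i = 0; i < n; i++) {
        if (m[i].parent_is_init)
            continue;            /* zygote64 itself is skipped */
        if (m[i].parent_pgrp != pgrp)
            return 0;            /* e.g. a 32-bit app forked by zygote32 */
    }
    return 1;
}
```

For a group containing zygote64, system_server, and a 64-bit app, the function returns 1. Adding a single 32-bit app, whose parent zygote32 has `PGRP_ZYGOTE32`, makes it return 0, so SIGHUP is never sent.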
With that, every open question was answered. We reported the problem to AOSP, but the reply was that it would not be changed for now.
We also decided not to change it, for the following reasons:
1. The problem is triggered by unusual app behavior.
2. The time window is actually very small: it is the time taken to write the tombstone. Because we had added debugging code, writing the tombstone took longer, which raised the probability of hitting the problem.
3. There is no problem on 64-bit devices.
After you have a problem, think about how to fix it.
Analysis report: zygote restart caused by a Linux kernel vulnerability