Analysis of kernel panic in the Android kernel

Last Update: 2013-12-11 Source: Internet

Author: User


1. What is Oops

"Oops" expresses something unexpected, surprising, or sudden. It is not very serious, as in the Britney Spears song "Oops!... I Did It Again", and it is something of an understatement, sometimes carrying a hint of apology. http://v.youku.com/v_show/id_XMTM0ODgxMDYw.html

In the Linux kernel, an Oops is an unexpected kernel exception. When one occurs, the kernel prints the CPU state at the time of the exception, the faulting instruction address, the data address, the other registers, the function call sequence, and the contents of the stack. Depending on the severity of the exception, it then either kills the offending process or halts the system.

The most typical case is a reference to an invalid address in kernel mode, usually through an uninitialized (wild) pointer or a NULL pointer. This triggers a page fault and eventually leads to an Oops.

Linux is robust enough to cope with many kinds of exceptions. Usually an exception kills only the current process and the system keeps running, but it is then in an unstable state and may fail at any time. Exceptions in interrupt context, and damage to critical system resources, usually hang the kernel so that it no longer responds to any event.

2. Kernel exception levels

2.1 Bug

A Bug is a problem that does not conform to the kernel's normal design but that the kernel can detect and that does not seriously affect system operation, such as sleeping in an atomic context.
For example:

    BUG: scheduling while atomic: insmod/826/0x00000002
    Call Trace:
    [ef12f700] [c00081e0] show_stack+0x3c/0x194 (unreliable)
    [ef12f730] [c0019b2c] __schedule_bug+0x64/0x78
    [ef12f750] [c0350f50] schedule+0x324/0x34c
    [ef12f7a0] [c03515c0] Limit+0x68/0xe4
    [ef12f7e0] [c027938c] primary+0x138/0x1c0
    [ef12f820] [c0275820] primary+0x130/0x3dc
    [ef12f880] [c0275ebc] nand_read+0xac/0xe0
    [ef12f8b0] [c0262d98] part_read+0x5c/0xe4
    [ef12f8c0] [c017bcac] California+0x68/0x254
    [ef12f8f0] [c0170550] California+0x60/0x304
    [ef12f940] [c017088c] California+0x98/0x180
    [ef12f970] [c016e610] California+0x94/0x1ac
    [ef12f990] [c016ee04] California+0x2b0/0x330
    [ef12fa10] [c005144c] generic_file_buffered_write+0x11c/0x8d0
    [ef12fab0] [c0051e48] _ generic_file _ Latency+0x248/0x500
    [ef12fb20] [c0052168] latency+0x68/0x10c
    [ef12fb50] [c007ca80] do_sync_write+0xc4/0x138
    [ef12fc10] [f107c0dc] oops_log 0xdc/0x1e8 [oopslog]
    [ef12fe70] [f3087058] records+0x58/0xa0 [oopslog]
    [ef12fe80] [Signature] records+0x130/0x17dc
    [ef12ff40] [c001_b0] +0x0/0x38
    --- Exception: c01 at 0xff29658
        LR = 0x10031300

2.2 Oops

An Oops occurs when a program hits an exception, such as a data exception caused by dereferencing an invalid pointer, or an instruction fetch exception caused by an out-of-bounds array access. The exception handling mechanism catches the exception and prints the system's key information to the serial port; normally, Oops messages are also recorded in the system logs.

When an Oops occurs, the process is in kernel mode and is likely to be accessing critical system resources and holding locks. When the process is killed because of the Oops, it cannot release the resources it has acquired. Other processes that need those resources then hang, which affects the normal operation of the system. At this point the system is in an unstable state and may crash.
2.3 Panic

When an Oops occurs in interrupt context or in process 0 or process 1, the system hangs completely, because it cannot recover from an exception in an interrupt service routine. This is called a kernel panic. In addition, when the system's panic flag (panic_on_oops) is set, an Oops causes a kernel panic whether it occurs in interrupt context or in process context. After a panic the system no longer schedules processes, so syslogd no longer runs. In this case the Oops messages are printed only to the serial port and are not recorded in the system logs.

Example of kernel panic debugging:

    [242.788019] bluesleep_outgoing_data: tx was sleeping
    [244.012224] ****** host_wake is 1
    [245.234647] Disable_key_during_touch = 0
    [245.237802] huqiao___button->code = 139, state = 1
    [245.414640] Disable_key_during_touch = 0
    [245.417542] huqiao___button->code = 139, state = 0
    [245.821424] ****** host_wake is 0
    [245.823708] bluesleep_hostwake_isr: [I] waking up...
    [245.823713]
    [245.830155] bluesleep_hostwake_task: bluesleep_hostwake_task is called
    [245.838356] Unable to handle kernel NULL pointer dereference at virtual address 00000008
    [245.845678] pgd = c0004000
    [245.848188] [00000008] *pgd=00000000
    [245.851751] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
    [245.857122] Modules linked in:
    [245.860080] CPU: 0    Tainted: G        W    (3.4.0-perf-svn874 #1)
    [245.866444] PC is at sco_connect_cfm+0x380/0x4e8
    [245.871106] LR is at 0xd880
    [245.873800] pc : [<c07446c0>]    lr : [<symbol d880>]    psr: 40000013
    [245.873805] sp : dbe55e78  ip : 00000000  fp : listen
    [245.885246] r10: d8643998  r9 : d8e5b80d  r8 : d8643830
    [245.890529] r7 : dbe54000  r6 : d9e5b600  r5 : cae27c80  r4 : d8643800
    [245.896968] r3 : 00000008  r2 : 00000000  r1 : d7d96016  r0 : 00000000
    [245.903552] Flags: nZcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
    [245.910772] Control: 1037987d  Table: 5a47406a  DAC: 00000015
    [245.916576]
    [245.916579] PC: 0xc0744640:
    [245.920751] 4640  e3310000 1afffffa f57ff04f e320f004 %%e5973000
    [245.928910] 4660  ea%42 %e300332a e%30b3 %%e59f1198

The log above is the stack information printed when the kernel panicked. From the line "[245.866444] PC is at sco_connect_cfm+0x380/0x4e8" we know that the problem lies in the sco_connect_cfm function. In general, the LR (link register) tells us the caller; here sco_connect_cfm is called from hci_proto_connect_cfm.

The message "Unable to handle kernel NULL pointer dereference at virtual address 00000008" means that this function accessed an invalid address. On 32-bit Linux with the usual 3G/1G split, the top 1 GB of the virtual address space (0xC0000000 to 0xFFFFFFFF) is used by the kernel and is called "kernel space", while the lower 3 GB (0x00000000 to 0xBFFFFFFF) is used by the various processes and is called "user space". The fault occurred because the kernel illegally accessed an address in the user-space range.
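The arithmetic behind reading these Oops lines can be sketched in C. This is only an illustration: the helper names (is_kernel_address, parse_symbol_location, function_start) are hypothetical and are not kernel APIs, and the sketch assumes the 32-bit 3G/1G split described above and the symbol+offset/size notation seen in the "PC is at" line.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 32-bit 3G/1G split, so kernel space starts at 0xC0000000. */
#define KERNEL_SPACE_START 0xC0000000u

/* Returns 1 if addr lies in kernel space under the assumed split, else 0. */
static int is_kernel_address(uint32_t addr)
{
    return addr >= KERNEL_SPACE_START;
}

/*
 * Parse a location such as "sco_connect_cfm+0x380/0x4e8".
 * The number after '+' is the offset of the PC inside the function;
 * the number after '/' is the total size of the function.
 * Returns 0 on success, -1 if the string is not in that form.
 */
static int parse_symbol_location(const char *loc, char *name, size_t name_len,
                                 uint32_t *offset, uint32_t *size)
{
    const char *plus;
    unsigned int off, sz;
    size_t n;

    assert(loc && name && offset && size);

    plus = strrchr(loc, '+');
    if (plus == NULL)
        return -1;

    n = (size_t)(plus - loc);
    if (n == 0 || n >= name_len)
        return -1;

    if (sscanf(plus, "+0x%x/0x%x", &off, &sz) != 2)
        return -1;

    memcpy(name, loc, n);
    name[n] = '\0';
    *offset = off;
    *size = sz;
    return 0;
}

/* The function's start address is the PC minus the offset into it. */
static uint32_t function_start(uint32_t pc, uint32_t offset)
{
    return pc - offset;
}
```

With the values from the log, parsing "sco_connect_cfm+0x380/0x4e8" gives an offset of 0x380 and a size of 0x4e8, so function_start(0xc07446c0, 0x380) is 0xc0744340. The PC 0xc07446c0 is a kernel-space address, while the faulting address 0x00000008 falls in the user-space range.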
A kernel panic is generally difficult to reproduce, so I decided to modify code in the kernel to simulate the problem. The original function is:

    static inline void hci_proto_connect_cfm(struct hci_conn *conn, __u8 status)
    {
        register struct hci_proto *hp;

        hp = hci_proto[HCI_PROTO_L2CAP];
        if (hp && hp->connect_cfm)
            hp->connect_cfm(conn, status);

        hp = hci_proto[HCI_PROTO_SCO];
        if (hp && hp->connect_cfm)
            hp->connect_cfm(conn, status);

        if (conn->connect_cfm_cb)
            conn->connect_cfm_cb(conn, status);
    }

When I change the function to:

    static inline void hci_proto_connect_cfm(struct hci_conn *conn, __u8 status)
    {
        register struct hci_proto *hp;

        hp = hci_proto[HCI_PROTO_L2CAP];
        if (hp && hp->connect_cfm)
            hp->connect_cfm(conn, status);

        conn = NULL-21; /* simulate the phenomenon */

        hp = hci_proto[HCI_PROTO_SCO];
        if (hp && hp->connect_cfm)
            hp->connect_cfm(conn, status);

        if (conn->connect_cfm_cb)
            conn->connect_cfm_cb(conn, status);
    }

the problem reproduces exactly.

From the definition of the hci_conn struct we know that the address of hcon->type is 00000008. So in the original failure, by the time sco_connect_cfm was called, the incoming conn pointer must have become NULL-21. Yet the preceding call to hp->connect_cfm(conn, status) ran without any problem, and the same conn is passed to the next hp->connect_cfm(conn, status) with nothing changing it in between. This puzzled me: why did the address suddenly become invalid? After searching online, I learned that it could be a hardware problem that causes a transient error at a specific address. With that, the probable cause was found and the analysis of the bug was complete.
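How a small fault address such as 00000008 can correspond to a struct member reached through a bad base pointer can be sketched with offsetof. The struct below is a hypothetical layout invented for illustration; it is not the real struct hci_conn, and the helper name member_fault_address is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * HYPOTHETICAL layout for illustration only; this is not the real
 * struct hci_conn. Two 32-bit fields place `type` at offset 8.
 */
struct demo_conn {
    uint32_t link_a;  /* offset 0 */
    uint32_t link_b;  /* offset 4 */
    uint8_t  type;    /* offset 8 */
};

/*
 * Address the CPU would access when reading base->type.
 * Computed arithmetically, without dereferencing the pointer.
 */
static uintptr_t member_fault_address(uintptr_t base)
{
    return base + offsetof(struct demo_conn, type);
}
```

In this sketch, a base pointer of zero gives member_fault_address(0) equal to 0x8, which is how a member read through a null struct pointer shows up in an Oops as a small nonzero virtual address rather than as address 0.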

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
