Android Exception Analysis (Reposted)

Source: Internet
Author: User
Tags throwable

What is an exception?

An exception is an unexpected problem in a program. Because it was not anticipated, it may fall outside the range the original logic handles and beyond the control of the code, so the software can show all sorts of strange behavior. Common symptoms on Android include an application not responding, an application stopping, a frozen screen, a reboot, and a system hang. The system has a unified mechanism for handling these exceptions: it performs the corresponding operation, and the user then sees the corresponding symptom. Unexpected display, operation, and runtime problems can also be counted as exceptions. In those cases the system itself is working normally and the fault is a logic defect introduced by people, but such defects make up a large share of all abnormal behavior and directly reflect software quality.

Architecture determines logic, and logic determines how many exceptions occur

Why exceptions matter

People say that iOS is better than Android and that the iPhone is better than Android phones. Why? The most basic reason is that iOS does well on stability and user experience: exceptions are rare, the system still runs robustly after a long period of use, and the interface, operation, and speed are all very good, so it is widely recognized.

Exceptions determine the stability of a piece of software

Defects determine the performance and user experience of a piece of software

To build a polished product, software developers pursue zero exceptions and zero defects. For the applications and modules we are responsible for, quality shows in the number of exceptions and defects. This document is mainly about log analysis, which is after-the-fact work: it deals with user complaints and dissatisfaction, and with the mines we planted during development and never cleared, so it is passive. What matters more is the development work before mass production: reducing exceptions and defects to ensure software quality.

(Company strategy and requirements for the R&D department)

Exception classification

Android is a large, complex system that involves several languages, so its exceptions are complex too. Following the layers of the Android system architecture, exceptions are divided into JE, NE, KE, EE, and Other.

• JE (Java layer exception) generally occurs at the application and framework layers and is usually caused by Java or XML code. Examples are the various RuntimeExceptions, ANR (Application Not Responding), and SWT (Software Watchdog Timeout).

• NE (Native layer exception) occurs in Linux user space and is usually caused by C/C++ code and library files, for example when the kernel delivers a signal such as SIGILL, SIGABRT, or SIGBUS.

• KE (Kernel layer exception) usually means a kernel failure or kernel error. Because the error happens in kernel mode it is very serious and often leads to a reboot, a hang, or failure to boot.

• EE (External (Modem) exception): as the name suggests, the modem is a special, independent part. It has its own memory space and code and provides mobile communication services. When it fails you need the mdlog, which is parsed with the Aee-logvie tool; parsing requires the data file of the matching version. For usage, see Gat_user_guide (Customer).pdf.

• Other: exceptions that fit none of the categories above, such as those caused by hardware.

Android System architecture diagram

Reproducing exceptions and capturing logs

One key to solving an exception is reproducing it. If you can find a path that always triggers an intermittent exception, the problem becomes much easier. To solve an exception you must first understand it: how it happens and under what conditions. The points below need attention when reproducing one.

Notes on reproducing exceptions:

• Read the description carefully to determine the steps, the probability, and the preconditions, and make a preliminary judgment about the exception type.

• Before reproducing, confirm whether logging is on. For an intermittent problem, be sure to enable the logs that this exception type requires.

• Reproduce the exception by following the description. For an intermittent problem, pay attention to the conditions and try to find a path that always triggers it.

• If the exception does not reproduce, talk to the person who reported it and try again.

An exception reported by a customer may turn out to be normal behavior.

Capturing logs

The key to solving an exception is capturing a valid log. For example, an ANR requires the bugreport or the traces.txt file, an NE requires aee_exp, and an EE requires mdlog. Capture the logs that match the exception type so that the analysis is targeted. The points below need attention when capturing logs.

A wrong log leads the analysis astray, and failing to capture the right log when an exception occurs can waste the chance to fix it.

Notes on capturing logs:

• Before capturing, delete the old log files from the SD card and internal storage so that unnecessary logs do not complicate the analysis.

• Set up the preconditions of the exception. For exceptions that need comparison runs, make sure the preconditions are identical before each capture.

• Enable the logs required for the exception type. Mtklog is needed for every exception; for reboots and hangs, capture as many logs as possible.

• While capturing, note the time shown on the phone when the exception appears and, if necessary, include it with the description of the exception in the log.

Exception analysis: ANR

ANR types

• Key Dispatch Timeout (8s)

No response to an input event (e.g. key press, screen touch) within 8 seconds

• Broadcast Timeout

A BroadcastReceiver hasn't finished executing within the configured number of seconds

BROADCAST_FG_TIMEOUT: 10s

BROADCAST_BG_TIMEOUT: 60s

• Service Timeout (20s)

A requested service hasn't finished within 20 seconds

An ANR is raised when an event such as a key press or a broadcast gets no response within a set time. The time is configured in the system and may differ between platforms; the values above are the defaults on the KK platform. They are generally defined in ActivityManagerService.java, for example:

static final int KEY_DISPATCHING_TIMEOUT = 5*1000;

Causes of ANR

Each application process has a main thread and a main message queue (main thread == ActivityThread).

• The main thread handles UI events such as drawing, listening, and receiving.

• The main thread takes messages from the message queue and dispatches them.

• The main thread does not take the next message from the queue until it has finished processing the current one.

If the main thread gets stuck on the current message and cannot dispatch the next one in time, an ANR occurs.

How to avoid ANR

• Keep the UI thread to UI-related work as far as possible.

• Move time-consuming work (database operations, I/O, network connections, or anything else that might block the UI thread) to a separate thread.

• Use a Handler for interaction between the UI thread and other threads.
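The split between UI work and slow work can be sketched in plain Java. This is only an illustration under stated assumptions: the class OffloadSketch, the executors UI and WORKER, and the methods slowLoad() and loadThenShow() are hypothetical names, and a single-thread executor stands in for the Android main thread and its Handler. It is not Android API code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Plain-Java stand-in for the UI-thread / worker-thread split (not Android API). */
class OffloadSketch {
    // Single thread that plays the role of the UI thread's message loop (assumption).
    static final ExecutorService UI = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "ui-sim");
        t.setDaemon(true);
        return t;
    });

    // Background pool for time-consuming work.
    static final ExecutorService WORKER = Executors.newFixedThreadPool(2, r -> {
        Thread t = new Thread(r, "worker-sim");
        t.setDaemon(true);
        return t;
    });

    /** Simulated slow operation (database, file, or network access). */
    static String slowLoad() {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "loaded on " + Thread.currentThread().getName();
    }

    /** Runs slowLoad() off the UI thread, then hands the result back to it. */
    static CompletableFuture<String> loadThenShow() {
        return CompletableFuture
                .supplyAsync(OffloadSketch::slowLoad, WORKER)
                .thenApplyAsync(
                        result -> result + ", shown on " + Thread.currentThread().getName(),
                        UI);
    }

    public static void main(String[] args) {
        System.out.println(loadThenShow().join());
    }
}
```

The simulated UI thread is only occupied for the short "show" step, so it stays free to dispatch other messages while the slow load runs elsewhere.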

Logs required for ANR analysis

• Mtklog, mainly aee_exp and mobilelog.

• The traces.txt file (in the /data/anr directory) or a bugreport (captured with adb bugreport > bugreport.txt or with the GAT tool).

ANR analysis process

ANRs come in many types and have many triggers, and the log has no obvious keyword, like FATAL for a RuntimeException, that pinpoints the problem. ANR analysis is therefore more troublesome, but with a complete log and a methodical approach it still goes quickly. The flowchart below is MTK's ANR analysis flow, which narrows down the cause step by step from the ANR trigger type.

MTK ANR analysis flowchart

• First, check whether the log contains ANR information

events_log

00:28:19.999 544 564 I am_anr: [0,3003,com.example.test, 11058758,keyDispatchingTimedOut]

main_log or sys_log

00:28:31.193 544 564 E ANRManager: ANR in com.example.test (com.example.test/.MainActivity)

traces.txt

----- pid 3003 at 2013-06-01 00:28:20 -----
Cmd line: com.example.test
JNI: CheckJNI is off; workarounds are off; pins=0; globals=147
DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)
"main" prio=5 tid=1 SUSPENDED
  | group="main" sCount=1 dsCount=0 obj=0x40d5ea18 self=0x40d4e0d8
  | sysTid=3003 nice=0 sched=0/0 cgrp=apps handle=1074645084
  | state=S schedstat=( 16757266877 27764681051 104147 ) utm=1184 stm=491 core=0
  #00 pc 0002746c /system/lib/libc.so (__futex_syscall3+8)
  #01 pc 0000f694 /system/lib/libc.so (__pthread_cond_timedwait_relative+48)
  .........
  #12 pc 00020580 [stack]
  at libcore.io.Posix.strerror(Native Method)
  at libcore.io.ForwardingOs.strerror(ForwardingOs.java:128)
  at libcore.io.ErrnoException.getMessage(ErrnoException.java:52)
  at java.lang.Throwable.getLocalizedMessage(Throwable.java:187)
  at java.lang.Throwable.toString(Throwable.java:361)
  at java.lang.Throwable.printStackTrace(Throwable.java:321)
  at java.lang.Throwable.printStackTrace(Throwable.java:355)
  at java.lang.Throwable.printStackTrace(Throwable.java:288)
  at java.lang.Throwable.printStackTrace(Throwable.java:236)
  at com.example.test.MainActivity.monitorANR(MainActivity.java:200)
  at com.example.test.MainActivity$1.handleMessage(MainActivity.java:38)
  at android.os.Handler.dispatchMessage(Handler.java:107)
  at android.os.Looper.loop(Looper.java:194)

• If that does not pinpoint the problem, look at the CPU usage

main_log

06-01 00:28:31.193 544 564 E ANRManager: ANR in com.example.test (com.example.test/.MainActivity)
06-01 00:28:31.193 544 564 E ANRManager: Reason: keyDispatchingTimedOut
06-01 00:28:31.193 544 564 E ANRManager: Load: 10.5 / 11.94 / 6.06
06-01 00:28:31.193 544 564 E ANRManager: Android time: [2013-06-01 00:28:31.176] [454.712]
06-01 00:28:31.193 544 564 E ANRManager: CPU usage from 0ms to 11736ms later:
06-01 00:28:31.193 544 564 E ANRManager: 34% 3003/com.example.test: 26% user + 8.4% kernel / faults: 708 minor 10 major
06-01 00:28:31.193 544 564 E ANRManager: 32% 3018/logcat: 10% user + 21% kernel / faults: 4143 minor
06-01 00:28:31.193 544 564 E ANRManager: 23% 379/mobile_log_d: 8.7% user + 14% kernel / faults: 10 minor 1 major
06-01 00:28:31.193 544 564 E ANRManager: 19% 171/adbd: 1.7% user + 17% kernel / faults: 423 minor
06-01 00:28:31.193 544 564 E ANRManager: 18% 544/system_server: 8.5% user + 9.8% kernel / faults: 899 minor 2 major
06-01 00:28:31.193 544 564 E ANRManager: 14% 132/mobile_log_d: 1.7% user + 13% kernel
......
06-01 00:28:31.193 544 564 E ANRManager: 96% TOTAL: 36% user + 60% kernel + 0% iowait

The CPU usage shows the following:

If CPU usage is close to 100%, the device is busy (out of memory, stuck in a loop, and so on).

If CPU usage is low, the main thread is blocked (for example, an operation in the activity takes longer than 5 seconds).

If iowait is high, the ANR was probably caused by I/O on the main thread (database, file, or network operations).

• Based on the CPU usage, look for useful information in main_log and events_log

main_log

• Combine the log with the code to find the cause

To make the ANR appear in this example, onClick runs a while (true) loop that continuously reads and writes files and keeps printing the resulting errors (without writing them to the log).
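The first two checks in this process, finding the am_anr event and reading the CPU summary line, can be sketched as a small parser. This is a rough illustration, not an MTK tool: the class AnrLogSketch, its method names, and the numeric thresholds in classifyCpu() are assumptions chosen for the example, and the field order of am_anr (user, pid, package, flags, reason) is taken from the sample lines above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that pulls the key facts out of ANR log lines. */
class AnrLogSketch {
    // am_anr payload: [user,pid,package,flags,reason]; the closing bracket is optional.
    private static final Pattern AM_ANR = Pattern.compile(
            "am_anr\\s*:\\s*\\[(\\d+),(\\d+),([^,]+),\\s*(\\d+),(.*?)\\]?\\s*$");

    // Summary line, e.g. "96% TOTAL: 36% user + 60% kernel + 0% iowait".
    private static final Pattern CPU_TOTAL = Pattern.compile(
            "([\\d.]+)% TOTAL:\\s*([\\d.]+)% user \\+ ([\\d.]+)% kernel"
                    + "(?: \\+ ([\\d.]+)% iowait)?",
            Pattern.CASE_INSENSITIVE);

    /** Returns {pid, package, reason}, or null if the line is not an am_anr event. */
    static String[] parseAmAnr(String line) {
        Matcher m = AM_ANR.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(2), m.group(3).trim(), m.group(5).trim()};
    }

    /** Applies the three rules of thumb to the TOTAL line. Thresholds are assumptions. */
    static String classifyCpu(String line) {
        Matcher m = CPU_TOTAL.matcher(line);
        if (!m.find()) {
            return "unknown";
        }
        double total = Double.parseDouble(m.group(1));
        double iowait = m.group(4) == null ? 0 : Double.parseDouble(m.group(4));
        if (iowait >= 20) {
            return "io-bound";              // high iowait: I/O on the main thread
        }
        if (total >= 90) {
            return "device-busy";           // CPU close to 100%
        }
        if (total <= 30) {
            return "main-thread-blocked";   // low CPU: the main thread is waiting
        }
        return "inconclusive";
    }

    public static void main(String[] args) {
        String[] anr = parseAmAnr("00:28:19.999 544 564 I am_anr: "
                + "[0,3003,com.example.test, 11058758,keyDispatchingTimedOut]");
        System.out.println(anr[0] + " " + anr[1] + " " + anr[2]);
        System.out.println(classifyCpu(
                "E ANRManager: 96% TOTAL: 36% user + 60% kernel + 0% iowait"));
    }
}
```

For the sample log, the pid from am_anr tells you which block of traces.txt to read, and the CPU summary falls into the "device busy" case, which matches the busy loop in onClick.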

ANR caused by waiting for a lock

• events_log

22:05:22.819934 732 755 I am_anr: [0,24992,com.example.test,8961606,Input dispatching timed out (Waiting because the touched window has not finished processing the input events that were previously delivered to it.)]

• main_log or sys_log

01-01 22:05:22.857387 732 755 E ANRManager: ANR in com.example.test (com.example.test/.MainActivity)
01-01 22:05:22.857387 732 755 E ANRManager: Reason: Input dispatching timed out (Waiting because the touched window has not finished processing the input events that were previously delivered to it.)

• traces.txt

----- pid 29364 at 2014-01-01 22:05:22 -----
Cmd line: com.example.test
JNI: CheckJNI is off; workarounds are off; pins=0; globals=263
DALVIK THREADS:
(mutexes: tll=0 tsl=0 tscl=0 ghl=0)
"main" prio=5 tid=1 MONITOR
  | group="main" sCount=1 dsCount=0 obj=0x419cede0 self=0x419bd8a8
  | sysTid=29364 nice=0 sched=0/0 cgrp=apps handle=1074139524
  | state=S schedstat=( 265882702 297191749 665 ) utm=19 stm=7 core=0
  at com.example.test.MainActivity$ANRBroadcast.onReceive(MainActivity.java:~120)
  - waiting to lock <0x41edc968> (a java.lang.Object) held by tid=11 (Thread-720)
  at android.app.LoadedApk$ReceiverDispatcher$Args.run(LoadedApk.java:798)
  at android.os.Handler.handleCallback(Handler.java:808)

• Corresponding code
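The application code for this case is not reproduced here. As a hypothetical plain-Java illustration of how a thread ends up waiting on a monitor held by another thread (the MONITOR state in a Dalvik trace corresponds to Thread.State.BLOCKED), consider the sketch below. The class LockWaitSketch, the LOCK object, and the thread name "main-sim" are all assumptions made for the example.

```java
/** Hypothetical reproduction of a thread blocked on a lock held elsewhere. */
class LockWaitSketch {
    private static final Object LOCK = new Object();

    /** Returns the state of a thread that needs LOCK while another thread holds it. */
    static Thread.State observeWaiter() {
        synchronized (LOCK) {                 // this thread plays the slow lock holder
            Thread waiter = new Thread(() -> {
                synchronized (LOCK) {         // stands in for onReceive() on the main thread
                    // the broadcast would be handled here
                }
            }, "main-sim");
            waiter.setDaemon(true);
            waiter.start();
            try {
                // Poll until the waiter is parked on the monitor (bounded to about 2 s).
                for (int i = 0; i < 200 && waiter.getState() != Thread.State.BLOCKED; i++) {
                    Thread.sleep(10);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return waiter.getState();
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state while lock is held: " + observeWaiter());
    }
}
```

If the holder keeps the lock longer than the input dispatch timeout, the blocked main thread cannot process input and the system reports an ANR. The usual remedy is to keep work done under a shared lock short, or not to take that lock on the main thread at all.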

Exception analysis: OOM

Causes of OOM

Android's application memory management is built on Java's memory management, so the causes of OOM are similar. All objects are allocated on the heap, and the heap has a size limit. When allocated objects cannot be reclaimed and keep occupying heap space, a new allocation cannot get enough space and an OOM occurs. Why does this happen? It is a limitation of the GC: it can only reclaim objects that are unreachable in its own records (the reference tree), and it treats every reachable object as useful and does not reclaim it. But a reachable object is not necessarily useful. It may be an object that has been abandoned (a dead, redundant, or zombie object), yet the GC cannot reclaim it and it keeps occupying the process heap. Below is an object instantiation diagram found online.

Common OOM scenarios

• Resource objects that are not released, such as Cursor and Bitmap.

The usual way to close a cursor:

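// Query the downloads; the cursor must be closed when we are done with it.
Cursor cursor = mDownloadManager.query(new Query());
try {
    if (cursor.moveToFirst()) {
        do {
            int index = cursor.getColumnIndex(DownloadManager.COLUMN_ID);
            long downloadId = cursor.getLong(index);
            ids.add(downloadId);
        } while (cursor.moveToNext());
    }
} finally {
    // Runs even if an exception is thrown above, so the cursor is not leaked.
    cursor.close();
}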

Also, when a cursor is used in an adapter, the old cursor must be closed when the cursor changes. Usually, though, we use the CursorAdapter provided by Android, whose changeCursor function closes the original cursor and replaces it with the new one, so there is no need to worry about the original cursor being left open.

• Registrations without a matching unregister call, such as the various listeners.

• Lifecycle problems that prevent objects from being reclaimed, such as static references and threads.

• Other

Every OOM scenario ultimately comes down to objects not being reclaimed: a Cursor without close(), a Bitmap without recycle(), a listener without unregister...(), and so on. Because the release call is missing, the GC considers these objects reachable and still in use, so objects that should be reclaimed are not, and the result is OOM.

Most release methods, such as close(), recycle(), and unregister...(), essentially drop the references to objects that are no longer used (set them to null) so that the GC can reclaim the space they occupied. When programming, pay close attention to the lifecycle of global variables, especially containers and objects declared static: when they are no longer needed, set them to null promptly or call the appropriate release method.

OOM log analysis

After an OOM occurs, mtklog alone only tells you that an OOM happened, not how, so an OOM analysis tool is usually needed. The following uses the MAT tool as an example.

In Eclipse, monitor the process you want to analyze. When, under some pattern of operation, you find that the process memory keeps rising, capture an hprof file:

The pattern here means that repeating a certain operation eventually causes an OOM. Operations that often cause OOM include switching screens, scrolling a list back and forth, and tapping a button repeatedly. These operations keep refreshing the UI and creating objects, the objects take up more and more heap space, and eventually an OOM occurs.

An hprof file dumped from DDMS must be converted with hprof-conv from the SDK (under sdk/tools) before MAT can open it:

hprof-conv xxx.hprof d:/xxxold.hprof

Then open it with MAT.

1. Cache leak

After plugging and unplugging the headset many times, memory is found to keep rising:

Click Details to go to the following page:

Click Path to GC Roots:

This leads to a static variable, sAnimators, which is the suspect. Look at the code, add some logging, then build and debug:

Log.d("CWW", "sAnimators.size() = " + sAnimators.size());

As you can see, sAnimators.size() keeps growing after each headset plug or unplug.

Fix: keep the cache from growing too large. You can set an upper limit, or clear it periodically.

For memory-sensitive applications, besides setting an upper limit you can also use SoftReference so that the cache can be reclaimed when memory is tight; this is a defensive programming technique. When using SoftReference, handle the null case, because the referenced object may already have been reclaimed and get() then returns null.
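The two suggestions, an upper limit plus SoftReference values with a null check, can be combined in a small cache. This is a minimal sketch under stated assumptions, not the fix that was applied to sAnimators: the class BoundedSoftCache and its methods are hypothetical names, and least-recently-used eviction is one possible policy among several.

```java
import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache with an entry limit and softly referenced values. */
class BoundedSoftCache<K, V> {
    private final Map<K, SoftReference<V>> map;

    BoundedSoftCache(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first.
        map = new LinkedHashMap<K, SoftReference<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, SoftReference<V>> eldest) {
                return size() > maxEntries;   // evict once the limit is exceeded
            }
        };
    }

    synchronized void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if absent or already reclaimed by the GC. */
    synchronized V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key);                  // drop the stale entry
        }
        return value;
    }

    synchronized int size() {
        return map.size();
    }

    public static void main(String[] args) {
        BoundedSoftCache<String, String> cache = new BoundedSoftCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3");                  // evicts "a", the eldest entry
        System.out.println(cache.size() + " entries, a=" + cache.get("a"));
    }
}
```

Callers must treat a null from get() as a cache miss and rebuild the value, which is the null handling the paragraph above warns about.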

2. Leak caused by a thread that is not released

Play music in the background and switch themes repeatedly; eventually the launcher hits OOM.

For example, five AppsCustomizePagedView instances have clearly leaked:

Select one instance and click Path to GC Roots:

It is held by the MyTimerTask in CircleProgress.java:

Then review the code, modify it, and debug; memory returns to normal:

This fixed the memory leak but introduced a new functional problem, so it had to be modified again later. Be careful when fixing this kind of problem: confirm that the object's lifecycle has really ended before releasing it.
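The general shape of this kind of leak and its fix can be sketched in plain Java. The code below is not the CircleProgress.java change, which is not shown in this article; the class ProgressTickerSketch and its methods are hypothetical. The point it illustrates is that a running timer thread keeps its TimerTask alive, and an anonymous TimerTask keeps its enclosing object alive, until the timer is cancelled.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical view-like object that owns a periodic timer. */
class ProgressTickerSketch {
    private Timer timer;
    final AtomicInteger ticks = new AtomicInteger();

    /** Starts the periodic task; call when the owner becomes active. */
    synchronized void start() {
        if (timer != null) {
            return;                            // already running
        }
        timer = new Timer("progress-timer", true);
        // The anonymous task references the enclosing object through 'ticks'.
        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                ticks.incrementAndGet();
            }
        }, 0, 50);
    }

    /** Cancels the timer; call only once the owner's lifecycle has really ended. */
    synchronized void stop() {
        if (timer == null) {
            return;
        }
        timer.cancel();                        // discards scheduled tasks
        timer = null;
    }

    synchronized boolean isRunning() {
        return timer != null;
    }

    public static void main(String[] args) {
        ProgressTickerSketch ticker = new ProgressTickerSketch();
        ticker.start();
        System.out.println("running: " + ticker.isRunning());
        ticker.stop();
        System.out.println("running: " + ticker.isRunning());
    }
}
```

Calling stop() too early is exactly the kind of change that can fix the leak while breaking a feature, so the call belongs at the point where the owner is known to be finished.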

Conclusion:

Most of these problems are hard to solve. They tend to be random and hard to reproduce, so finding the pattern is very important.

In addition, memory leaks are not easy to detect. Some minor leaks only show up after a month of use. To check whether your own module has an OOM, you can leave a monkey test running; sometimes it will trigger one.

OOM also brings performance problems to mind: many of them are caused by UI refreshes, object lifecycles, redundant operations, unnecessary threads, and so on.

Exception analysis: SWT

Understanding SWT

SWT stands for Software Watchdog Timeout, the timeout of the Android (application layer) watchdog. The WDT (Watchdog Timeout) we usually talk about is HWT, the hardware watchdog timeout. The application layer watchdog is implemented mainly in frameworks/base/services/java/com/android/server/Watchdog.java. Reading that class shows how it works; the main logic is:

1. Watchdog is a singleton that monitors several important system services, such as MountService, ActivityManagerService, and InputManagerService. When these services start, they add themselves to the watchdog's monitor list by calling Watchdog.getInstance().addMonitor(this).

2. The ServerThread thread of SystemServer initializes the watchdog and starts it.

3. The watchdog thread sends a MONITOR message to the ServerThread thread and sets the mCompleted flag to false.

4. The watchdog thread then sleeps for 60 seconds (not counting time the system spends asleep). If the mCompleted flag is still not true afterwards, a watchdog timeout is considered to have occurred and Android restarts.

5. When ServerThread receives the message, it calls the monitor() function of each service object in turn and sets the mCompleted flag to true when they have all finished.
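The handshake in steps 3 to 5 can be sketched in plain Java. This is a simplified illustration, not the code in Watchdog.java: the class WatchdogSketch and its methods are hypothetical, a single-thread executor stands in for ServerThread, a CountDownLatch plays the role of the mCompleted flag, and the method returns as soon as the check completes instead of always sleeping for the full interval.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the watchdog / monitored-thread handshake. */
class WatchdogSketch {
    /**
     * Posts a check to the monitored thread and waits up to timeoutMs for it to run.
     * Returns true if the check completed in time, false on a watchdog timeout.
     */
    static boolean checkOnce(ExecutorService monitored, Runnable monitor, long timeoutMs) {
        CountDownLatch completed = new CountDownLatch(1);   // role of mCompleted
        monitored.execute(() -> {
            monitor.run();              // e.g. take and release a service lock
            completed.countDown();      // mCompleted = true
        });
        try {
            return completed.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Creates the single thread that stands in for ServerThread. */
    static ExecutorService newMonitoredThread() {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "server-thread-sim");
            t.setDaemon(true);
            return t;
        });
    }

    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService healthy = newMonitoredThread();
        System.out.println("healthy: " + checkOnce(healthy, () -> { }, 2000));

        ExecutorService stuck = newMonitoredThread();
        stuck.execute(() -> sleepQuietly(1500));    // the thread is busy with earlier work
        System.out.println("stuck:   " + checkOnce(stuck, () -> { }, 100));
    }
}
```

The second call times out because the check message sits in the queue behind the long-running task, which is the same reason a blocked ServerThread never gets to run the monitor() calls.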

SWT log analysis

SWT is also a kind of ANR. An ordinary ANR means the main thread of an application has not finished something within a period of time; an SWT means the ServerThread thread of the system_server process has not finished something within a period of time. The analysis method is therefore the same as for ANR, but the symptom differs: when an SWT occurs, the phone restarts.

Analysis method:

1. Search the events log for the keyword watchdog and note the time.

2. Then analyze why the monitor() calls of all the service objects did not finish within the minute before that time. The main log files to check are sys_log and mtklog\aee_exp\db.fatal.00.swt\db.fatal.00.swt.dbg.dec.

3. From here on the analysis is the same as for ANR.

Reboots and hangs

Reboot

By exception class, most reboots are related to NE, KE, or hardware problems. Reboots and hangs caused by JE mostly involve system processes, for example a crash, SWT, or JVM error in system_process. Ordinary applications generally do not cause a reboot or hang, but occasionally they do.

On the 72 platform, sending an SMS whose content is '==' causes a reboot.

Although MMS triggers it, system_process eventually dies, and that causes the reboot.

Reboot analysis steps (JE):

1. Confirm the exception type (run QAAT for a preliminary judgment; for NE or KE, ask the driver team for help).

2. Find the place where the error first occurred. Later errors are mostly consequences of the first one and are not worth analyzing.

3. Analyze the log with the tool that matches the JE type.

Hang

A hang here means a frozen screen: the device stays on one screen and does not respond. Hangs are rare, and most are not application layer problems. The possible causes and the information needed for analysis are briefly listed below.

Possible causes of a hang:

1. Input system or input driver problem

2. System logic problem or blocking

3. SurfaceFlinger problem

4. Display system or LCM driver problem

Related information and logs to capture:

1. Confirm that adb is available

2. Capture a bugreport: adb bugreport > D:/bugreport.txt

3. Capture dumpstate information: adb shell dumpstate > D:/dumpstate.txt

4. Capture CPU information: adb shell top -t -m 5 > D:/cpu.txt

5. Check whether a call can be made: adb shell am start -a android.intent.action.CALL tel:10086 (see whether the screen updates)

6. Check key and touch-screen event reporting: adb shell getevent

7. Capture SurfaceFlinger process information: first find the PID with adb shell ps -p, then run adb shell rtt -f bt -p pid > rtt.txt

Log-related tools

MTK provides a variety of tools for capturing and viewing logs, such as MTKLogger, GAT, Catcher, LogView, and QAAT, which are described in the document "Mediatek_logging_sop.pdf".

MTKLogger:

MTKLogger is an APK for capturing logs. It integrates ModemLog, MobileLog, NetworkLog, and SystemLogger, and the relevant logs can be turned on from engineering mode.

GAT:

A GUI tool developed on top of the SDK debugging tools, adding Log Recorder, Debug Configuration Setting, DBPuller, ADB Command, Process Information View, Profiling Tools, LogView, and Plug-in Script. It is a powerful tool for debugging and capturing logs. For usage, see the document "Gat_user_guide (Customer).pdf".

Tool path (using version W1444 as an example):

\\192.168.1.75\rd\MTK_TOOL\AndroidTool\W1444\W1444_full.zip\Debugging Tools (Binary) \gat

Catcher:

A PC-side tool for capturing and parsing the modem log; we often use it to view the modem log. For usage, see the document "Catcher_user_manual_for_customer.pdf".

Tool path:

\\192.168.1.75\rd\MTK_TOOL\AndroidTool\W1444\W1444_full.zip\Catcher

LogView:

It can view aplog, taglog, and mtklog, but it is most often used to view the logs in an AEE db file when analyzing an NE. For usage, see Gat_user_guide (Customer).pdf.

Tool path:

This tool has been integrated into GAT.

QAAT:

A tool for quick log analysis. It covers a wide range of errors and is useful in many situations. It works by filtering on keywords to pull out the various types of errors, which makes it very convenient. For usage, see "Mediatek_logging_sop.pdf".

Tool path:

See the attachment, or \\192.168.1.75\rd\MTK_TOOL\AndroidTool
