Introduction and implementation of Android hardware acceleration

Source: Internet
Author: User

Overview

In mobile client development, especially for Android applications, we often come across the term "hardware acceleration". Because the operating system encapsulates the underlying hardware and software so well, upper-layer developers usually know little about how hardware acceleration actually works and see little reason to learn it. Misunderstandings are therefore common, for example that hardware acceleration speeds up page rendering through some special algorithm, or that it does so by using hardware to raise the CPU/GPU computation speed.
This article introduces hardware acceleration from the underlying hardware principles up to the upper-layer code. The upper-layer implementation described here is based on Android 6.0.

The significance of hardware acceleration for app development

For app developers, a basic understanding of the principles of hardware acceleration and of the upper-layer API implementation makes it possible to take full advantage of hardware acceleration and improve page performance. On Android, for example, a rounded-rectangle button is typically implemented in one of two ways: with a PNG image, or in code (XML/Java). A simple comparison of the two approaches is as follows:

Page rendering background knowledge

    • When a page is rendered, the elements being drawn are eventually converted into a matrix of pixels (a multidimensional array, similar to Bitmap in Android) before being shown on the display.
    • A page consists of a variety of basic elements, such as circles, rounded rectangles, line segments, text, vector graphics (commonly built from Bézier curves), and bitmaps.
    • Drawing these elements, especially during animation, often involves interpolation, scaling, rotation, transparency changes, animation transitions, frosted-glass blur, and even 3D transformations, physical motion (such as the parabolic motion common in games), and multimedia decoding (mainly in desktop applications; mobile devices generally do not use the GPU for decoding).
    • The drawing process often requires floating-point operations whose logic is simple but whose data volume is large.

Introduction to CPU and GPU architecture

The CPU (Central Processing Unit) is the core component of a computer and executes program code. Software developers are familiar with it.
The GPU (Graphics Processing Unit) is primarily used for graphics computation. The core part of what is usually called a "graphics card" is the GPU.

The following compares the structure of the CPU and the GPU, in which:

    • The yellow Control block is the controller, which coordinates the operation of the whole CPU, including fetching instructions and controlling the other modules.
    • The green ALU (Arithmetic Logic Unit) performs mathematical and logical operations.
    • The orange Cache and DRAM are the cache and RAM, respectively, and are used to store information.

      Note: As the structure diagram shows, the CPU's controller is complex and its ALUs are few. The CPU is therefore good at all kinds of complicated logic, but not at math, especially floating-point arithmetic.
    • Taking the 8086 as an example, most of its 100-odd assembly instructions are logic instructions, and its math instructions are mainly 16-bit addition, subtraction, multiplication, division, and shift operations. A single integer or logical operation typically takes only a few machine cycles, whereas a floating-point operation has to be converted into integer calculations and can consume hundreds of machine cycles.
    • Simpler CPUs have only an addition instruction: subtraction is implemented as complement addition, multiplication as repeated addition, and division as a subtraction loop.
    • Modern CPUs typically include a hardware floating-point unit (FPU), but it is mainly suited to cases where the volume of data is small.
    • The CPU is a serial structure. To sum 100 numbers, for example, a single CPU core can only add two numbers at a time and accumulate the result step by step.
    • Unlike CPUs, GPUs are designed to carry out large numbers of mathematical operations. As the structure diagram shows, the GPU's controller is relatively simple, but the GPU contains a large number of ALUs. The ALUs in the GPU use a parallel design and include more floating-point units.
    • The main principle of hardware acceleration is to use underlying software to convert the graphics calculations that the CPU is poor at into GPU-specific instructions, which the GPU then executes.
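
The earlier point that a very simple CPU can build subtraction, multiplication, and division out of addition can be sketched in Java. This is only an illustration of the idea: the class and method names are hypothetical, and the multiply and divide helpers assume non-negative operands to keep the loops short.

```java
final class AddOnlyArithmetic {
    // Subtraction as two's-complement addition: a - b == a + (~b + 1).
    static int subtract(int a, int b) {
        return a + (~b + 1);
    }

    // Multiplication as repeated addition (assumes b >= 0).
    static int multiply(int a, int b) {
        if (b < 0) {
            throw new IllegalArgumentException("b must be >= 0");
        }
        int product = 0;
        for (int i = 0; i < b; i++) {
            product += a;
        }
        return product;
    }

    // Division as a subtraction loop (assumes a >= 0 and b > 0); returns the quotient.
    static int divide(int a, int b) {
        if (a < 0 || b <= 0) {
            throw new IllegalArgumentException("need a >= 0 and b > 0");
        }
        int quotient = 0;
        while (a >= b) {
            a = subtract(a, b);
            quotient++;
        }
        return quotient;
    }

    public static void main(String[] args) {
        System.out.println(subtract(10, 3)); // 7
        System.out.println(multiply(6, 7));  // 42
        System.out.println(divide(17, 5));   // 3
    }
}
```

The loop counts show why such operations are slow on a serial structure: multiply(6, 7) needs seven dependent additions, and the cost grows with the operand.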

Extension: In many computers the GPU has its own dedicated video memory. When there is no dedicated video memory, a region of main memory is set aside and shared as video memory. Video memory can hold information such as GPU instructions.

Example of parallel structure: Cascade adder

For ease of understanding, here is an example from the perspective of the underlying circuit structure. Consider an adder, corresponding to an actual digital circuit.

    • A and B are the inputs and C is the output. A, B, and C are all buses: taking a 32-bit CPU as an example, each bus actually consists of 32 wires, and the voltage on each wire represents a binary 0 or 1.
    • Clock is the clock signal line, which receives a specific voltage signal once per fixed clock cycle. Each time a clock signal arrives, the sum of A and B is output to C.

      Now suppose we want to calculate the sum of 8 integers.

For a serial structure such as the CPU, the code is very simple: a for loop adds up all the numbers. The serial structure has only one adder, so 7 additions are required, and after each one the partial sum must be fed back to the adder's input for the next addition. The entire process consumes more than 10 machine cycles at the very least.

For parallel structures, a common design is the cascade adder, in which all the clock lines are connected together. Once the 8 values to be added are ready at the inputs A1 to B4, the sum is complete after three clock cycles. The larger the amount of data and the deeper the cascade, the more pronounced the advantage of the parallel structure.
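
The difference between the two structures can be sketched in software by counting sequential steps. The sketch below is a model, not a circuit simulation: the class and method names are hypothetical, each cascade level stands for one clock cycle in which all of that level's adders work at once, and the input is assumed to be non-empty.

```java
final class CascadeAdderSketch {
    /** A sum together with the number of sequential steps needed to produce it. */
    static final class Result {
        final int sum;
        final int steps;

        Result(int sum, int steps) {
            this.sum = sum;
            this.steps = steps;
        }
    }

    // Serial structure: one adder, so every addition waits for the previous one.
    static Result serialSum(int[] values) {
        int sum = values[0];
        int steps = 0;
        for (int i = 1; i < values.length; i++) {
            sum += values[i];
            steps++;
        }
        return new Result(sum, steps);
    }

    // Cascade structure: each level adds neighbouring pairs in the same step.
    static Result cascadeSum(int[] values) {
        int[] level = values.clone();
        int steps = 0;
        while (level.length > 1) {
            int[] next = new int[(level.length + 1) / 2];
            for (int i = 0; i < level.length / 2; i++) {
                next[i] = level[2 * i] + level[2 * i + 1];
            }
            if (level.length % 2 == 1) {
                // An unpaired value passes through to the next level unchanged.
                next[next.length - 1] = level[level.length - 1];
            }
            level = next;
            steps++;
        }
        return new Result(level[0], steps);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4, 5, 6, 7, 8};
        Result serial = serialSum(data);
        Result cascade = cascadeSum(data);
        System.out.println("serial: sum=" + serial.sum + " steps=" + serial.steps);    // sum=36 steps=7
        System.out.println("cascade: sum=" + cascade.sum + " steps=" + cascade.steps); // sum=36 steps=3
    }
}
```

For 8 inputs the serial version needs 7 dependent steps and the cascade needs 3, matching the numbers in the text. The cascade still performs 7 additions in total; it only spreads them over more adders.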

Circuit limitations make it hard to gain speed by raising the clock frequency and shortening the clock cycle. Parallel structures achieve faster computation by increasing circuit scale and processing in parallel. However, complex logic is hard to implement on a parallel structure, because the outputs of multiple branches must be taken into account and coordinating and synchronizing them is very complicated (somewhat like multithreaded programming).

GPU Parallel Computing Example

Suppose we have an image-processing task: add 1 to every pixel value. GPU parallel computing handles this in a simple, brute-force way: resources permitting, it can start one GPU thread per pixel, each of which performs the add-1 operation. The larger the volume of mathematical computation, the more obvious the performance advantage of this parallel approach.
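
The shape of this per-pixel task can be sketched in plain Java. This is a CPU-side analogy only: Arrays.parallelSetAll spreads the work over CPU worker threads rather than GPU threads, and the class and method names are hypothetical. What it shares with the GPU case is that every output element depends only on the input element at the same index, so the elements can be computed in any order or all at once.

```java
import java.util.Arrays;

final class PixelIncrementSketch {
    // Serial version: one loop visits the pixels one after another.
    static int[] addOneSerial(int[] pixels) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = pixels[i] + 1;
        }
        return out;
    }

    // Data-parallel version: each index is computed independently of the others.
    static int[] addOneParallel(int[] pixels) {
        int[] out = new int[pixels.length];
        Arrays.parallelSetAll(out, i -> pixels[i] + 1);
        return out;
    }

    public static void main(String[] args) {
        int[] pixels = {0, 1, 127, 254};
        System.out.println(Arrays.toString(addOneSerial(pixels)));
        System.out.println(Arrays.toString(addOneParallel(pixels)));
    }
}
```

Both methods return the same array. The parallel form only pays off when the array is large, which mirrors the point above about the volume of computation.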

Hardware acceleration in Android

In Android, most app interfaces are built with regular Views (except for games, video, image, and similar apps, which may use OpenGL ES directly). The following compares software and hardware-accelerated rendering of Views, based on the Java-layer code of the stock Android 6.0 system.

DisplayList

A DisplayList is a basic drawing element that contains the element's original attributes (position, size, angle, transparency, and so on) and corresponds to the drawXxx() methods of Canvas.

Information delivery flow: Canvas (Java API) -> OpenGL (C/C++ lib) -> driver -> GPU.

On Android 4.1 and later, DisplayList supports properties: if some properties of a View change (such as scale, alpha, or translate), only the properties need to be updated and sent to the GPU, without generating a new DisplayList.

RenderNode

A RenderNode contains several DisplayLists. Usually one RenderNode corresponds to one View and contains all the DisplayLists of the View itself and of its child Views.
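
The idea of recording drawing commands once and then changing only a property can be sketched with a toy model in plain Java. None of the classes below are Android classes: RecordingCanvas and Node are hypothetical stand-ins for a hardware-accelerated canvas and a RenderNode, and the "display list" is just a list of strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class DisplayListSketch {
    /** Hypothetical canvas that records commands instead of drawing pixels. */
    static final class RecordingCanvas {
        private final List<String> commands = new ArrayList<>();

        void drawRoundRect(int width, int height, int radius) {
            commands.add("roundRect(" + width + "x" + height + ", r=" + radius + ")");
        }

        void drawText(String text) {
            commands.add("text(" + text + ")");
        }
    }

    /** Hypothetical node: the recorded commands plus properties such as alpha. */
    static final class Node {
        private List<String> displayList = new ArrayList<>();
        private float alpha = 1.0f;
        private int recordCount = 0;

        // Runs the drawing code once and keeps the recorded commands.
        void record(Consumer<RecordingCanvas> onDraw) {
            RecordingCanvas canvas = new RecordingCanvas();
            onDraw.accept(canvas);
            displayList = canvas.commands;
            recordCount++;
        }

        // Property-only update: the recorded commands are left untouched.
        void setAlpha(float newAlpha) {
            alpha = newAlpha;
        }

        int recordCount() {
            return recordCount;
        }

        // Stands in for the renderer replaying the commands with the current properties.
        String replay() {
            return "alpha=" + alpha + " " + displayList;
        }
    }

    public static void main(String[] args) {
        Node button = new Node();
        button.record(c -> {
            c.drawRoundRect(200, 80, 12);
            c.drawText("OK");
        });
        System.out.println(button.replay());
        button.setAlpha(0.5f);
        System.out.println(button.replay());
        System.out.println("recordings: " + button.recordCount());
    }
}
```

In this model the drawing code runs once, and changing alpha afterwards alters only how the stored commands are replayed. That is the property-update behaviour described above, reduced to its simplest form.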

Android drawing process (Android 6.0)

The following is the complete drawing flow of an Android View, worked out mainly by reading the source code and debugging; dashed arrows indicate recursive calls. A brief description of the drawing process:

    1. The path from ViewRootImpl.performTraversals to PhoneWindow.DecorView.drawChild is the fixed flow of every traversal of the View tree. It first uses flag bits to decide whether a re-layout is needed and performs the layout, then creates the Canvas and other objects and starts drawing.
      Note:
      If hardware acceleration is not supported or is turned off, software drawing is used and the generated Canvas is a Canvas.class object;
      If hardware acceleration is supported, a DisplayListCanvas.class object is generated;
      isHardwareAccelerated() returns false for the former and true for the latter, and a View uses this value to determine whether hardware acceleration is in use.
    2. In View, the recursive path draw(canvas, parent, drawingTime) - draw(canvas) - onDraw - dispatchDraw - drawChild (hereinafter the draw path) calls the Canvas.drawXxx() methods. In software rendering these methods do the actual drawing; under hardware acceleration they build the DisplayList.
    3. In View, the recursive path updateDisplayListIfDirty - dispatchGetDisplayList - recreateChildDisplayList (hereinafter the DisplayList path) is taken only under hardware acceleration. It updates DisplayList properties while the View tree is traversed for drawing, and quickly skips Views whose DisplayList does not need to be rebuilt.
      Note: As of Android 6.0, the DisplayList-related APIs are still marked "@hide" and are not accessible, which suggests they are not yet mature; later versions may open them up.
    4. Under hardware acceleration, once the draw flow has finished building the DisplayList, ThreadedRenderer.nSyncAndDrawFrame() uses the GPU to draw the DisplayList to the screen.

      The complete drawing process is as follows:

Pure software rendering vs. hardware acceleration

The following scenarios are used to analyze the drawing flow with and without hardware acceleration, and the resulting speedup.

Description

In scenario 1, the View tree is traversed along the draw path whether or not acceleration is on. Under hardware acceleration the draw path does no actual drawing; it only builds the DisplayList, and the heavy drawing computation is offloaded to the GPU, which already gives a large speedup.

In scenario 2, the TextView has the same size before and after the change, so no re-layout is triggered.

    • With software drawing, the TextView's area is the dirty region. Because the TextView has transparent areas, most Views that overlap the dirty region are redrawn, including overlapping sibling nodes and their parent nodes (described in detail later). Views that do not need to be drawn return immediately after a check in draw(canvas, parent, drawingTime).
    • With hardware acceleration, the View tree is still traversed, but only the TextView and each of its ancestors need to rebuild their DisplayList along the draw path. Every other View goes straight down the DisplayList path, and the rest of the work is handed to the GPU. The more complex the page, the more obvious the performance gap.

In scenario 3, software drawing has to do a lot of work for every frame, which can easily make the animation stutter. With hardware acceleration, the animation goes directly down the DisplayList path and only updates DisplayList properties, so animation smoothness improves greatly.

In scenario 4, the performance gap between the two is even more pronounced. For a simple change of transparency, software drawing still does a lot of work. With hardware acceleration, the RenderNode property is generally updated directly, with no need to trigger invalidate or traverse the View tree (except for the few Views that handle alpha specially and return true from onSetAlpha(); the code is as follows).

    public class View {
        // ...
        public void setAlpha(@FloatRange(from=0.0, to=1.0) float alpha) {
            ensureTransformationInfo();
            if (mTransformationInfo.mAlpha != alpha) {
                mTransformationInfo.mAlpha = alpha;
                if (onSetAlpha((int) (alpha * 255))) {
                    // The View handles alpha itself, so a full invalidate is needed.
                    // ...
                    invalidate(true);
                } else {
                    // Common case: only the RenderNode property is updated.
                    // ...
                    mRenderNode.setAlpha(getFinalAlpha());
                    // ...
                }
            }
        }

        protected boolean onSetAlpha(int alpha) {
            return false;
        }
        // ...
    }

Introduction to software drawing refresh logic

From reading the source code and experimenting, the refresh logic of software drawing in the usual case is as follows:
1. By default, a View's clipChildren property is true, meaning that the drawing area of each View cannot extend beyond the bounds of its parent View. If clipChildren is set to false on a page's root layout, child Views can draw outside the parent View's area.

2. When a View triggers invalidate, with no animation playing and no layout triggered:

    • A fully opaque View sets the flag PFLAG_DIRTY on itself, and its parent Views set the flag PFLAG_DIRTY_OPAQUE. In the draw(canvas) method, only the View itself is redrawn.
    • For a View that may have transparent areas, both the View itself and its parent Views set the flag PFLAG_DIRTY.
      When clipChildren is true, the dirty region is converted into a Rect in ViewRoot. During the refresh, the tree is checked level by level from the top down, and a View is redrawn when it overlaps the dirty region. If a View extends beyond its parent's bounds and overlaps the dirty region, but its parent does not overlap the dirty region, the child View is not redrawn.
      When clipChildren is false, ViewGroup.invalidateChildInParent() expands the dirty region to cover its own entire area, and every View that overlaps this area is redrawn.
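
The clipChildren = true case can be sketched as a top-down intersection test. This is a simplified model rather than the framework's actual code: Rect and Node are hypothetical classes, all bounds are given in absolute coordinates, and opacity flags are ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class DirtyRegionSketch {
    /** Hypothetical axis-aligned rectangle in absolute coordinates. */
    static final class Rect {
        final int left, top, right, bottom;

        Rect(int left, int top, int right, int bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        boolean intersects(Rect other) {
            return left < other.right && other.left < right
                    && top < other.bottom && other.top < bottom;
        }
    }

    /** Hypothetical view node with a name, bounds, and children. */
    static final class Node {
        final String name;
        final Rect bounds;
        final List<Node> children = new ArrayList<>();

        Node(String name, Rect bounds) {
            this.name = name;
            this.bounds = bounds;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Walks the tree from the top; a subtree is skipped as soon as its root misses the dirty rect.
    static void collectRedrawn(Node node, Rect dirty, List<String> redrawn) {
        if (!node.bounds.intersects(dirty)) {
            return;
        }
        redrawn.add(node.name);
        for (Node child : node.children) {
            collectRedrawn(child, dirty, redrawn);
        }
    }

    public static void main(String[] args) {
        Node overflow = new Node("overflow", new Rect(40, 40, 80, 80)); // extends past its parent
        Node panel = new Node("panel", new Rect(0, 0, 50, 50)).add(overflow);
        Node label = new Node("label", new Rect(60, 60, 90, 90));
        Node root = new Node("root", new Rect(0, 0, 100, 100)).add(panel).add(label);

        List<String> redrawn = new ArrayList<>();
        collectRedrawn(root, new Rect(60, 60, 70, 70), redrawn);
        System.out.println(redrawn); // [root, label]
    }
}
```

The "overflow" node overlaps the dirty rectangle but is never visited, because its parent "panel" does not overlap it. This is the case described above where a child that extends beyond its parent is not redrawn.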

Summary

To summarize hardware acceleration:
- CPUs are better at complex logic control, while GPUs, thanks to their large number of ALUs and their parallel design, are better at mathematical operations.
- A page is made up of various basic elements (DisplayLists), and rendering them requires a large amount of floating-point arithmetic.
- Under hardware acceleration, the CPU controls the complex drawing logic and builds or updates DisplayLists; the GPU performs the graphics computation and renders the DisplayLists.
- Under hardware acceleration, the CPU rebuilds or updates only the necessary DisplayLists, which further improves rendering efficiency, especially when playing animations.
- To achieve the same effect, prefer simpler DisplayLists for better performance (a shape instead of a bitmap, for example).
