Introduction to NEON (Repost)


Reposted from: http://blog.csdn.net/fengbingchun/article/details/38020265

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.

"ARM Advanced SIMD", nicknamed "NEON", provides: (1) a set of interesting scalar/vector instructions and registers (the latter are mapped to the same chip area as the FPU ones), comparable to MMX/SSE/3DNow! in the x86 world; (2) VFPv3-D32 as a requirement (i.e. 32 hardware FPU 64-bit registers, instead of the minimum of 16).

NEON technology is used on ARM Cortex-A series processors that implement the ARMv7-A or ARMv7-R architecture profiles.

The ARMv8 architecture extends the NEON support and provides backwards compatibility with ARMv7 implementations.

ARM NEON technology accelerates multimedia and signal processing algorithms such as video encoding/decoding, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis, delivering at least 3 times the performance of ARMv5 and at least twice the performance of ARMv6 SIMD. NEON is a 128-bit SIMD architecture extension for the ARM Cortex-A family of processors, designed to provide flexible and powerful acceleration for consumer multimedia applications.

Starting from ARMv7, ARM provides the Advanced SIMD (single instruction, multiple data) extension, also known as NEON technology: a hybrid 64/128-bit SIMD architecture developed by ARM that improves the performance of multimedia and signal processing applications.

NEON registers: there are 16 128-bit quadword registers, Q0-Q15, and 32 64-bit doubleword registers, D0-D31. The two sets are overlapping views of the same storage (Qn aliases D2n and D2n+1), so take special care when using them: a careless write through one view overwrites data held in the other.
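The overlap between the two register views can be modelled in portable C with a union. This is only an illustrative sketch: the type `qreg_model` and the functions `d_index_low` and `write_d_clobbers_q` are hypothetical names, not part of any NEON API, and the model captures nothing about real hardware beyond the shared storage.

```c
#include <stdint.h>

/* Hypothetical model of one quadword register: the same 128 bits can be
   viewed as one Q register or as two D registers. */
typedef union {
    uint8_t  q[16];   /* quadword view: 16 bytes                        */
    uint64_t d[2];    /* doubleword view: d[0] = D(2n), d[1] = D(2n+1)  */
} qreg_model;

/* Index of the low D register that aliases Qn (Q0 -> D0, Q1 -> D2, ...). */
int d_index_low(int qn) { return 2 * qn; }

/* Writes through the upper doubleword view and reports whether the
   quadword view changed, showing that the two views share storage. */
int write_d_clobbers_q(void)
{
    qreg_model r;
    for (int i = 0; i < 16; ++i) r.q[i] = 0;
    r.d[1] = UINT64_MAX;          /* write through the upper D view       */
    return r.q[8] == 0xFF;        /* bytes 8..15 of the Q view are now set */
}
```

In this model, code that keeps a live value in Q3 must treat D6 and D7 as occupied as well.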

NEON data types: unsigned integer, signed integer, integer of unspecified type, floating-point number, and polynomial over {0,1}. The data type describes the operands, not the destination. A data type specifier in NEON consists of a letter indicating the type, usually followed by a number indicating the width (for example U8, S16, I32, F32, P8).

NEON instructions can operate on: (1) doubleword vectors consisting of eight 8-bit elements, four 16-bit elements, two 32-bit elements, or one 64-bit element; (2) quadword vectors consisting of sixteen 8-bit elements, eight 16-bit elements, four 32-bit elements, or two 64-bit elements.

NEON has normal, long, wide, narrow, and saturating instructions: (1) normal instructions produce result vectors of the same size, and usually the same type, as the operand vectors; (2) long instructions operate on doubleword vector operands and produce a quadword vector result, whose elements are usually twice the width of the operand elements and of the same type; (3) wide instructions operate on one doubleword vector operand and one quadword vector operand and produce a quadword vector result, where the result elements and the first operand's elements are twice the width of the second operand's elements; (4) narrow instructions operate on quadword vector operands and produce a doubleword vector result, whose elements are usually half the width of the operand elements; (5) saturating instructions automatically clamp a result to the range of the data type when it would otherwise fall outside that range.
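The five shapes can be illustrated one lane at a time in plain C. This is a scalar sketch of the per-element behaviour only, using unsigned 8-bit and 16-bit lanes; the function names are hypothetical, and the narrowing model keeps the low half of each element, which is just one of the narrowing behaviours NEON offers.

```c
#include <stdint.h>

/* Normal: the result lane has the same width as the operand lanes,
   so an overflowing sum wraps around. */
uint8_t add_normal_u8(uint8_t a, uint8_t b) { return (uint8_t)(a + b); }

/* Saturating: the result is clamped to the range of the data type. */
uint8_t add_saturating_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return sum > UINT8_MAX ? UINT8_MAX : (uint8_t)sum;
}

/* Long: 8-bit operand lanes, 16-bit result lane (no overflow possible). */
uint16_t add_long_u8(uint8_t a, uint8_t b) { return (uint16_t)(a + b); }

/* Wide: first operand and result are 16-bit, second operand is 8-bit. */
uint16_t add_wide_u8(uint16_t a, uint8_t b) { return (uint16_t)(a + b); }

/* Narrow: 16-bit operand lane, 8-bit result lane (low half kept here). */
uint8_t narrow_u16(uint16_t a) { return (uint8_t)(a & 0xFF); }
```

For the inputs 200 and 100, the normal add wraps to 44, the saturating add clamps to 255, and the long add yields the full 300.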

NEON scalars: some NEON instructions can operate on a scalar in combination with vectors. A NEON scalar can be 8, 16, 32, or 64 bits wide. Apart from multiply instructions, instructions that access scalars can access any element in the register bank. The instruction syntax refers to a scalar by indexing into a doubleword vector, so that Dm[x] denotes element x of Dm. Multiply instructions allow only 16-bit or 32-bit scalars and can access only the first 32 scalars in the register bank. In multiply instructions this means: (1) 16-bit scalars are restricted to registers D0-D7, with x in the range 0-3; (2) 32-bit scalars are restricted to registers D0-D15, with x either 0 or 1.
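The scalar restrictions for multiply instructions can be written down as a small check. This is a sketch that simply encodes the two rules stated above; `mul_scalar_allowed` is a hypothetical helper, not something a toolchain provides.

```c
#include <stdbool.h>

/* Returns true if the scalar Dm[x] of the given element width may be used
   by a multiply instruction, according to the limits stated in the text. */
bool mul_scalar_allowed(int width_bits, int m, int x)
{
    if (m < 0 || x < 0) return false;
    switch (width_bits) {
    case 16: return m <= 7  && x <= 3;  /* D0-D7,  x in 0..3 */
    case 32: return m <= 15 && x <= 1;  /* D0-D15, x in 0..1 */
    default: return false;              /* only 16-bit and 32-bit scalars */
    }
}
```

Both cases cover exactly 32 scalars (8 registers of 4 elements, or 16 registers of 2 elements), which matches the "first 32 scalars" limit.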

Polynomial arithmetic over {0,1}: the coefficients 0 and 1 are combined using the rules of Boolean arithmetic: (1) 0+0=1+1=0; (2) 0+1=1+0=1; (3) 0*0=0*1=1*0=0; (4) 1*1=1. In other words, adding two polynomials over {0,1} is the same as a bitwise XOR, and multiplying two polynomials over {0,1} is the same as integer multiplication except that the partial products are combined with XOR instead of being summed.
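The two rules translate directly into C. The following is a minimal scalar sketch for 8-bit polynomials, with hypothetical function names; it models the arithmetic only and does not use any NEON instruction.

```c
#include <stdint.h>

/* Polynomial addition over {0,1} is a bitwise XOR. */
uint8_t poly_add8(uint8_t a, uint8_t b) { return (uint8_t)(a ^ b); }

/* Polynomial multiplication: the usual shift-and-add scheme, except that
   the partial products are combined with XOR, so no carries propagate. */
uint16_t poly_mul8(uint8_t a, uint8_t b)
{
    uint16_t result = 0;
    for (int bit = 0; bit < 8; ++bit) {
        if (b & (1u << bit))
            result ^= (uint16_t)((uint16_t)a << bit);  /* XOR, not add */
    }
    return result;
}
```

For example, `poly_mul8(3, 3)` returns 5 (binary 101, that is (x+1)(x+1) = x^2+1), whereas the integer product is 9.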

NEON notes: (1) Loading data: the first load brings the data into the cache; as long as the data still fits in the cache, later loads of the same data come straight from the cache and are much faster than the first. (2) A NEON multiply instruction has a latency of about 2 clock cycles, so an instruction that uses the product immediately will stall. Because the product cannot be used right away, other independent operations can be placed after the multiply at no extra cost. (3) A saturating instruction, for example a saturating multiply, performs the saturation as an extra step after the multiply, so it is slower than a plain multiply. (4) When loading or storing 16-bit data, take care with the byte offsets involved.

NEON instructions apply only to systems that support NEON. ARMv7-M does not support NEON.

The NEON instruction set is a subset of the ARM and Thumb instruction sets. NEON instruction mnemonics start with the letter V. Intrinsics (inline functions) are generally not as efficient as hand-optimized assembly. Each intrinsic is converted directly to NEON assembly instructions at compile time. To use the intrinsics, include the header file arm_neon.h and enable NEON by passing -mfpu=neon at compile time. With intrinsics you cannot control register allocation or memory alignment.
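As a minimal sketch of what intrinsic code looks like, the function below adds two float arrays four lanes at a time using `vld1q_f32`, `vaddq_f32`, and `vst1q_f32` from arm_neon.h. The function name `add_f32` is hypothetical, and the preprocessor guard with a scalar fallback is an assumption made here so that the same file also builds where NEON is not enabled.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Adds two float arrays element by element: dst[i] = a[i] + b[i]. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* A quadword register holds four 32-bit lanes. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);       /* load 4 floats      */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));   /* add and store back */
    }
#endif
    /* Scalar loop: handles the tail, or everything when NEON is absent. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

Built with GCC for a 32-bit ARM target and -mfpu=neon, the guarded loop is compiled to NEON instructions; on other targets only the scalar loop remains.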

NEON technology is primarily available on ARM Cortex-A series processors. The ARM Cortex-A series of processor cores includes the ARM Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9 MPCore, Cortex-A9 single-core, and Cortex-A15 MPCore. The Cortex-A series is a family of application processors for complex operating systems and user applications, and supports the ARM, Thumb, and Thumb-2 instruction sets.

The ARM Cortex processor cores comprise the Cortex-A series (high performance, with an MMU, able to run operating systems such as Symbian, Linux, Android, and Windows CE), the Cortex-R series (high-end embedded cores that meet real-time demands for high performance and reliability), and the Cortex-M series (embedded microcontrollers with low power consumption and low cost).

The ARM Cortex-A series processors based on the ARMv7-A architecture (Cortex-A5, Cortex-A7, Cortex-A8, Cortex-A9, Cortex-A15) share a basic feature set: the ARM, Thumb-2, and Thumb instruction sets; Jazelle technology for Java acceleration; TrustZone security extensions; the VFP hardware floating-point extension and the NEON SIMD multimedia extension; support for mainstream embedded operating systems (Symbian, Linux, Android, Windows Mobile, Windows Phone); and branch prediction. The processors differ slightly, however, in VFP/NEON version, half-precision (16-bit) floating-point support, multicore (MPCore) support, pipeline design, per-MHz performance, L1/L2 cache controller, out-of-order execution, dual-issue capability, and so on.

Cortex-A processor commonalities: (1) ARMv7-A architecture; (2) support for all major operating systems: a. full Linux distributions (Android, Chrome, Ubuntu, and Debian); b. third-party Linux (MontaVista, QNX, Wind River); c. Symbian; d. Windows CE; e. other operating systems that require a memory management unit; (3) instruction set support: ARM, Thumb-2 (which provides the best mix of code density and performance), Thumb, Jazelle, DSP; (4) TrustZone security extensions; (5) advanced VFP single-precision and double-precision floating-point support; (6) NEON media processing engine; (7) branch prediction.

Cortex-A5 ARM core processor: the Cortex-A5 supports the features of the ARMv7-A architecture, including the TrustZone security extensions and the NEON multimedia processing engine. Its chip area and power consumption are very good, but its processing performance is somewhat lower than that of the other Cortex-A cores: roughly 80% of the Cortex-A8 and half of the Cortex-A15. The Cortex-A5 can be built as a multicore processor. It supports dual issue and branch prediction, and NEON and VFP are hardware options. The Cortex-A5 supports the ARM and Thumb instruction sets and can include the Jazelle-DBX and Jazelle-RCT Java acceleration technologies. It is the smallest and lowest-power (down to 0.08-0.12 mW/MHz) ARM multicore processor and brings Internet access to the widest range of devices: ultra-low-cost handsets, feature phones and smart mobile devices, and common embedded, consumer, and industrial equipment. Cortex-A5 applications are fully compatible with the Cortex-A8, Cortex-A9, and Cortex-A15 processors, giving immediate access to established development platforms and software ecosystems, including Android, Adobe Flash, Java Platform Standard Edition (Java SE), JavaFX, Linux, Microsoft Windows Embedded, Symbian, and Ubuntu. This full application compatibility also provides a high-value migration path for the large number of existing ARM926EJ-S and ARM1176JZ-S processor licensees. The power and area of the Cortex-A5 are only about 1/3 of those of the Cortex-A9, with full instruction set compatibility. Feature keywords: VFP, NEON, Jazelle RCT, Thumb/Thumb-2, 1-4 cores, variable (L1+L2) cache, MMU+TrustZone.

Cortex-A7 ARM core processor: the power and area of the Cortex-A7 are similar to those of the ultra-efficient Cortex-A5, but its performance is 15-20% higher. The Cortex-A7 is the "little" core in ARM's big.LITTLE design and is architecturally fully compatible with the high-end Cortex-A15 CPU. The Cortex-A7 includes all the features of the high-performance Cortex-A15, including virtualization, the Large Physical Address Extensions (LPAE, which can address up to 1TB of memory), NEON, VFP, and AMBA 4 ACE coherency (AMBA 4 Cache Coherent Interconnect, CCI). The Cortex-A7 supports multicore (MPCore) designs as well as big.LITTLE designs. The small, energy-efficient Cortex-A7 is ideal as a standalone CPU in the latest low-cost smartphones and tablets, and can be combined with the Cortex-A15 in a big.LITTLE processing configuration. Feature keywords: VFPv4 FPU, NEON, Thumb-2, Jazelle RCT/DBX, out-of-order speculative issue superscalar, Large Physical Address Extensions (LPAE), hardware virtualization, 1-4 SMP cores, 32KB/32KB L1, up to 4MB L2, MMU+TrustZone.

Cortex-A8 ARM core processor: the Cortex-A8 is the first processor to use the ARMv7-A architecture. Many application processors are built around the Cortex-A8, such as the S5PC100 (Samsung), OMAP3530 (TI, Texas Instruments), and i.MX515 (Freescale). The Cortex-A8 is an in-order, dual-issue superscalar processor that provides 2.0 Dhrystone MIPS per MHz in highly energy-efficient implementations, delivering the level of performance required by devices based on traditional single-core processors. The Cortex-A8 established the ARMv7 architecture in the marketplace and is used in a wide variety of applications, including smartphones, smartbooks, portable media players, and other consumer and enterprise platforms. The separate L1 instruction and data caches can each be 16KB or 32KB, and instructions and data share an L2 cache of up to 1MB. The data paths of the L1 and L2 caches are 128 bits wide; the L1 cache is virtually indexed and physically tagged, while the L2 cache uses physical addresses only. The Cortex-A8 L1 cache line is 64 bytes wide, and the L2 cache is integrated on-chip. Compared with the Cortex-A9, the Cortex-A8 has very limited VFP floating-point support and its VFP is very slow: the same floating-point operation often runs at 1/10 of the Cortex-A9's speed. The Cortex-A8 can execute certain NEON instructions concurrently (for example a NEON load/store together with another NEON instruction), whereas the Cortex-A9 cannot because of its NEON bit-width limit. On the Cortex-A8 the ARM and NEON execution pipelines are separate: NEON can access ARM registers quickly, but moving NEON register data to the ARM side is very slow. Feature keywords: VFP, NEON, Jazelle RCT, Thumb-2, 13-stage superscalar pipeline, variable (L1+L2) cache, MMU+TrustZone. Devices that use the Cortex-A8 include: Apple iPad 1 (Apple A4 processor), BeagleBoard (TI OMAP3530 or TI DM3730), HTC Desire, SBM7000, Oregon State University OSWALD, Gumstix Overo Earth, Pandora, Apple iPhone 3GS, Apple iPod Touch (3rd and 4th generation), Apple iPhone 4 (A4), Archos 5, Motorola Droid, Motorola Droid X, Motorola Droid 2, Motorola Droid R2D2 Edition, Palm Pre, Samsung Omnia HD, Samsung Wave S8500, Samsung i9000 Galaxy S, Sony Ericsson Satio, Touch Book, Nokia N900, Meizu M9, Google Nexus S, and Sharp PC-Z1 "Netwalker".

Cortex-A9 ARM core processor: the Cortex-A9, in MPCore or single-core form, has higher per-MHz performance than the Cortex-A5 or Cortex-A8 and supports ARM, Thumb, Thumb-2, TrustZone, Jazelle RCT, and Jazelle DBX. The L1 cache controller maintains cache coherency in hardware and supports coherency across multiple cores. The L2 cache controller outside the core (L2C-310, or PL310) supports caches of up to 8MB. The Cortex-A9 L1 cache line is 32 bytes wide; because of the multicore design, the L2 cache shared among the cores is accessed through the SCU (Snoop Control Unit). Feature keywords: application profile, VFPv3 FPU, NEON, Thumb-2, Jazelle RCT/DBX, out-of-order speculative issue superscalar, 1-4 core SMP, 32KB/32KB L1, up to 4MB L2, MMU+TrustZone. Devices that use the Cortex-A9 include: Nvidia's dual-core Tegra 2, TI's OMAP4 platform, Apple iPad 2 (Apple A5 processor), LG Optimus 2X (Nvidia Tegra 2), Samsung Galaxy S II (Samsung Exynos 4210), Sony NGP (PSP2), PandaBoard (TI OMAP4430 or TI OMAP4460), Motorola Atrix 4G, Motorola Droid Bionic, and Motorola Xoom.

Cortex-A15 ARM core processor: the Cortex-A15 MPCore is the highest-performing processor in the Cortex-A family. Its outstanding features are hardware virtualization and the Large Physical Address Extension (LPAE), which can address up to 1TB of memory. Feature keywords: application profile, VFPv4 FPU, NEON, Thumb-2, Jazelle RCT/DBX, out-of-order speculative issue superscalar, Large Physical Address Extensions (LPAE), hardware virtualization, 1-4 SMP cores, 32KB/32KB L1, up to 4MB L2, MMU+TrustZone. Devices that use the Cortex-A15: at the time of writing, the only processors in production that integrate the Cortex-A15 are Samsung's Exynos 5 series, although TI's OMAP5 series also uses Cortex-A15 cores.

The NEON pipeline differs between the Cortex-A8 and Cortex-A9 processors. NEON instructions are usually issued in a single cycle, but their results may not be available until several cycles later; only the results of simple instructions such as VSUB, VADD, and VMOV can be used in the next cycle. Moving data from NEON registers to ARM registers is expensive, with a delay of at least 20 cycles, so avoid such transfers wherever possible. Also try to avoid having the ARM core and the NEON unit access the same data region.

Not all ARMv7-based Android devices support NEON.

Define LOCAL_ARM_NEON to 'true' in your module definition, and the NDK will build all its source files with NEON support. This can be useful if you want to build a static or shared library that specifically contains NEON code paths.

To compile a file that uses ARM NEON intrinsics with the GCC compiler, you need to add -mfpu=neon.

The NDK supports the compilation of modules, or even specific source files, with support for NEON. What this means is that a specific compiler flag will be used to enable the use of GCC ARM NEON intrinsics and VFPv3-D32 at the same time.

NEON support only works when targeting the 'armeabi-v7a' ABI; otherwise the NDK build scripts will complain and abort. It is important to use checks like the following in your Android.mk:

    # define a static library containing our NEON code
    ifeq ($(TARGET_ARCH_ABI),armeabi-v7a)
    include $(CLEAR_VARS)
    LOCAL_MODULE    := mylib-neon
    LOCAL_SRC_FILES := mylib-neon.c
    LOCAL_ARM_NEON  := true
    include $(BUILD_STATIC_LIBRARY)
    endif # TARGET_ARCH_ABI == armeabi-v7a

Not all ARMv7-based Android devices support NEON! It is thus crucial to perform runtime detection to know whether the NEON-capable machine code can be run on the target device. To do that, use the 'cpufeatures' library that comes with the NDK. You should explicitly check that android_getCpuFamily() returns ANDROID_CPU_FAMILY_ARM, and that android_getCpuFeatures() returns a value that has the ANDROID_CPU_ARM_FEATURE_NEON flag set, as in:

    #include <cpu-features.h>

    ...
    ...

    if (android_getCpuFamily() == ANDROID_CPU_FAMILY_ARM &&
        (android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_NEON) != 0)
    {
        // use NEON-optimized routines
        ...
    }
    else
    {
        // use non-NEON fallback routines instead
        ...
    }

    ...

The arm_neon.h header differs slightly between Android versions: the newer the version, the more intrinsics it contains.

Intel Atom processors can also run Android (x86); they support TBB as well as SSE2, SSE3, and so on.

NEON libraries: (1) Project Ne10; (2) OpenMAX; (3) FFmpeg; (4) Eigen3; (5) Pixman; (6) x264; (7) math-neon; (8) libjpeg-turbo; (9) Android Skia.

NEON C/C++ intrinsics are available in armcc, GCC/g++, and LLVM.

When you compile a file, the compiler must know which processor you want the code to run on. The primary option for this is -mcpu=cpu-name, where cpu-name is the name of the processor. If you don't specify the processor to use, GCC will use its built-in default. The default can vary depending on how the compiler was originally built, and the generated code might not execute, or might execute slowly, on the CPU that you have.

There is no support for NEON instructions in architectures before ARMv7.

Load and store addresses should be aligned to cache lines to allow more efficient memory access. This requires at least 16-word alignment on the Cortex-A8. If it isn't possible to align the start of the input and output arrays, then it's better to process the unaligned elements as single elements. This means that some of the elements at the beginning of the array and some of the elements at the end of the array can be processed as single elements.
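The head/body/tail split described above can be sketched in portable C as follows. The names are hypothetical; `ALIGN_BYTES` is set to 64 on the assumption that "16-word" means 16 32-bit words, and `VEC_BYTES` to 16 for one quadword. The inner loop over one block merely stands in for a single vector operation.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 64u  /* assumed: 16 words x 4 bytes          */
#define VEC_BYTES   16u  /* one 128-bit quadword worth of bytes  */

/* Number of leading bytes to process singly before p reaches
   ALIGN_BYTES alignment (never more than n). */
size_t head_count_u8(const uint8_t *p, size_t n)
{
    size_t misalign = (size_t)((uintptr_t)p % ALIGN_BYTES);
    size_t head = misalign ? ALIGN_BYTES - misalign : 0;
    return head < n ? head : n;
}

/* Adds k to every byte: scalar head, aligned vector-sized blocks,
   then a scalar tail for whatever is left over. */
void add_const_u8(uint8_t *p, size_t n, uint8_t k)
{
    size_t i = 0;
    size_t head = head_count_u8(p, n);

    for (; i < head; ++i)                      /* unaligned head        */
        p[i] = (uint8_t)(p[i] + k);
    for (; i + VEC_BYTES <= n; i += VEC_BYTES) /* aligned body          */
        for (size_t j = 0; j < VEC_BYTES; ++j) /* stands in for one op  */
            p[i + j] = (uint8_t)(p[i + j] + k);
    for (; i < n; ++i)                         /* leftover tail         */
        p[i] = (uint8_t)(p[i] + k);
}
```

When input and output are separate arrays with different misalignment, only one of them can be brought into alignment this way, so the sketch covers the in-place case only.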

The NEON intrinsics cover addition, multiplication, rounding, subtraction, comparison, absolute difference, max, min, logical operations, getting and setting a lane value, combining, splitting, type conversion, table lookup, load, store, and so on: nearly 1900 functions in total.

