An example of code optimization in C by calling assembly and using instruction sets


Build Environment

The idea behind x264's assembly optimization is to compile the assembly code into a static library that the C code then calls. Therefore, the first step is to build the assembly functions into a static library. Because manually configuring yasm to compile the assembly file and generate a lib is quite troublesome, I chose CMake to drive the build.

In the demo there is a sum.asm assembly file containing all the assembly functions. yasm compiles it into sum.obj, which is then turned into the sum.lib library for the C code to link against. There is also a main.c file, which is used to generate the executable main. The CMakeLists.txt file is as follows:

cmake_minimum_required(VERSION 3.0.0)
project(asm)
find_program(YASM_EXECUTABLE
    NAMES yasm yasm-1.2.0-win32 yasm-1.2.0-win64
    HINTS $ENV{YASM_ROOT} ${YASM_ROOT}
    PATH_SUFFIXES bin)
set(FLAGS -f win64 -DARCH_X86_64=1)
add_custom_command(OUTPUT sum.obj
    COMMAND ${YASM_EXECUTABLE}
    ARGS ${FLAGS} ../source/sum.asm -o sum.obj
    DEPENDS sum.asm)
# add the static library sum
add_library(sum STATIC sum.obj sum.asm)
set_target_properties(sum PROPERTIES LINKER_LANGUAGE C)
# add the executable program main
add_executable(main main.c)
target_link_libraries(main sum)


find_program searches the system environment variables to check whether the yasm assembler exists.

Note that in COMMAND ${YASM_EXECUTABLE} ARGS ${FLAGS} ../source/sum.asm -o sum.obj, the path of the assembly file must be given relative to the directory where the command runs (the build directory), which is why it appears here as ../source/sum.asm.

Now that the environment has been set up, you can use CMake to generate both the project that builds the assembly lib and the project that calls the assembly function to produce the executable, as shown in the following figure.


About Assembly

First, write a simple example (this is 64-bit assembly). Assume that the main function needs to sum two numbers. The C code is as follows:

int sum(int a, int b); // this function is implemented in assembly

int main(int argc, char *argv[])
{
    int num = sum(2, 3);
    return 0;
}


The assembly code implementing the sum function is as follows:

global sum
sum:
    add ecx, edx    ; directly use the parameters in the ecx and edx registers
    mov eax, ecx
    ret


This is one of the simplest ways to call an assembly function from C. When writing the assembly function, I ran into the following issues:

When I learned 32-bit assembly, function parameters were passed on the stack. In 64-bit (Win64) assembly, the first four integer parameters are passed in the registers rcx, rdx, r8, and r9 (for 32-bit ints, their low halves ecx, edx, r8d, and r9d); only parameters beyond the fourth are passed on the stack. That is why the assembly code above uses the values in the ecx and edx registers directly.

If the number of function parameters is greater than 4, assume that the C code is as follows:

int sum(int a, int b, int c, int d, int e);

int main(int argc, char *argv[])
{
    int num = sum(2, 3, 4, 5, 6);
    return 0;
}


The assembly code is as follows:

global sum
sum:
    add rcx, rdx
    add rcx, r8
    add rcx, r9
    mov rdx, [rsp + 40]   ; fetch the 5th parameter from the stack into the rdx register
    add rcx, rdx
    mov rax, rcx
    ret


Note that while the first four parameters are passed in registers, the fifth parameter is not read from [rsp + 8]; it is read from [rsp + 40] (40 = 4*8 + 8, i.e. the return address plus 32 bytes of reserved "shadow space" for the four register parameters). So although the first four parameters are passed in registers, the caller still reserves the corresponding space for them on the stack; my understanding is that this helps keep compatibility with __stdcall and __cdecl.
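To sanity-check that offset, the five-argument sum can be driven from a small C program and the result printed. This is only a minimal sketch, assuming sum.obj built from the assembly above is linked into the executable:

#include <stdio.h>

/* Implemented in sum.asm: returns the sum of its five integer arguments. */
int sum(int a, int b, int c, int d, int e);

int main(void)
{
    /* Expected output: 20. The fifth argument (6) is the one
       the assembly reads from [rsp + 40]. */
    printf("sum(2, 3, 4, 5, 6) = %d\n", sum(2, 3, 4, 5, 6));
    return 0;
}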

Using Instruction Set Optimization (SSE, AVX, etc.)

First, let's take a look at the SIMD registers.


The SIMD registers used by SSE are 128-bit; there are 16 of them, XMM0 through XMM15.

The SIMD registers extended by AVX are 256-bit, also 16 in total, YMM0 through YMM15; AVX can also use the XMM registers of SSE.

In AVX-512, the registers are extended to 512 bits, with a total of 32 registers, ZMM0 through ZMM31.

Suppose our main function is to sum two arrays. The code is as follows:

#define N 8

int sum(float a[], float b[]);   // implemented in assembly

int sum_c(float a[], float b[])
{
    for (int i = 0; i < N; i++) {
        a[i] += b[i];
    }
    return 0;
}

int main(int argc, char *argv[])
{
    float a[N] = {2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
    float b[N] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    // add the data in array b[N] to array a[N]
    sum_c(a, b);   // without assembly optimization
    // sum(a, b);  // with assembly optimization
    return 0;
}


We can see that, without assembly optimization, the sum_c function computes a[i] + b[i] one element at a time and stores the result back into a[i].
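If you want to check that the assembly version behaves exactly like sum_c, a small comparison routine can be added. This is only a sketch (verify_sum is an illustrative helper, not part of the original demo), assuming the assembly sum is linked in:

#include <stdio.h>
#include <string.h>

#define N 8

int sum(float a[], float b[]);    /* assembly version from sum.asm */
int sum_c(float a[], float b[]);  /* plain C reference version     */

/* Runs both versions on copies of the same input and compares the results.
   Returns 1 on a full match, 0 on the first mismatch. */
static int verify_sum(const float a0[N], const float b0[N])
{
    float a1[N], b1[N], a2[N], b2[N];
    memcpy(a1, a0, sizeof a1);  memcpy(b1, b0, sizeof b1);
    memcpy(a2, a0, sizeof a2);  memcpy(b2, b0, sizeof b2);

    sum(a1, b1);    /* assembly-optimized version */
    sum_c(a2, b2);  /* element-by-element C loop  */

    for (int i = 0; i < N; i++) {
        if (a1[i] != a2[i]) {
            printf("mismatch at %d: %f vs %f\n", i, a1[i], a2[i]);
            return 0;
        }
    }
    return 1;
}

Calling verify_sum(a, b) from main before the arrays are modified reports the first mismatching element, if any.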

If you use SSE instruction set optimization, the code is as follows:

global sum
sum:
    movups xmm0, [rcx]
    movups xmm1, [rdx]
    movups xmm2, [rcx + 16]
    movups xmm3, [rdx + 16]
    addps  xmm0, xmm1
    addps  xmm2, xmm3
    movups [rcx], xmm0
    movups [rcx + 16], xmm2
    ret

    
We can see that the eight element-wise sums of a[8] and b[8] are computed with just two addps operations. Two are needed because an xmm register is 128 bits wide and can therefore hold only four floats at a time, so the eight elements are processed in two batches.
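The same two-batch addition can also be written from C with SSE intrinsics instead of a separate .asm file. This is just a sketch; the name sum_sse is illustrative and not part of the original demo:

#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds two arrays of 8 floats, 4 at a time, storing the result back into a[]. */
int sum_sse(float a[], float b[])
{
    __m128 va0 = _mm_loadu_ps(a);      /* a[0..3] */
    __m128 vb0 = _mm_loadu_ps(b);      /* b[0..3] */
    __m128 va1 = _mm_loadu_ps(a + 4);  /* a[4..7] */
    __m128 vb1 = _mm_loadu_ps(b + 4);  /* b[4..7] */

    _mm_storeu_ps(a,     _mm_add_ps(va0, vb0));
    _mm_storeu_ps(a + 4, _mm_add_ps(va1, vb1));
    return 0;
}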

The AVX instruction set optimization code is as follows:

global sum
sum:
    vmovups ymm1, [rcx]
    vmovups ymm2, [rdx]
    vaddps  ymm0, ymm1, ymm2
    vmovups [rcx], ymm0
    ret


Because AVX uses 256-bit ymm registers, eight 32-bit floats can be processed at a time, so the two groups of eight floats are summed with a single vaddps.
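For comparison, the equivalent single-operation addition written with AVX intrinsics from C might look like the sketch below; sum_avx is an illustrative name, and the compiler has to be told to emit AVX code (for example, gcc needs -mavx):

#include <immintrin.h>  /* AVX intrinsics */

/* Adds two arrays of 8 floats in a single 256-bit operation, storing into a[]. */
int sum_avx(float a[], float b[])
{
    __m256 va = _mm256_loadu_ps(a);  /* a[0..7] */
    __m256 vb = _mm256_loadu_ps(b);  /* b[0..7] */
    _mm256_storeu_ps(a, _mm256_add_ps(va, vb));
    return 0;
}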



Will the C compiler automatically optimize for specific instruction sets when generating assembly?

I mean: if I write C code that implements an algorithm and my computer supports the SSE4 instruction set, will the compiler automatically find the parallelizable parts of the algorithm during compilation and use SSE4 instructions to generate assembly code and an executable that exploit that parallelism?

The compiler is gcc or VC++.


It depends on the compiler you are using. Check the compiler's documentation; it will tell you which instruction sets it supports and what optimizations it can perform.

Different compilers behave differently.

Supplement: I am not sure about GCC, and you did not even mention your VC++ version. VC6 does not support SSE unless VC6 SP5 is installed.
Both VS2005 and VS2008 support SSE. The best optimizer for the SSE/MMX instruction sets is Intel's C++ compiler.

For parallel and high-performance computing, Fortran has great advantages; in particular, Fortran 2003 introduced many features aimed at parallel computing. Intel also offers a Fortran compiler.
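To make the auto-vectorization question concrete, here is the kind of loop that gcc will typically vectorize at -O3 (optionally with -msse4 or -mavx to pick the instruction set) and that recent Visual C++ versions can vectorize at /O2 with an appropriate /arch setting. Whether it actually happens depends on the compiler and version, so treat this only as an illustrative sketch:

#define N 1024

/* An element-wise addition over independent iterations with sequential memory
   access: a typical candidate for compiler auto-vectorization. The restrict
   qualifiers (C99) tell the compiler the arrays do not overlap, which makes
   vectorization easier to prove safe. */
void add_arrays(float * restrict a, const float * restrict b)
{
    for (int i = 0; i < N; i++) {
        a[i] += b[i];
    }
}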
