One of my programming interests is big-number arithmetic, so I have written many small big-number routines in assembly language. Last week I suddenly had an idea for converting an integer to a hexadecimal string with SSE2 instructions, and I went ahead and implemented it. I originally expected a speedup of about 5x; testing showed that it is more than 20 times as fast as the original C version. I was excited enough to write this blog post.
This program demonstrates a single task: converting a 64-bit integer into a 16-character hexadecimal string. The function and the algorithm are simple. I believe many people have written similar code, but I do not know whether anyone has tried to optimize it. The demonstration program consists of three C versions and one assembly version that uses SSE2 instructions. The code and explanations follow.
First, look at the original version of the function, uint64_2_hexstring_c1. For performance, we use the __fastcall calling convention. A __fastcall function receives its arguments in registers, which avoids the cost of pushing them onto the stack at the call site, and the callee may also be able to save and restore fewer registers.
This is the plain C version, which operates directly on a 64-bit integer with logical and arithmetic operations:

void __fastcall uint64_2_hexstring_c1(UINT64 *p, char *buff)
{
    UINT64 x = *p;
    int i;
    for (i = 15; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';   // lowest hex digit as '0'..'?'
        if (c > '9') c += 7;        // map ':'..'?' to 'A'..'F'
        buff[i] = c;
        x >>= 4;
    }
    buff[16] = 0;
}
Although this function is simple, its speed is far from ideal. In a 32-bit runtime environment, each C statement that operates on a 64-bit integer is translated into several instructions, which slows it down. The next version does all of its work on 32-bit integers, so it is faster than the one above.
This is the improved C version, which uses only 32-bit logical and arithmetic operations:

void __fastcall uint64_2_hexstring_c2(UINT64 *p, char *buff)
{
    DWORD *pdw = (DWORD *)p;
    DWORD x;
    int i;
    for (x = pdw[1], i = 7; i >= 0; i--)   // high 32 bits -> buff[0..7]
    {
        char c = (x & 0xf) + '0';
        if (c > '9') c += 7;
        buff[i] = c;
        x >>= 4;
    }
    for (x = pdw[0], i = 7; i >= 0; i--)   // low 32 bits -> buff[8..15]
    {
        char c = (x & 0xf) + '0';
        if (c > '9') c += 7;
        buff[8+i] = c;
        x >>= 4;
    }
    buff[16] = 0;
}
This function reinterprets the address of the 64-bit integer as the address of two 32-bit integers and then works on 32-bit values only (on little-endian x86, pdw[1] is the high half and pdw[0] is the low half). The code is longer, but faster. However, there is still room for optimization.
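The pointer cast in uint64_2_hexstring_c2 depends on the byte order of the machine. As a minimal sketch, assuming standard C99 fixed-width types rather than the post's UINT64/DWORD typedefs, the same split can be written with shifts, which does not depend on byte order. The names split_u64, u32_to_hex8, and u64_to_hex_split are hypothetical and are not part of the original program.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: split a 64-bit value into 32-bit halves with
   shifts instead of a pointer cast, so byte order does not matter. */
void split_u64(uint64_t x, uint32_t *hi, uint32_t *lo)
{
    *hi = (uint32_t)(x >> 32);  /* upper 8 hex digits */
    *lo = (uint32_t)x;          /* lower 8 hex digits */
}

/* Convert one 32-bit half into 8 uppercase hex characters (no terminator). */
static void u32_to_hex8(uint32_t x, char *out)
{
    int i;
    for (i = 7; i >= 0; i--) {
        char c = (char)((x & 0xf) + '0');
        if (c > '9') c += 7;    /* 'A' - '9' - 1 == 7 */
        out[i] = c;
        x >>= 4;
    }
}

/* Same result as the C2 version, built from the two halves. */
void u64_to_hex_split(uint64_t x, char *buff)
{
    uint32_t hi, lo;
    split_u64(x, &hi, &lo);
    u32_to_hex8(hi, buff);
    u32_to_hex8(lo, buff + 8);
    buff[16] = '\0';
}

/* Usage example. */
void u64_to_hex_split_check(void)
{
    char buf[17];
    u64_to_hex_split(0x0123456789abcdefULL, buf);
    assert(strcmp(buf, "0123456789ABCDEF") == 0);
}
```

On a 32-bit x86 target a compiler will typically turn the shift by 32 into a plain register selection, so this form should not cost more than the cast, although that has not been measured here.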
Notice that uint64_2_hexstring_c2 consists of two loops, and each loop body contains an if statement. From the assembly-language point of view, both loops and if statements are branches, and each compiles to a compare-and-jump instruction pair. Branches are troublesome for the CPU. Modern CPUs are heavily pipelined: while the current instruction executes, the following instructions have already been fetched and even decoded. When the CPU meets a branch, it has to predict which path is more likely to be taken and load the instructions of that path into the pipeline. If the prediction is correct, all instructions flow through without a stall. If it is wrong, the CPU must discard everything already loaded into the pipeline and fetch and decode again from the correct target.
Branch prediction is a complex subject, and a full description would fill a book; what follows is only a brief introduction. A typical implementation works like this: the first time a branch is encountered, the not-taken path is assumed. Each time the branch executes, the actual outcome (which path was taken) is recorded in a history. When the branch is met again, the history tells the CPU which path is more likely. One of the simplest schemes is to always predict the path with the higher observed probability, and it works very well for loop branches. For example, in a loop that runs n times, the branch at the bottom of the loop jumps back to the top on every iteration except the last. The taken path is far more likely than the not-taken path, so the CPU always predicts taken. In our example, both loops have a small, fixed iteration count, so the compiler can also eliminate the loop branch altogether by unrolling the loop.
The branch in the statement "if (c > '9') c += 7" is a different matter, and branch prediction does little for it. For a random value between 0 and 15, the probability of being greater than 9 is 37.5% (6 of 16 values). Even if the CPU always predicts the c <= '9' path, it will be wrong 37.5% of the time, and every misprediction is expensive: it costs from several to more than ten extra cycles.
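To make the two numbers above concrete, here is a small sketch. It counts how many of the 16 nibble values take the "c > '9'" path, and it simulates a 2-bit saturating counter on the outcomes of a loop-closing branch. The 2-bit counter is a textbook simplification chosen for illustration, not a description of what the i7-4700HQ actually implements, and all function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count how many of the 16 possible nibbles take the "c > '9'" branch. */
int count_letter_nibbles(void)
{
    int n, count = 0;
    for (n = 0; n < 16; n++)
        if (n + '0' > '9') count++;     /* true for n = 10..15 */
    return count;
}

/* Simulate a 2-bit saturating counter (states 0..3, predict "taken" when
   the state is 2 or 3), starting at 0 = "strongly not taken".
   outcomes[i] is 1 if the branch was taken and 0 otherwise.
   Returns the number of mispredictions. */
int count_mispredictions(const int *outcomes, size_t n)
{
    int state = 0, misses = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        int predicted = (state >= 2);
        if (predicted != outcomes[i]) misses++;
        if (outcomes[i]) { if (state < 3) state++; }
        else             { if (state > 0) state--; }
    }
    return misses;
}

/* Usage example. */
void predictor_demo(void)
{
    /* Bottom-of-loop branch for two runs of an 8-iteration loop:
       taken 7 times, then falls through once. */
    static const int two_runs[16] = {1,1,1,1,1,1,1,0, 1,1,1,1,1,1,1,0};

    assert(count_letter_nibbles() == 6);             /* 6/16 = 37.5% */
    assert(count_mispredictions(two_runs, 8) == 3);  /* cold start */
    assert(count_mispredictions(two_runs, 16) == 4); /* 2nd run: exit only */
}
```

In this model, once the counter has warmed up, only the loop exit is mispredicted, which is why loop branches are cheap. A data-dependent branch that goes one way 37.5% of the time has no such regular pattern to learn.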
The technique used in the next version is branch elimination: the function runs faster because its data-dependent branch is gone. Eliminating the branch requires a few extra statements, but even with more code, the result is much faster.
This is the C version that uses branch elimination. On an i7-4700HQ it is 2.6 times as fast as the previous version:

void __fastcall uint64_2_hexstring_c3(UINT64 *p, char *buff)
{
    DWORD *pdw = (DWORD *)p;
    DWORD x;
    int i;
    for (x = pdw[1], i = 7; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';
        char mask = 0 - (c > '9');   // mask is 0xff when c > '9', else 0
        buff[i] = c + (mask & 7);
        x >>= 4;
    }
    for (x = pdw[0], i = 7; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';
        char mask = 0 - (c > '9');   // mask is 0xff when c > '9', else 0
        buff[i+8] = c + (mask & 7);
        x >>= 4;
    }
    buff[16] = 0;
}
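The mask trick can be checked in isolation. The sketch below puts the branching form and the branch-free form side by side and confirms that they agree for all 16 nibble values. It also shows a 16-entry lookup table, which is a different branch-free technique that the original post does not use; it is included only for comparison. All function names are hypothetical.

```c
#include <assert.h>

/* Branching form, as in uint64_2_hexstring_c2. */
char nibble_to_hex_branchy(unsigned n)
{
    char c = (char)((n & 0xf) + '0');
    if (c > '9') c += 7;
    return c;
}

/* Branch-free form, as in uint64_2_hexstring_c3: (c > '9') is 0 or 1,
   so 0 - (c > '9') is 0 or -1 (all bits set), and mask & 7 is 0 or 7. */
char nibble_to_hex_branchless(unsigned n)
{
    char c = (char)((n & 0xf) + '0');
    char mask = (char)(0 - (c > '9'));
    return (char)(c + (mask & 7));
}

/* Alternative not used in the post: a 16-entry lookup table. */
char nibble_to_hex_table(unsigned n)
{
    static const char digits[] = "0123456789ABCDEF";
    return digits[n & 0xf];
}

/* Usage example: all three forms agree on every nibble. */
void check_all_nibbles(void)
{
    unsigned n;
    for (n = 0; n < 16; n++) {
        assert(nibble_to_hex_branchy(n) == nibble_to_hex_branchless(n));
        assert(nibble_to_hex_branchy(n) == nibble_to_hex_table(n));
    }
}
```

Whether a compiler already turns the branching form into a conditional move or a similar branch-free sequence depends on the compiler and its options, so the measured gain of the hand-written mask can vary.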
Finally, here is the ultimate version, written in assembly with SSE2 instructions. Hand-optimizing with ordinary ALU instructions yields limited gains and may even do worse than the compiler, so we use SSE2 instructions directly. SSE2 instructions mainly work on the XMM registers. One XMM register is 128 bits wide and can be treated as 4 DWORDs, 8 WORDs, or 16 BYTEs. A UINT64 converts to a string of 16 characters, which fits exactly in one XMM register, so this job is a perfect match for SSE2.
The assembly code below uses several 16-byte constant arrays that must be 16-byte aligned. Aligned data can be defined in assembly as well, but here the arrays are defined in a C file. The benefit is flexibility: for example, the C file can request 32-byte alignment, whereas when I tried to define 32-byte alignment in the assembly file, the assembler always reported an error. The constant arrays are defined in C as follows.
#include <stdio.h>
#include <stdlib.h>

typedef unsigned char BYTE;

#if defined(__GNUC__)    // GCC
#define INTRIN_ALIGN(n) __attribute__((aligned(n)))
#else
#define INTRIN_ALIGN(n) __declspec(align(n))
#endif    // #if defined(__GNUC__)    // GCC

// movdqa needs at least 16-byte alignment; a larger value such as 32 also works.
INTRIN_ALIGN(16) BYTE i256_num_0s[16] = {'0','0','0','0','0','0','0','0','0','0','0','0','0','0','0','0',};
INTRIN_ALIGN(16) BYTE i256_num_9s[16] = {'9','9','9','9','9','9','9','9','9','9','9','9','9','9','9','9',};
INTRIN_ALIGN(16) BYTE i256_num_full7[16] = {7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,};
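To picture how one 128-bit register can be read as 2 QWORDs, 4 DWORDs, 8 WORDs, or 16 BYTEs, here is a small sketch that uses a C union as a stand-in for an XMM register. The type xmm_view is hypothetical, and the expected values assume a little-endian machine such as x86.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: four ways of viewing the same 128 bits. */
typedef union {
    uint64_t q[2];   /* 2 QWORDs  */
    uint32_t d[4];   /* 4 DWORDs  */
    uint16_t w[8];   /* 8 WORDs   */
    uint8_t  b[16];  /* 16 BYTEs  */
} xmm_view;

/* Usage example (little-endian byte order assumed). */
void xmm_view_demo(void)
{
    xmm_view v;
    v.q[0] = 0x0123456789abcdefULL;
    v.q[1] = 0;

    assert(v.d[0] == 0x89abcdefu);  /* low DWORD comes first */
    assert(v.d[1] == 0x01234567u);
    assert(v.w[0] == 0xcdef);
    assert(v.b[0] == 0xef);         /* lowest byte comes first */
    assert(v.b[7] == 0x01);
}
```

Because the lowest byte comes first in memory while the string must start with the highest hex digit, a routine that writes the register straight to the output buffer has to reverse the order of the pieces inside the register first.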
The following is the assembly source, in Microsoft Macro Assembler (ml.exe) format.
.686P
.XMM
.MODEL FLAT

_DATA SEGMENT
EXTRN _i256_num_0s:BYTE
EXTRN _i256_num_9s:BYTE
EXTRN _i256_num_full7:BYTE
_DATA ENDS

PUBLIC @uint64_2_hexstring_sse2@8

_TEXT SEGMENT

; XMM register definitions for function uint64_2_hexstring_sse2
xmm_r_db_str_0  TEXTEQU <xmm2>
xmm_r_db_str_9  TEXTEQU <xmm3>
xmm_r_db_7      TEXTEQU <xmm4>
xmm_r_tmp       TEXTEQU <xmm1>

@uint64_2_hexstring_sse2@8 PROC
    movq    xmm0, mmword ptr [ecx]
    pshufd  xmm0, xmm0, 11001101b
    ; We denote the value of an XMM register with a string: "n" means a data byte,
    ; "0" means a byte whose value is 0, and low bytes are written first.
    ; Now the value of XMM0 is (nnnn0000,nnnn0000): it contains 2 QWORDs,
    ; and the low 32 bits of every QWORD are valid

    ; The first round transform, 2 QWORD => 4 DWORD
    movdqa  xmm_r_tmp, xmm0
    psllq   xmm0, 48            ; shift toward high by 48 bits
    ; now XMM0 is (000000nn,000000nn), bit0-bit15 of every original QWORD
    psrlq   xmm_r_tmp, 16       ; shift toward low by 16 bits
    ; now XMM1 is (nn000000,nn000000), bit16-bit31 of every original QWORD
    psrlq   xmm0, 16            ; shift toward low by 16 bits
    ; now XMM0 is (0000nn00,0000nn00), bit0-bit15 of every original QWORD
    por     xmm0, xmm_r_tmp     ; merge bit0-bit15 and bit16-bit31
    ; now XMM0 is (nn00,nn00,nn00,nn00): 4 DWORDs,
    ; and the low 16 bits of every DWORD are valid

    ; The second round transform, 4 DWORD => 8 WORD
    movdqa  xmm_r_tmp, xmm0
    pslld   xmm0, 24            ; shift toward high by 24 bits
    ; now XMM0 is (000n,000n,000n,000n), bit0-bit7 of every original DWORD
    psrld   xmm_r_tmp, 8        ; shift toward low by 8 bits
    ; now XMM1 is (n000,n000,n000,n000), bit8-bit15 of every original DWORD
    psrld   xmm0, 8             ; shift toward low by 8 bits
    ; now XMM0 is (00n0,00n0,00n0,00n0), bit0-bit7 of every original DWORD
    por     xmm0, xmm_r_tmp     ; merge bit0-bit7 and bit8-bit15
    ; now XMM0 is (n0,n0,n0,n0,n0,n0,n0,n0): 8 WORDs,
    ; and the low 8 bits of every WORD are valid

    ; The third round transform, 8 WORD => 16 BYTE
    movdqa  xmm_r_tmp, xmm0
    psllw   xmm0, 12            ; shift toward high by 12 bits
    ; bit0-bit3 of every original WORD are in XMM0
    psrlw   xmm_r_tmp, 4        ; shift toward low by 4 bits
    ; bit4-bit7 of every original WORD are in XMM1
    psrlw   xmm0, 4             ; shift toward low by 4 bits
    ; bit0-bit3 of every original WORD are in XMM0
    por     xmm0, xmm_r_tmp     ; merge bit0-bit3 and bit4-bit7
    ; now XMM0 is (n,n,n,n,n,n,n,n,n,n,n,n,n,n,n,n): 16 BYTEs,
    ; and the value of every byte is in the range 0-15

    ; pre-load the constant arrays into XMM registers
    movdqa  xmm_r_db_str_0, xmmword ptr _i256_num_0s
    movdqa  xmm_r_db_str_9, xmmword ptr _i256_num_9s
    por     xmm0, xmm_r_db_str_0            ; byte[i] of XMM0 is '0' to '0'+15
    movdqa  xmm_r_tmp, xmm0                 ; byte[i] of xmm_r_tmp is '0' to '0'+15
    movdqa  xmm_r_db_7, xmmword ptr _i256_num_full7 ; xmm_r_db_7 is all 7s
    pcmpgtb xmm_r_tmp, xmm_r_db_str_9       ; byte[i] of xmm_r_tmp (0<=i<=15) is -1 if byte[i] > '9'
    pand    xmm_r_tmp, xmm_r_db_7           ; byte[i] of xmm_r_tmp (0<=i<=15) is 7 if byte[i] > '9'
    paddb   xmm0, xmm_r_tmp                 ; final result: byte[i] of XMM0 is '0'-'9' or 'A'-'F'
    movdqu  xmmword ptr [edx], xmm0
    mov     byte ptr [edx+16], 0
    ret     0
@uint64_2_hexstring_sse2@8 ENDP

_TEXT ENDS
END
I have left the comments in the assembly code in English rather than translating them into Chinese, and I will not explain the code further here: experts do not need the explanation, and beginners are unlikely to read it.
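For readers who would rather follow the algorithm in C, the same sequence of steps can be sketched with SSE2 intrinsics from <emmintrin.h>. This is an illustration of the technique and not the author's code: the function name u64_to_hex_sse2_sketch is hypothetical, it takes the value directly instead of a pointer, and it builds the three constants with _mm_set1_epi8 instead of loading the aligned arrays. It assumes an x86 or x86-64 compiler with SSE2 enabled.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical name: an intrinsics sketch of the steps in the assembly. */
void u64_to_hex_sse2_sketch(uint64_t value, char *buff)
{
    const __m128i zeros  = _mm_set1_epi8('0');
    const __m128i nines  = _mm_set1_epi8('9');
    const __m128i sevens = _mm_set1_epi8(7);
    __m128i x, t, gt;

    /* movq + pshufd: DWORD order becomes (high, 0, low, 0). */
    x = _mm_loadl_epi64((const __m128i *)&value);
    x = _mm_shuffle_epi32(x, 0xCD);             /* 11001101b */

    /* Round 1, 2 QWORD => 4 DWORD: high 16 bits first, then low 16. */
    t = _mm_srli_epi64(x, 16);
    x = _mm_srli_epi64(_mm_slli_epi64(x, 48), 16);
    x = _mm_or_si128(x, t);

    /* Round 2, 4 DWORD => 8 WORD: high byte first, then low byte. */
    t = _mm_srli_epi32(x, 8);
    x = _mm_srli_epi32(_mm_slli_epi32(x, 24), 8);
    x = _mm_or_si128(x, t);

    /* Round 3, 8 WORD => 16 BYTE: high nibble first, then low nibble. */
    t = _mm_srli_epi16(x, 4);
    x = _mm_srli_epi16(_mm_slli_epi16(x, 12), 4);
    x = _mm_or_si128(x, t);

    /* Every byte is 0..15: turn it into '0'..'9' or 'A'..'F'. */
    x  = _mm_or_si128(x, zeros);                /* '0' + n, since n < 16 */
    gt = _mm_cmpgt_epi8(x, nines);              /* 0xff where byte > '9' */
    x  = _mm_add_epi8(x, _mm_and_si128(gt, sevens));

    _mm_storeu_si128((__m128i *)buff, x);       /* unaligned 16-byte store */
    buff[16] = '\0';
}

/* Usage example. */
void u64_to_hex_sse2_check(void)
{
    char buf[17];
    u64_to_hex_sse2_sketch(0x0123456789abcdefULL, buf);
    assert(strcmp(buf, "0123456789ABCDEF") == 0);
}
```

Each round splits every lane in half and swaps the two halves, so after three rounds the most significant nibble sits in byte 0, which is exactly the order in which the string has to be written.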
All the code in the main program is given below.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
   This program demonstrates optimization techniques for converting a 64-bit
   integer to a hexadecimal string. It includes three C versions and one
   assembly version that uses SSE2 instructions.
   Tested on my computer (i7-4700HQ), the assembly version is 20 times as fast
   as the plain C version and 7.5 times as fast as the branch-elimination version.
   Baocheng Liang, completion date: 2015-7-30, all rights reserved.
*/

#define ARRAY_LEN  4096
#define LOOP_COUNT 2048

typedef unsigned long long UINT64;
typedef unsigned long      DWORD;
typedef unsigned short     WORD;
typedef unsigned char      BYTE;

UINT64 g_nums[ARRAY_LEN];
char   g_buff[ARRAY_LEN*16+16];

extern double currtime();   // high-precision timer
extern void __fastcall uint64_2_hexstring_sse2(UINT64 *p, char *buff); // assembly version using SSE2 instructions
extern void __fastcall uint64_2_hexstring_c1(UINT64 *p, char *buff);   // plain C version
extern void __fastcall uint64_2_hexstring_c2(UINT64 *p, char *buff);   // C version using 32-bit integers
extern void __fastcall uint64_2_hexstring_c3(UINT64 *p, char *buff);   // C version using branch elimination

typedef void (__fastcall *lpfn_uint64_2_hexstring)(UINT64 *p, char *buff);

// Plain C version: 64-bit logical and arithmetic operations
void __fastcall uint64_2_hexstring_c1(UINT64 *p, char *buff)
{
    UINT64 x = *p;
    int i;
    for (i = 15; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';
        if (c > '9') c += 7;
        buff[i] = c;
        x >>= 4;
    }
    buff[16] = 0;
}

// Improved C version: 32-bit logical and arithmetic operations
void __fastcall uint64_2_hexstring_c2(UINT64 *p, char *buff)
{
    DWORD *pdw = (DWORD *)p;
    DWORD x;
    int i;
    for (x = pdw[1], i = 7; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';
        if (c > '9') c += 7;
        buff[i] = c;
        x >>= 4;
    }
    for (x = pdw[0], i = 7; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';
        if (c > '9') c += 7;
        buff[8+i] = c;
        x >>= 4;
    }
    buff[16] = 0;
}

// C version using branch elimination; on an i7-4700HQ it is 2.6 times as fast as the previous version
void __fastcall uint64_2_hexstring_c3(UINT64 *p, char *buff)
{
    DWORD *pdw = (DWORD *)p;
    DWORD x;
    int i;
    for (x = pdw[1], i = 7; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';
        char mask = 0 - (c > '9');   // mask is 0xff when c > '9'
        buff[i] = c + (mask & 7);
        x >>= 4;
    }
    for (x = pdw[0], i = 7; i >= 0; i--)
    {
        char c = (x & 0xf) + '0';
        char mask = 0 - (c > '9');   // mask is 0xff when c > '9'
        buff[i+8] = c + (mask & 7);
        x >>= 4;
    }
    buff[16] = 0;
}

void test_uint64_2_hexstring(lpfn_uint64_2_hexstring fp, const char *funcName)
{
    UINT64 arr[] =
    {
        0x0123456789abcdef,
        0x02468ace13579bdf,
        0xaaaaaaaaaaaaaaaa,
        0xffffffffffffffff,
    };
    int i;
    char buff[17*4];

    memset(buff, 0, sizeof(buff));
    printf("Test function %s\n", funcName);
    for (i = 0; i < 4; i++)
    {
        fp(arr+i, buff+i*17);
        printf("%s\n", buff+i*17);
    }
}

void perf_uint64_2_hexstring(lpfn_uint64_2_hexstring fp, const char *funcName)
{
    int i, j;
    UINT64 x;
    char *p;
    double t;

    // initialize the global array g_nums
    for (i = 0; i < ARRAY_LEN; i++)
    {
        p = (char *)(&x);
        for (j = 0; j < 8; j++)
            p[j] = (rand() & 0xff);   // get a 64-bit random number
        g_nums[i] = x;
    }

    t = currtime();   // timing begins
    g_buff[0] = 0;
    for (i = 0; i < LOOP_COUNT; i++)
    {
        p = g_buff;
        for (j = 0; j < ARRAY_LEN; j++)
        {
            fp(g_nums+j, p);
            p += 16;
        }
    }
    t = (currtime() - t) * 1000000000;   // convert the time to nanoseconds
    printf("It takes %.2f ns to run function %s\n", t/(LOOP_COUNT*ARRAY_LEN), funcName);
    printf("strlen(buff) is %d\n", (int)strlen(g_buff));
}

// Functional test
void test_function()
{
    test_uint64_2_hexstring(uint64_2_hexstring_c1,   "uint64_2_hexstring_c1");
    test_uint64_2_hexstring(uint64_2_hexstring_c2,   "uint64_2_hexstring_c2");
    test_uint64_2_hexstring(uint64_2_hexstring_c3,   "uint64_2_hexstring_c3");
    test_uint64_2_hexstring(uint64_2_hexstring_sse2, "uint64_2_hexstring_sse2");
}

// Performance test
void perf_function()
{
    perf_uint64_2_hexstring(uint64_2_hexstring_c1,   "uint64_2_hexstring_c1");
    perf_uint64_2_hexstring(uint64_2_hexstring_c2,   "uint64_2_hexstring_c2");
    perf_uint64_2_hexstring(uint64_2_hexstring_c3,   "uint64_2_hexstring_c3");
    perf_uint64_2_hexstring(uint64_2_hexstring_sse2, "uint64_2_hexstring_sse2");
}

int main(int argc, char* argv[])
{
    test_function();
    perf_function();
    return 0;
}
Addendum:
This gives the code for the cross-platform timing function currtime.
#if defined(_WIN32)
#include <windows.h>

static LARGE_INTEGER freq;

static BOOL initFreq()
{
    BOOL ret;
    if (!QueryPerformanceFrequency(&freq))
    {
        ret = FALSE;
    }
    else
    {
        ret = TRUE;
    }
    return ret;
}

double currtime()   // use the high-precision timer
{
    LARGE_INTEGER performanceCount;
    BOOL result;
    double time = 0.0;
    BOOL bRet = TRUE;

    if (freq.QuadPart == 0)
    {
        bRet = initFreq();
    }
    if (bRet)
    {
        result = QueryPerformanceCounter(&performanceCount);
        time = performanceCount.HighPart * 4294967296.0 + performanceCount.LowPart;
        time = time / (freq.HighPart * 4294967296.0 + freq.LowPart);
    }
    return time;
}

#elif defined(__linux__)
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

double currtime()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)(tv.tv_sec) + (double)(tv.tv_usec) / 1000000.00;
}

#else
#error do not support this compiler
#endif
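The Linux branch above relies on gettimeofday, which reports wall-clock time with microsecond resolution. As a hedged alternative for POSIX systems, a timer with the same double-seconds interface can be sketched with clock_gettime and CLOCK_MONOTONIC, which has nanosecond resolution and is not affected by clock adjustments. The name currtime_monotonic is hypothetical and is not part of the original program.

```c
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict C modes */
#include <assert.h>
#include <time.h>

/* Hypothetical alternative to currtime() for POSIX systems:
   returns seconds as a double, read from the monotonic clock. */
double currtime_monotonic(void)
{
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0)
        return 0.0;              /* same fallback value as the Windows path */
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Usage example: the difference of two readings is an elapsed time. */
void currtime_monotonic_check(void)
{
    double t0 = currtime_monotonic();
    double t1 = currtime_monotonic();
    assert(t1 >= t0);
}
```

Since the benchmark averages over millions of calls, either timer is precise enough for the numbers reported here; the monotonic clock mainly protects against the system clock being adjusted during a run.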
About compiling: this program mixes C and assembly. The C source files can be added directly to a VC (Visual C++) project. The assembly file is assembled with ml.exe, the assembler that ships with VC.
Steps:
1. Add the assembly file to the VC project.
2. Select the file, right-click and choose Properties, then set Item Type to Custom Build Tool. In the Command Line field, enter "ml /coff /c %(FullPath)", and in the Outputs field, enter "%(Filename).obj;%(Outputs)".
Test results
Function | C1 | C2 | C3 | SSE2
Time (nanoseconds) | 67.95 | 52.71 | 20.16 | 3.18
Relative speed | 100% | 129% | 335% | 2,138%
Copyright notice: This is an original article by the blog author and may not be reproduced without the author's permission.
An effective assembly optimization example: using SSE2 instructions to optimize hexadecimal conversion