[Reveal the Intel module of vc crt library]-strlen

Source: Internet
Author: User

Since this is the first article after the series opening, let's start with a simple, practical function to build confidence, and then work up to more complex ones step by step so that nobody gives up early.

I still remember that, as a C beginner, string functions left a deep impression on me. Exams of all kinds asked for implementations of functions such as strlen and strcpy, and the written tests I took when looking for a job after graduation also asked for strlen, strcpy, and similar functions. Clearly, string functions are favorites of both teachers and companies. So this article looks at strlen.

Maybe you are already scoffing: is this really worth studying? I could write it in an instant. And so you write this code:

    int strlen( const char* str )
    {
        int length = 0;
        while ( *str++ )
            ++length;
        return ( length );
    }

Wow! You wrote this concise, polished strlen in an instant, and you passed both the C exam and the company's written test. Congratulations. But if the problem is solved this quickly, how can the article go on? Take another look at the strlen you dashed off. It looks perfect, and it is exactly what Microsoft's engineers wrote: just a few lines of code. So why does it work, and is there a better solution? Given another chance, you come up with this one:

    int strlen( const char* str )
    {
        const char* ptr = str;
        while ( *str++ )
            ;
        return ( str - ptr - 1 );
    }

Short code is not necessarily optimal (leaving software-engineering concerns aside). Here str++ advances one byte at a time, so the time complexity is O(n), and strlen is easily done this way. What would a better solution look like? Imagine hopping several bytes at a time: that would not reduce the complexity, but wouldn't it find the length faster? Wait and see.

This series analyzes the functions in the Intel module of the CRT library, so let's see whether it contains a strlen implementation. It does, in VC/CRT/src/Intel/strlen.asm. Open it and the first impression is a little dizzying. The most eye-catching part is the leading comment, in which Microsoft's engineers wrote a "comment version" of strlen that is exactly the same as the one you implemented earlier. It is only a comment, however, and is not compiled into the program. The actual assembly implementation is as follows:

            CODESEG

            public  strlen

    strlen  proc \
            buf:ptr byte

            OPTION PROLOGUE:NONE, EPILOGUE:NONE

            .FPO    ( 0, 1, 0, 0, 0, 0 )

    string  equ     [esp + 4]

            mov     ecx,string              ; ecx -> string
            test    ecx,3                   ; test if string is aligned on 32 bits
            je      short main_loop

    str_misaligned:
            ; simple byte loop until string is aligned
            mov     al,byte ptr [ecx]
            add     ecx,1
            test    al,al
            je      short byte_3
            test    ecx,3
            jne     short str_misaligned

            add     eax,dword ptr 0         ; 5 byte nop to align label below

            align   16                      ; should be redundant

    main_loop:
            mov     eax,dword ptr [ecx]     ; read 4 bytes
            mov     edx,7efefeffh
            add     edx,eax
            xor     eax,-1
            xor     eax,edx
            add     ecx,4
            test    eax,81010100h
            je      short main_loop
            ; found zero byte in the loop
            mov     eax,[ecx - 4]
            test    al,al                   ; is it byte 0
            je      short byte_0
            test    ah,ah                   ; is it byte 1
            je      short byte_1
            test    eax,00ff0000h           ; is it byte 2
            je      short byte_2
            test    eax,0ff000000h          ; is it byte 3
            je      short byte_3
            jmp     short main_loop         ; taken if bits 24-30 are clear and bit
                                            ; 31 is set

    byte_3:
            lea     eax,[ecx - 1]
            mov     ecx,string
            sub     eax,ecx
            ret
    byte_2:
            lea     eax,[ecx - 2]
            mov     ecx,string
            sub     eax,ecx
            ret
    byte_1:
            lea     eax,[ecx - 3]
            mov     ecx,string
            sub     eax,ecx
            ret
    byte_0:
            lea     eax,[ecx - 4]
            mov     ecx,string
            sub     eax,ecx
            ret

    strlen  endp

            end

Let's look at the assembly code of the main part.

First, it declares the public symbol strlen and the function's parameter. The OPTION line prevents the assembler from generating prologue and epilogue code (see the relevant documentation for details). .FPO relates to frame pointer omission. MSDN describes it as follows:

.FPO (cdwLocals, cdwParams, cbProlog, cbRegs, fUseBP, cbFrame)

cdwLocals: Number of local variables, an unsigned 32 bit value.

cdwParams: Size of the parameters in DWORDs, an unsigned 16 bit value.

cbProlog: Number of bytes in the function prolog code, an unsigned 8 bit value.

cbRegs: Number of registers saved.

fUseBP: Indicates whether the EBP register has been allocated. Either 0 or 1.

cbFrame: Indicates the frame type.

Here you only need to pay attention to the second parameter, which is 1, meaning there is one parameter; strlen indeed takes exactly one. The other parameters should be clear from the English descriptions above, so I will not explain them here.

Now look at the next few lines:

    string  equ     [esp + 4]

            mov     ecx,string              ; ecx -> string
            test    ecx,3                   ; test if string is aligned on 32 bits
            je      short main_loop

First, esp + 4. I will explain it in detail in the alloca article of this series ([dynamic stack memory allocation]); in short, esp + 4 is the address of strlen's parameter. That address lies in stack memory, and reading [esp + 4] yields the value of the parameter, which is the address the parameter points to (the parameter is a const char*). Suppose the calling code is:

    char szName[] = "masefee";
    strlen( szName );

Then the value read from [esp + 4] is the start address of the szName array. The line string equ [esp + 4] generates no code; string acts like a macro definition (why it is needed will become clear later; everything here has a reason, which is part of the fun of this kind of study). mov ecx,string is therefore equivalent to mov ecx,[esp + 4]: it loads the address held in the parameter into ecx, so ecx now holds the start address of the string. The next instruction, test ecx,3, checks whether the address in ecx is 4-byte (32-bit) aligned. If it is, execution jumps to main_loop; otherwise it falls through. Let's look at the unaligned case first, which is the str_misaligned section that follows:

    str_misaligned:
            mov     al,byte ptr [ecx]
            add     ecx,1
            test    al,al
            je      short byte_3
            test    ecx,3
            jne     short str_misaligned

            add     eax,dword ptr 0         ; 5 byte nop to align label below

            align   16                      ; should be redundant

Before reading this code, let's reason about it. Memory handed out by the operating system is always aligned, so when would strlen receive an unaligned address? For example:

    char szName[] = "masefee";
    char* p = szName;
    p++;            // szName is assumed to be 4-byte aligned; after moving one byte, p no longer is
    strlen( p );

Of course, I wrote that deliberately. Other cases arise in practice: for example, a struct packed to 1-byte alignment contains a string whose offset is not known in advance, so the string's start address may not be 4-byte aligned. Continuing the reasoning: if the address is not aligned, the code should first advance to an aligned address and then continue measuring the length; if it meets the terminator while aligning, it should stop and return the length immediately. That completes the reasoning; now let's read the assembly above.

First, it loads one byte from the memory ecx points to into al, adds 1 to ecx to move one byte forward, and tests whether al is 0. If it is 0, it jumps to byte_3 (which computes ecx - 1, exactly the terminator's address, because ecx has already been incremented). Otherwise it tests whether the address now in ecx is aligned. If not, it loops, fetching another byte and incrementing ecx until the address is aligned or the terminator is reached. If no terminator was met and the address in ecx is aligned, execution reaches add eax,dword ptr 0, whose comment says it is a 5-byte nop with no practical effect. Together with align 16 it pads the code so that main_loop starts at a 16-byte-aligned address (once again, the care of Microsoft's engineers shows).
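The alignment prologue described above can be sketched in C. This is only an illustration under stated assumptions, not the CRT's code: the helper names is_aligned4 and scan_until_aligned are hypothetical, and the pointer is converted to uintptr_t to inspect its low two bits, which mirrors test ecx,3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p is 4-byte aligned (the C analogue of "test ecx,3"). */
static int is_aligned4(const char *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/*
 * Hypothetical helper mirroring str_misaligned: walk byte by byte until the
 * pointer is 4-byte aligned or the terminator is met.
 * Returns 1 and stores the length in *len if the terminator was found first;
 * returns 0 and stores the aligned pointer in *aligned otherwise.
 */
static int scan_until_aligned(const char *str, const char **aligned, size_t *len)
{
    const char *p = str;

    assert(str != NULL && aligned != NULL && len != NULL);
    while (!is_aligned4(p)) {
        if (*p == '\0') {               /* terminator met before alignment */
            *len = (size_t)(p - str);
            return 1;
        }
        ++p;                            /* add ecx,1 */
    }
    *aligned = p;                       /* the 4-byte loop can start here */
    return 0;
}
```

For a string that starts one byte past an aligned address, the walk inspects at most three bytes before the word-at-a-time loop takes over.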

Next comes main_loop which, as the name says, is the main loop and the core of strlen. It uses a clever algorithm. Let's analyze the first half of the code:

            mov     eax,dword ptr [ecx]     ; read 4 bytes
            mov     edx,7efefeffh
            add     edx,eax
            xor     eax,-1
            xor     eax,edx
            add     ecx,4
            test    eax,81010100h
            je      short main_loop

The first instruction reads the four bytes at the address in ecx into eax, so the intent is clearly to process four bytes at a time. The second instruction sets edx to 0x7efefeff. What is special about this number? Look at it in binary:

01111110 11111110 11111110 11111111
In this binary form, notice the four 0 bits: bits 8, 16, and 24, each sitting just to the left of a byte boundary, and bit 31 at the very top. What are they for, and when does such a bit change? Evidently when a carry arrives from the right, or when the bit is combined with a 1 in the other operand. Set that aside for a moment and look at the next instruction, add edx,eax. It adds the four-byte integer read from the memory at ecx to 0x7efefeff. What is the point of this strange addition? On reflection it is surprisingly clever: it reveals whether any of the four bytes is 0, and finding a 0 byte is exactly what strlen is for, since it locates the terminator and then returns the length.
Let's look at the addition itself. Its purpose is to let carries change the four 0 bits above. If any of them fails to change, one or more of the four bytes is 0. These 0 bits can vividly be called holes. For example:

      byte3    byte2    byte1    byte0

      ???????? 00000000 ???????? ????????   // eax

    + 01111110 11111110 11111110 11111111   // edx = 0x7efefeff

The example adds these two numbers. Each question mark is a 0 or a 1, but no question-mark byte is all zeros. byte2 of eax is all zeros. When it is added to byte2 of edx (0xfe), the carry coming in from byte1 and byte0 is at most 1, so byte2 sums to at most 0xff and sends no carry on to byte3: the hole at the bottom of byte3 (bit 24) does not change. Likewise, if byte0 is 0, the hole at the bottom of byte1 (bit 8) cannot change. Only when byte0 is nonzero does byte1 receive a carry, which is why byte0 of edx is 0xff. Every byte is judged by its carry: if a byte sends no carry to its left, that byte must be 0.
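The carry argument can be checked with a small C sketch. The helper name holes_hit_by_carry and the macro names MAGIC and HOLES are hypothetical, introduced only for illustration. Because 0x7efefeff has a 0 at every hole, the sum bit at a hole equals the original bit XOR the incoming carry, so XORing the sum with the original word leaves exactly the carry at each hole.

```c
#include <assert.h>
#include <stdint.h>

#define MAGIC 0x7efefeffu   /* the constant loaded into edx */
#define HOLES 0x81010100u   /* 1 at each hole position: bits 8, 16, 24, 31 */

/*
 * Hypothetical helper: returns a mask of the holes that received a carry
 * when MAGIC is added to word.
 */
static uint32_t holes_hit_by_carry(uint32_t word)
{
    uint32_t sum;

    assert((MAGIC ^ HOLES) == 0xffffffffu);  /* HOLES is the inverse of MAGIC */
    sum = word + MAGIC;                      /* wraps modulo 2^32, like add edx,eax */
    return (word ^ sum) & HOLES;             /* at a hole: changed bit == carry in */
}
```

With word = 0x41424344 (no zero byte) the result is 0x81010100, so every hole received a carry. With word = 0x41004242 (byte2 is zero) the result is 0x80010100: bit 24, the hole just left of byte2, received none.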

Continuing down, xor eax,-1 inverts eax (the 4 bytes read from the memory at ecx). Then xor eax,edx XORs the inverted value with the sum (the value of edx after add edx,eax); the result has a 1 in every bit position where the sum equals the original value, that is, in every bit the addition did not change. Next, add ecx,4 moves ecx forward 4 bytes for the next iteration. Finally comes test eax,81010100h. 0x81010100 is the bitwise inverse of 0x7efefeff, so it has a 1 at each hole position, and ANDing it with the unchanged-bit mask keeps only the holes. If the result is 0, every hole was changed by the addition, so none of the four bytes is 0 and the loop continues. If the result is nonzero, at least one hole was left unchanged, which signals a 0 byte among the four.
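The whole test can be written as one C expression. The following is a minimal sketch; the function name may_contain_zero_byte and the macro names are hypothetical, and each line is annotated with the instruction it mirrors.

```c
#include <assert.h>
#include <stdint.h>

#define MAGIC 0x7efefeffu   /* mov edx,7efefeffh */
#define HOLES 0x81010100u   /* test eax,81010100h */

/*
 * Hypothetical helper: nonzero when the main loop would stop on this word,
 * i.e. when at least one hole was left unchanged by the addition.
 */
static int may_contain_zero_byte(uint32_t word)
{
    uint32_t sum, unchanged;

    assert((MAGIC ^ HOLES) == 0xffffffffu);  /* HOLES is the inverse of MAGIC */
    sum = word + MAGIC;                      /* add edx,eax */
    unchanged = ~word ^ sum;                 /* xor eax,-1 ; xor eax,edx */
    return (unchanged & HOLES) != 0;         /* test eax,81010100h */
}
```

The name says "may" on purpose: a word such as 0x80424344, whose top byte is 0x80, also leaves the bit 31 hole unchanged although no byte is zero. This is the case the assembly comment describes as "bits 24-30 are clear and bit 31 is set".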

Once a 0 byte is signaled, the code determines which of the 4 bytes it is:

            mov     eax,[ecx - 4]
            test    al,al                   ; is it byte 0
            je      short byte_0
            test    ah,ah                   ; is it byte 1
            je      short byte_1
            test    eax,00ff0000h           ; is it byte 2
            je      short byte_2
            test    eax,0ff000000h          ; is it byte 3
            je      short byte_3
            jmp     short main_loop         ; taken if bits 24-30 are clear and bit
                                            ; 31 is set

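The load from [ecx - 4] is needed because ecx was already advanced by 4, so subtracting 4 gets back to the start of these 4 bytes. The code then checks each byte in turn to find the one that is 0; it is simple, so I won't go through it in detail. If none of the four bytes is 0, which per the comment happens when bits 24-30 are clear and bit 31 is set, the final jmp returns to main_loop. Once a 0 byte is found, the code jumps to the corresponding tail section, as shown below: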

    byte_3:
            lea     eax,[ecx - 1]
            mov     ecx,string
            sub     eax,ecx
            ret
    byte_2:
            lea     eax,[ecx - 2]
            mov     ecx,string
            sub     eax,ecx
            ret
    byte_1:
            lea     eax,[ecx - 3]
            mov     ecx,string
            sub     eax,ecx
            ret
    byte_0:
            lea     eax,[ecx - 4]
            mov     ecx,string
            sub     eax,ecx
            ret

Take byte_3 as an example: the 4th of the four bytes is 0 and the first three are not, so eax is set to ecx - 1, the address of the terminator. ecx is then reloaded with the start address of the string (now you can see why the string macro exists). Finally, sub eax,ecx yields the length of the string, and ret returns to the caller. That is the whole of strlen.
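The byte checks and the tail arithmetic can be sketched together in C. Both helper names, first_zero_byte and length_from_tail, are hypothetical. The sketch treats the loaded word as a value, assuming the little-endian layout of x86, where byte 0 is the byte at the lowest address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: index (0..3) of the lowest zero byte in word, or -1
 * when no byte is zero (the false alarm that jumps back to main_loop).
 */
static int first_zero_byte(uint32_t word)
{
    if ((word & 0x000000ffu) == 0) return 0;   /* test al,al          */
    if ((word & 0x0000ff00u) == 0) return 1;   /* test ah,ah          */
    if ((word & 0x00ff0000u) == 0) return 2;   /* test eax,00ff0000h  */
    if ((word & 0xff000000u) == 0) return 3;   /* test eax,0ff000000h */
    return -1;                                 /* jmp short main_loop */
}

/*
 * Hypothetical helper: the length as the byte_0..byte_3 tails compute it.
 * start      - start of the string (the "string" macro)
 * next       - the value of ecx after "add ecx,4"
 * zero_index - which of the four bytes was zero
 */
static size_t length_from_tail(const char *start, const char *next, int zero_index)
{
    assert(start != NULL && next != NULL);
    assert(zero_index >= 0 && zero_index <= 3);
    /* byte_k: lea eax,[ecx - (4 - k)] ; mov ecx,string ; sub eax,ecx */
    return (size_t)((next - (4 - zero_index)) - start);
}
```

For "masefee" stored at an aligned address, the terminator is byte 3 of the second word, next points 8 bytes past the start, and the result is (8 - 1) - 0 = 7.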

Through the analysis above we have learned how strlen works and gained a better appreciation of the beauty of the algorithm. The assembly version can be translated into C as follows:

    size_t strlen( const char* str )
    {
        const char* ptr = str;

        // byte loop until ptr is 4-byte aligned
        for ( ; ( (size_t)ptr & 0x03 ) != 0; ++ptr )
        {
            if ( *ptr == '\0' )
                return ptr - str;
        }

        unsigned int* ptr_d = (unsigned int*)ptr;
        unsigned int magic = 0x7efefeff;

        for ( ;; )
        {
            unsigned int bits32 = *ptr_d++;

            // bits32 ^ -1 is equivalent to ~bits32
            if ( ( ( ( bits32 + magic ) ^ ( bits32 ^ -1 ) ) & ~magic ) != 0 )
            {
                ptr = (const char*)( ptr_d - 1 );
                if ( ptr[ 0 ] == 0 ) return ptr - str;
                if ( ptr[ 1 ] == 0 ) return ptr - str + 1;
                if ( ptr[ 2 ] == 0 ) return ptr - str + 2;
                if ( ptr[ 3 ] == 0 ) return ptr - str + 3;
            }
        }
    }

That nearly completes the analysis of strlen. The final C version can still be adapted, for example specialized for a particular character set, but that is generally unnecessary; the general form is the better choice. I ran a test comparing the plain C version from the beginning of this article, the final C version, and the CRT assembly version, computing the length of the same string 10,000,000 times with /O2 optimization enabled. The average times for the three were:

Plain C version: 723 ms

Final C version: 315 ms

CRT assembly version: 218 ms
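The article does not show its test harness, so the following is only a sketch of how such a comparison might be timed. The names strlen_fn, byte_strlen, and time_calls_ms are hypothetical, and clock() measures processor time with limited resolution, so any numbers it produces are rough.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef size_t (*strlen_fn)(const char *);

/* The byte-at-a-time version from the beginning of the article,
   renamed so that it does not clash with the CRT's strlen. */
static size_t byte_strlen(const char *str)
{
    const char *ptr = str;
    while (*str++)
        ;
    return (size_t)(str - ptr - 1);
}

/* Hypothetical helper: milliseconds of processor time spent calling fn(s)
   the given number of times. */
static double time_calls_ms(strlen_fn fn, const char *s, long iterations)
{
    volatile size_t sink = 0;   /* discourages the compiler from dropping the calls */
    clock_t begin, end;
    long i;

    assert(fn != NULL && s != NULL && iterations >= 0);
    begin = clock();
    for (i = 0; i < iterations; ++i)
        sink = sink + fn(s);
    end = clock();
    (void)sink;
    return 1000.0 * (double)(end - begin) / CLOCKS_PER_SEC;
}
```

A call such as time_calls_ms(byte_strlen, text, 10000000L) would time one candidate; an optimizing compiler may still simplify the loop, so results depend heavily on compiler and flags.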
The timings show that the latter two versions are clearly faster. Note that the CRT's strlen is an intrinsic function. An intrinsic (also called a built-in function) is somewhat like an inline function, but not the same: inlining is only a request, and compilers treat it differently. For an intrinsic, the compiler decides from context and other conditions at compile time whether to expand the function inline at the assembly level, optimizing the code as it does so; this saves the function-call overhead and makes the optimization more direct. Because the compiler knows exactly what an intrinsic does, it can integrate and optimize it better, with a single goal: choosing the best solution for the specific situation. Take strlen as an example:

    int main( int argc, char** argv )
    {
        int len = strlen( argv[ 0 ] );
        printf( "%d", len );
        return 0;
    }

With optimization disabled in Debug, or in Release with optimization disabled or set to Minimize Size (/O1), you can force-enable the intrinsic functions option (/Oi). Once it is enabled, strlen no longer calls the CRT assembly version but is embedded directly into the code of main, as shown below (Debug or Release with optimization disabled and intrinsic functions (/Oi) enabled):

    int len = strlen( argv[ 0 ] );
    0042d8de  mov   eax,dword ptr [argv]
    0042d8e1  mov   ecx,dword ptr [eax]
    0042d8e3  mov   dword ptr [ebp-0D0h],ecx
    0042d8e9  mov   edx,dword ptr [ebp-0D0h]
    0042d8ef  add   edx,1
    0042d8f2  mov   dword ptr [ebp-0D4h],edx
    0042d8f8  mov   eax,dword ptr [ebp-0D0h]   <------
    0042d8fe  mov   cl,byte ptr [eax]                |
    0042d900  mov   byte ptr [ebp-0D5h],cl           |  // calculate byte by byte
    0042d906  add   dword ptr [ebp-0D0h],1           |
    0042d90d  cmp   byte ptr [ebp-0D5h],0            |
    0042d914  jne   main+38h (42d8f8h)         ------
    0042d916  mov   edx,dword ptr [ebp-0D0h]
    0042d91c  sub   edx,dword ptr [ebp-0D4h]
    0042d922  mov   dword ptr [ebp-0DCh],edx
    0042d928  mov   eax,dword ptr [ebp-0DCh]
    0042d92e  mov   dword ptr [len],eax

With Release, Minimize Size (/O1), and intrinsic functions (/Oi) enabled, the compiled code is as follows:

    int len = strlen( argv[ 0 ] );
    00401000  mov   eax,dword ptr [esp+8]
    00401004  mov   eax,dword ptr [eax]
    00401006  lea   edx,[eax+1]
    00401009  mov   cl,byte ptr [eax]    <------
    0040100b  inc   eax                        |  // calculate byte by byte
    0040100c  test  cl,cl                      |
    0040100e  jne   main+9 (401009h)     ------
    00401010  sub   eax,edx

The code is much more concise and has no function-call overhead. (In fact, you may be surprised to find that it is the disassembly of the second C version of strlen from the beginning of this article, in optimized form and without the call overhead. The two strlen versions at the beginning of this article are likewise optimized and inlined by the compiler at higher optimization levels, just like the intrinsic. The compiler is accommodating: whenever the conditions for an optimization are met, it applies it.) Release with Minimize Size (/O1) plus intrinsic functions (/Oi) generates the same code as Release with Maximize Speed (/O2) or Full Optimization (/Ox). With /O2 or /Ox, the compiler generates the code above for strlen even if intrinsic functions (/Oi) is not enabled. This follows from the optimization level: at a high level the optimization is naturally more comprehensive, whatever else you set, which is again a user-friendly design.
To enable intrinsic expansion for a particular function in code, use:

#pragma intrinsic( strlen )

To disable it:

#pragma function( strlen )

This disables the intrinsic expansion of strlen: even with Maximize Speed (/O2) or Full Optimization (/Ox), the CRT's strlen function is called. For more information, see MSDN.
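As a small sketch of how the two pragmas can be combined in one source file: the wrapper names length_via_crt_call and length_via_intrinsic are hypothetical, and the pragmas are wrapped in _MSC_VER checks as an assumption, so that other compilers simply skip them. Each pragma applies from the next function definition onward until the opposite pragma for the same function appears.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef _MSC_VER
#pragma function( strlen )      /* MSVC only: from here on, call the CRT strlen */
#endif

/* Hypothetical wrapper: under MSVC this compiles to a real call to strlen. */
static size_t length_via_crt_call(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

#ifdef _MSC_VER
#pragma intrinsic( strlen )     /* MSVC only: allow inline expansion again */
#endif

/* Hypothetical wrapper: under MSVC the compiler may expand strlen inline here. */
static size_t length_via_intrinsic(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}
```

Both wrappers return the same value; only the generated code differs, which can be confirmed in the disassembly window.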

MSDN explains the intrinsic pragma in detail and precisely; the original English conveys the intent best:

The intrinsic pragma tells the compiler that a function has known behavior. The compiler may call the function and not replace the function call with inline instructions, if it will result in better performance.
.........

Programs that use intrinsic functions are faster because they do not have the overhead of function calls but may be larger due to the additional code generated.

By the way, don't try to use these two pragmas to force (/Oi)-style optimization on or off for an ordinary function. Intrinsics are, of course, functions defined inside the compiler, which is also what lets it optimize their details. If you don't believe me, try it and you will get a warning:

warning C4163: 'XXXXX': not available as an intrinsic function

The compiler handles intrinsic optimization flexibly, meaning nothing is forced. If SSE is enabled, the compiler will also consider SSE optimizations. Knowing the principle is enough here; the point of this article is how to dig into and think through the details. For specifics about particular functions, refer to MSDN. I will not go into more detail here; this article has already run long.

Once again I am struck by the attention to detail of Microsoft's engineers. That is worth pondering for coders in the restless environment of the domestic IT industry.

That is the end of this article. Corrections are welcome. Thanks!

* **************** If You Need To reprint it, please indicate the source: Region **********************

[Secrets vc crt library intel module] Directory
[Revealing the Intel module of vc crt library] -- Opening
[Reveal the Intel module of vc crt library] -- strlen
