A computer's storage hierarchy runs from fast to slow: registers, L1 cache, L2 cache, main memory, disk, and media even slower than disk. Moving down the hierarchy, cost per byte falls and capacity grows. Because data must move between these levels, a running program inevitably copies data, and this is very important: from an optimization standpoint, much of the work amounts to reducing the cost of copying. For example, if the disk read/write pattern is known in advance, it is better to adopt direct I/O and transfer data straight into user memory, instead of reading into kernel memory and then copying to user memory. I plan to give this example in a future experiment. Of course, direct I/O effectively means the programmer implements the cache, and it offers no advantage for small-scale reads and writes, because the copying overhead saved is very small.
This article uses the elimination of cache line refills (a refill is also a copy) to show the benefit of reducing copying. First, some background knowledge:
The CPU interacts with memory through its on-chip cache (L1 cache). The unit of interaction is the cache line, whose size is a power of two, typically 32 or 64 bytes (by comparison, the virtual memory page size is typically 4 KB). Cache line traffic comes in two kinds: refill (filling a line from memory) and write-back (flushing a modified line to memory). Suppose the CPU needs to read a one-byte operand at virtual address 0xffffffa3 and the cache line is 32 bytes. The CPU then reads all 32 bytes starting at address 0xffffffa0, fills a complete cache line, and extracts the operand from the 4th byte of the line. If subsequent instructions also read from that cache line, the refill pays for itself; otherwise, the time spent filling the line is wasted. Therefore, the better the locality of code and data, the better they fit the cache line design.
We are all very familiar with the memset function. If we write our own memset, it will likely underperform the library implementation. Why? One of the major optimization techniques involved is called non-temporal stores. The basic idea: if the memory being written will not be read again soon, write the data directly to memory without refilling the cache line first. The instructions involved include movnti, movntdq, and sfence. In plain terms, when we memset a region of memory with some character (char), the region's original contents are obviously useless; there is no need to refill them into a cache line, overwrite the line, and then write it back to memory. The data only needs to be written straight to memory. The library memset uses general-purpose registers rather than SSE registers, via the movnti instruction; the library memset's code, dumped with the objdump command, is at the end of this article.
Why is a refill required at all? Couldn't it be skipped? Suppose we write 1 byte into a region of memory (assuming a 32-byte cache line) and the unit of memory interaction is the cache line. How would the system know what the other 31 bytes are when the line is written back? Without a refill, those 31 bytes would be undefined, so for small writes the refill is necessary for write-back to be correct. For writing large volumes of memory, however, the refill can be avoided. Intel's non-temporal stores serve exactly this purpose, and this is the copy-reduction optimization idea of this article.
A few other tricks in the code are not described in detail; if readers have questions, I will write a follow-up. See the code below.
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <iostream>

typedef unsigned char __attribute__((aligned(16))) fill_t[16];
using namespace std;

void naive_memset(void *page, unsigned char fill, size_t count)
{
    unsigned char *dst = (unsigned char *)page;
    unsigned char *end = dst + count;
    for (; dst < end;)
    {
        *dst++ = fill;
    }
}

void my_memset(void *page, unsigned char fill, size_t count)
{
    unsigned char *dst = (unsigned char *)page;
    fill_t dfill;
    for (size_t i = 0; i < 16;)
    {
        dfill[i++] = fill;
    }
    /* broadcast the 16-byte fill pattern into xmm0..xmm7 */
    __asm__ __volatile__(
        "movdqa (%0), %%xmm0\n"
        "movdqa %%xmm0, %%xmm1\n"
        "movdqa %%xmm0, %%xmm2\n"
        "movdqa %%xmm0, %%xmm3\n"
        "movdqa %%xmm0, %%xmm4\n"
        "movdqa %%xmm0, %%xmm5\n"
        "movdqa %%xmm0, %%xmm6\n"
        "movdqa %%xmm0, %%xmm7\n"
        :
        : "r"(dfill));
    /* scalar head: advance dst to a 16-byte boundary */
    while (((long)dst & 0xf) && (count > 0))
    {
        *dst++ = fill;
        count--;
    }
    size_t m_loop = count / 128;
    size_t r = count % 128;
    for (size_t i = 0; i < m_loop; ++i)
    {
        /* eight non-temporal 16-byte stores: 128 bytes per iteration,
           written straight to memory without refilling cache lines */
        __asm__(
            "movntdq %%xmm0, (%0)\n"
            "movntdq %%xmm1, 16(%0)\n"
            "movntdq %%xmm2, 32(%0)\n"
            "movntdq %%xmm3, 48(%0)\n"
            "movntdq %%xmm4, 64(%0)\n"
            "movntdq %%xmm5, 80(%0)\n"
            "movntdq %%xmm6, 96(%0)\n"
            "movntdq %%xmm7, 112(%0)\n"
            :
            : "r"(dst)
            : "memory");
        dst += 128;
    }
    /* scalar tail */
    for (size_t i = 0; i < r; ++i)
    {
        *dst++ = fill;
    }
    /* make the streaming stores globally visible */
    __asm__ __volatile__("sfence\n" : :);
}

#if defined(__i386__)
static __inline__ unsigned long rdtsc(void)
{
    unsigned long int x;
    __asm__ volatile("rdtsc" : "=A"(x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long)lo) | (((unsigned long)hi) << 32);
}
#endif

int main()
{
    const size_t s = 40 * 1024 * 1024;
    void *p = malloc(s);
    memset(p, 0x0, s);
    unsigned long start = rdtsc();
    unsigned char *q = (unsigned char *)p;
#ifdef _naive_mem
    naive_memset(p, 0x1, s);
#endif
#ifdef _my_mem
    my_memset(p, 0x1, s);
#endif
#ifdef _mem
    memset(p, 0x1, s);
#endif
    int sum = 0;
    for (size_t i = 0; i < s - 1; ++i)
    {
        sum += *q;
        ++q;
    }
    cout << "sum:" << sum << endl;
    cout << "run time:" << rdtsc() - start << endl;
    free(p);
    return 0;
}
The disassembly of glibc's library memset:
0000003fb6279540 <memset>:
3fb6279540: 48 83 fa 07             cmp    $0x7,%rdx
3fb6279544: 48 89 f9                mov    %rdi,%rcx
3fb6279547: 0f 86 96 00 00 00       jbe    3fb62795e3 <memset+0xa3>
3fb627954d: 49 b8 01 01 01 01 01    mov    $0x101010101010101,%r8
3fb6279554: 01 01 01
3fb6279557: 40 0f b6 c6             movzbl %sil,%eax
3fb627955b: 4c 0f af c0             imul   %rax,%r8
3fb627955f: f7 c7 07 00 00 00       test   $0x7,%edi
3fb6279565: 74 1a                   je     3fb6279581 <memset+0x41>
3fb6279567: 66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
3fb627956e: 00 00
3fb6279570: 40 88 31                mov    %sil,(%rcx)
3fb6279573: 48 ff ca                dec    %rdx
3fb6279576: 48 ff c1                inc    %rcx
3fb6279579: f7 c1 07 00 00 00       test   $0x7,%ecx
3fb627957f: 75 ef                   jne    3fb6279570 <memset+0x30>
3fb6279581: 48 89 d0                mov    %rdx,%rax
3fb6279584: 48 c1 e8 06             shr    $0x6,%rax
3fb6279588: 74 3e                   je     3fb62795c8 <memset+0x88>
3fb627958a: 48 81 fa c0 d4 01 00    cmp    $0x1d4c0,%rdx
3fb6279591: 73 6d                   jae    3fb6279600 <memset+0xc0>
3fb6279593: 66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
3fb6279599: 0f 1f 80 00 00 00 00    nopl   0x0(%rax)
3fb62795a0: 4c 89 01                mov    %r8,(%rcx)
3fb62795a3: 4c 89 41 08             mov    %r8,0x8(%rcx)
3fb62795a7: 4c 89 41 10             mov    %r8,0x10(%rcx)
3fb62795ab: 4c 89 41 18             mov    %r8,0x18(%rcx)
3fb62795af: 4c 89 41 20             mov    %r8,0x20(%rcx)
3fb62795b3: 4c 89 41 28             mov    %r8,0x28(%rcx)
3fb62795b7: 4c 89 41 30             mov    %r8,0x30(%rcx)
3fb62795bb: 4c 89 41 38             mov    %r8,0x38(%rcx)
3fb62795bf: 48 83 c1 40             add    $0x40,%rcx
3fb62795c3: 48 ff c8                dec    %rax
3fb62795c6: 75 d8                   jne    3fb62795a0 <memset+0x60>
3fb62795c8: 83 e2 3f                and    $0x3f,%edx
3fb62795cb: 48 89 d0                mov    %rdx,%rax
3fb62795ce: 48 c1 e8 03             shr    $0x3,%rax
3fb62795d2: 74 0c                   je     3fb62795e0 <memset+0xa0>
3fb62795d4: 4c 89 01                mov    %r8,(%rcx)
3fb62795d7: 48 83 c1 08             add    $0x8,%rcx
3fb62795db: 48 ff c8                dec    %rax
3fb62795de: 75 f4                   jne    3fb62795d4 <memset+0x94>
3fb62795e0: 83 e2 07                and    $0x7,%edx
3fb62795e3: 48 85 d2                test   %rdx,%rdx
3fb62795e6: 74 0b                   je     3fb62795f3 <memset+0xb3>
3fb62795e8: 40 88 31                mov    %sil,(%rcx)
3fb62795eb: 48 ff c1                inc    %rcx
3fb62795ee: 48 ff ca                dec    %rdx
3fb62795f1: 75 f5                   jne    3fb62795e8 <memset+0xa8>
3fb62795f3: 48 89 f8                mov    %rdi,%rax
3fb62795f6: c3                      retq
3fb62795f7: 66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
3fb62795fe: 00 00
3fb6279600: 4c 0f c3 01             movnti %r8,(%rcx)
3fb6279604: 4c 0f c3 41 08          movnti %r8,0x8(%rcx)
3fb6279609: 4c 0f c3 41 10          movnti %r8,0x10(%rcx)
3fb627960e: 4c 0f c3 41 18          movnti %r8,0x18(%rcx)
3fb6279613: 4c 0f c3 41 20          movnti %r8,0x20(%rcx)
3fb6279618: 4c 0f c3 41 28          movnti %r8,0x28(%rcx)
3fb627961d: 4c 0f c3 41 30          movnti %r8,0x30(%rcx)
3fb6279622: 4c 0f c3 41 38          movnti %r8,0x38(%rcx)
3fb6279627: 48 83 c1 40             add    $0x40,%rcx
3fb627962b: 48 ff c8                dec    %rax
3fb627962e: 75 d0                   jne    3fb6279600 <memset+0xc0>
3fb6279630: 0f ae f8                sfence
3fb6279633: eb 93                   jmp    3fb62795c8 <memset+0x88>
3fb6279635: 90                      nop
3fb6279636: 90                      nop
3fb6279637: 90                      nop
3fb6279638: 90                      nop
3fb6279639: 90                      nop
3fb627963a: 90                      nop
3fb627963b: 90                      nop
3fb627963c: 90                      nop
3fb627963d: 90                      nop
3fb627963e: 90                      nop
3fb627963f: 90                      nop
For more information, see the related posts on this blog:
http://blog.csdn.net/pennyliang/archive/2011/01/18/6151062.aspx
http://blog.csdn.net/pennyliang/archive/2011/01/20/6154929.aspx