Continued from the previous post: http://blog.csdn.net/pennyliang/archive/2011/03/08/6231709.aspx
A copy built on a complex instruction such as rep movsq leaves no room for optimization in your own code; any speedup depends entirely on Intel's engineers. With simple instructions and an unrolled loop, more techniques become available, including prefetching and non-temporal (NT) stores. The previous article did not say much about the non-temporal family. The code here also uses the registers from r8 upward, which were added with 64-bit mode and are commonly used for streaming data like this.
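As a rough illustration of these techniques in C rather than assembly, here is a minimal sketch that combines an unrolled loop, a prefetch hint, and non-temporal stores. It is not the author's code: the function name copy_unrolled_nt is hypothetical, and the sketch assumes an x86-64 compiler that provides the _mm_prefetch, _mm_stream_si64, and _mm_sfence intrinsics through <immintrin.h>. On other targets it falls back to memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <immintrin.h>   /* _mm_prefetch, _mm_stream_si64, _mm_sfence */
#endif

/* Hypothetical helper (not from the original post): copies qword_cnt
 * 8-byte words from src to dst. Assumes qword_cnt is a multiple of 8
 * and that both pointers are 8-byte aligned. */
static void copy_unrolled_nt(void *dst, const void *src, size_t qword_cnt)
{
    assert(qword_cnt % 8 == 0);
#if defined(__x86_64__)
    long long *d = (long long *)dst;
    const long long *s = (const long long *)src;
    for (size_t i = 0; i < qword_cnt; i += 8) {
        /* Hint: start fetching the data 1024 bytes ahead of the read position.
         * A prefetch never faults, even past the end of the buffer. */
        _mm_prefetch((const char *)((uintptr_t)(s + i) + 1024), _MM_HINT_T0);
        /* Unrolled body: eight non-temporal 8-byte stores per iteration. */
        _mm_stream_si64(d + i + 0, s[i + 0]);
        _mm_stream_si64(d + i + 1, s[i + 1]);
        _mm_stream_si64(d + i + 2, s[i + 2]);
        _mm_stream_si64(d + i + 3, s[i + 3]);
        _mm_stream_si64(d + i + 4, s[i + 4]);
        _mm_stream_si64(d + i + 5, s[i + 5]);
        _mm_stream_si64(d + i + 6, s[i + 6]);
        _mm_stream_si64(d + i + 7, s[i + 7]);
    }
    /* Non-temporal stores are weakly ordered; fence before the data is read. */
    _mm_sfence();
#else
    memcpy(dst, src, qword_cnt * 8);   /* portable fallback, no NT stores */
#endif
}
```

Whether the compiler keeps the eight loads in separate registers is up to its register allocator, which is one reason the post writes the loop by hand in assembly.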
This series is aimed mainly at students. I hope they will run more experiments with real code and improve their coding skills along the way. Most textbooks on computer architecture and related subjects do not provide executable source code, and without systematic experiments it is difficult to reach a deep understanding.
The book "Computer Systems: A Programmer's Perspective" is good, but many of its examples are not executable programs. Most people forget the material soon after reading it and find it hard to form a systematic understanding. Careful study backed by systematic experiments is what this series of blog posts aims for. If it helps some people, that would be very good.
The posts that follow cover the SSE instruction set, XMM registers, and memory read/write modes, and finish with experimental data for these methods.
Note: because this is an experiment, the code simply assumes that the buffers are suitably aligned. Handling alignment properly would require extra code and distract from the main topic.
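If you prefer to guarantee the alignment instead of assuming it, one option is to allocate the buffers with an explicit alignment. The sketch below is an illustration, not part of the original experiment: alloc_aligned and is_aligned are hypothetical helper names, and it assumes a C11 library that provides aligned_alloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate at least `bytes` bytes aligned to `align`.
 * `align` must be a power of two supported by the implementation.
 * The size is rounded up because C11 aligned_alloc expects a size that
 * is a multiple of the alignment. Release the result with free(). */
static void *alloc_aligned(size_t bytes, size_t align)
{
    size_t rounded = (bytes + align - 1) / align * align;
    return aligned_alloc(align, rounded);
}

/* Returns 1 when p is a multiple of align, 0 otherwise. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}
```

For example, `char *from = alloc_aligned(32 * 1024 * 1024, 64);` followed by `assert(is_aligned(from, 64));` would give a buffer that starts on a 64-byte boundary, which is the cache line size on most current x86 processors.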
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Read the CPU time-stamp counter. */
    #if defined(__i386__)
    static __inline__ unsigned long rdtsc(void)
    {
        unsigned long int x;
        __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
        return x;
    }
    #elif defined(__x86_64__)
    static __inline__ unsigned long rdtsc(void)
    {
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long)lo) | (((unsigned long)hi) << 32);
    }
    #endif

    /* Copy routines defined in the assembly blocks below. They are x86-64
     * only and follow the System V calling convention:
     * to in %rdi, from in %rsi, number of 8-byte words in %rdx. */
    void m_b_64(void *to, const void *from, unsigned long qword_cnt);
    void m_c(void *to, const void *from, unsigned long qword_cnt);
    void m_c_2(void *to, const void *from, unsigned long qword_cnt);

    /* Method 1: a single complex instruction, rep movsq. */
    __asm__(".text\n"
            ".type m_b_64, @function\n"
            "m_b_64: push %rbp\n"
            "mov %rsp, %rbp\n"
            "mov %rdx, %rcx\n"
            "rep movsq\n"
            "leaveq\n"
            "retq\n");

    /* Method 2: a simple loop that copies one quad word per iteration. */
    __asm__(".text\n"
            ".type m_c, @function\n"
            "m_c: push %rbp\n"
            "mov %rsp, %rbp\n"
            "cp: movq (%rsi), %rax\n"
            "movq %rax, (%rdi)\n"
            "addq $8, %rsi\n"
            "addq $8, %rdi\n"
            "dec %rdx\n"
            "jnz cp\n"
            "leaveq\n"
            "retq\n");

    /* Method 3: loop unrolled 8 times, with prefetch and non-temporal stores.
     * %rbx and %r12 are callee-saved, so they are saved and restored here. */
    __asm__(".text\n"
            ".type m_c_2, @function\n"
            "m_c_2: push %rbp\n"
            "mov %rsp, %rbp\n"
            "push %rbx\n"
            "push %r12\n"
            "co: prefetcht0 1024(%rsi)\n"
            "movq (%rsi), %rax\n"
            "movq 8(%rsi), %rbx\n"
            "movq 16(%rsi), %rcx\n"
            "movq 24(%rsi), %r8\n"
            "movq 32(%rsi), %r9\n"
            "movq 40(%rsi), %r10\n"
            "movq 48(%rsi), %r11\n"
            "movq 56(%rsi), %r12\n"
            "movnti %rax, (%rdi)\n"
            "movnti %rbx, 8(%rdi)\n"
            "movnti %rcx, 16(%rdi)\n"
            "movnti %r8, 24(%rdi)\n"
            "movnti %r9, 32(%rdi)\n"
            "movnti %r10, 40(%rdi)\n"
            "movnti %r11, 48(%rdi)\n"
            "movnti %r12, 56(%rdi)\n"
            "addq $64, %rsi\n"
            "addq $64, %rdi\n"
            "subq $8, %rdx\n"      /* process 8 quad words at a time */
            "jnz co\n"
            "pop %r12\n"
            "pop %rbx\n"
            "leaveq\n"
            "retq\n");

    int main(void)
    {
        int bytes_cnt = 32 * 1024 * 1024;   /* 32 M bytes */
        int word_cnt = bytes_cnt / 2;       /* 16 M words */
        int dword_cnt = word_cnt / 2;       /* 8 M double words */
        int qdword_cnt = dword_cnt / 2;     /* 4 M quad words */
        char *from = (char *)malloc(bytes_cnt);
        char *to = (char *)malloc(bytes_cnt);
        if (from == NULL || to == NULL)
            return 1;
        memset(from, 0x1, bytes_cnt);
        memset(to, 0x0, bytes_cnt);
        unsigned long start;
        unsigned long end;
        int i;
        for (i = 0; i < 10; ++i)
        {
            start = rdtsc();
            m_b_64(to, from, qdword_cnt);
            end = rdtsc();
            printf("m_b_64 use time:\t%lu\n", end - start);
        }
        for (i = 0; i < 10; ++i)
        {
            start = rdtsc();
            m_c(to, from, qdword_cnt);
            end = rdtsc();
            printf("m_c use time:\t%lu\n", end - start);
        }
        for (i = 0; i < 10; ++i)
        {
            start = rdtsc();
            m_c_2(to, from, qdword_cnt);
            end = rdtsc();
            printf("m_c_2 use time:\t%lu\n", end - start);
        }
        /********* use to make sure the copy is OK **********
        int sum = 0;
        for (i = 0; i < bytes_cnt; ++i)
            sum += to[i];
        printf("sum: %d\n", sum);
        ****************************************************/
        return 0;
    }
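The commented-out sum check at the end of main only confirms that the destination bytes add up to the expected total. A stricter check compares the two buffers byte by byte. The following is a minimal sketch of that idea; copy_fn, copy_is_correct, reference_copy, and truncated_copy are hypothetical names introduced for illustration, and the function-pointer type assumes the copy routines take (to, from, count of 8-byte words).

```c
#include <stddef.h>
#include <string.h>

/* Assumed signature of the copy routines under test. */
typedef void (*copy_fn)(void *to, const void *from, unsigned long qword_cnt);

/* Known-good copy, useful for checking the checker itself. */
static void reference_copy(void *to, const void *from, unsigned long qword_cnt)
{
    memcpy(to, from, (size_t)qword_cnt * 8);
}

/* Deliberately broken copy (drops the last quad word): a negative control. */
static void truncated_copy(void *to, const void *from, unsigned long qword_cnt)
{
    if (qword_cnt > 0)
        memcpy(to, from, (size_t)(qword_cnt - 1) * 8);
}

/* Fills `from` with a non-constant pattern, clears `to`, runs fn, and
 * returns 1 if every byte arrived intact, 0 otherwise. */
static int copy_is_correct(copy_fn fn, unsigned char *to, unsigned char *from,
                           unsigned long qword_cnt)
{
    size_t bytes = (size_t)qword_cnt * 8;
    for (size_t i = 0; i < bytes; ++i)
        from[i] = (unsigned char)(i * 31u + 7u);
    memset(to, 0, bytes);
    fn(to, from, qword_cnt);
    return memcmp(to, from, bytes) == 0;
}
```

With routines declared to match copy_fn, a call such as `copy_is_correct(m_c_2, (unsigned char *)to, (unsigned char *)from, qdword_cnt)` could be run once before the timing loops. Note that it overwrites the contents of both buffers.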