Linux programming tricks, part 17: reaching maximum memory bandwidth with simple instructions, prefetch, and non-temporal (NT) stores


Original link: http://blog.csdn.net/pennyliang/archive/2011/03/08/6231709.aspx

A copy built on a complex instruction (rep movsq) leaves the programmer no room for optimization; any improvement there depends entirely on Intel's engineers. A copy built on simple instructions is different: the loop can be unrolled, and further techniques can be applied, including prefetch and non-temporal (NT) stores. The previous article did not say much about the non-temporal instruction family. In addition, the registers from R8 onward are used here; they were added with the 64-bit extension and are commonly used for this kind of streaming data.
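The three techniques named above (loop unrolling, prefetch, and non-temporal stores) can also be sketched in C with compiler intrinsics rather than hand-written assembly. This is only an illustrative sketch, not the code measured in this article: the function name `copy_nt_unrolled` is hypothetical, and it assumes an x86-64 target, 8-byte-aligned pointers, and a quad-word count that is a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>   /* _mm_prefetch, _mm_stream_si64, _mm_sfence */

/* Hypothetical helper: copy `qwords` 8-byte words from src to dst.
 * Assumptions: x86-64, 8-byte-aligned pointers, qwords % 8 == 0. */
static void copy_nt_unrolled(void *dst, const void *src, size_t qwords)
{
    const long long *s = (const long long *)src;
    long long *d = (long long *)dst;
    size_t i;

    assert(qwords % 8 == 0);

    for (i = 0; i < qwords; i += 8) {
        /* Hint the cache to fetch data 1024 bytes ahead of the read
         * position. A prefetch is only a hint and does not fault, so
         * running past the end of the buffer is harmless. */
        _mm_prefetch((const char *)((uintptr_t)(s + i) + 1024), _MM_HINT_T0);

        /* Unrolled body: eight non-temporal 64-bit stores (64 bytes)
         * per iteration, written without polluting the cache. */
        _mm_stream_si64(d + i + 0, s[i + 0]);
        _mm_stream_si64(d + i + 1, s[i + 1]);
        _mm_stream_si64(d + i + 2, s[i + 2]);
        _mm_stream_si64(d + i + 3, s[i + 3]);
        _mm_stream_si64(d + i + 4, s[i + 4]);
        _mm_stream_si64(d + i + 5, s[i + 5]);
        _mm_stream_si64(d + i + 6, s[i + 6]);
        _mm_stream_si64(d + i + 7, s[i + 7]);
    }

    /* Non-temporal stores are weakly ordered; the fence makes them
     * globally visible before the function returns. */
    _mm_sfence();
}
```

The prefetch distance of 1024 bytes is the value used in this article's assembly; the best distance depends on the machine and is worth measuring.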

This series is aimed mainly at students. I hope they will write more code, run experiments, and improve their coding ability along the way. Most textbooks on computer architecture and related subjects do not provide executable source code, and without systematic experiments it is difficult to gain a deep understanding.

The book "Computer Systems: A Programmer's Perspective" is good, but many of its examples are not executable programs. Most people forget the material soon after reading it and find it hard to form a systematic understanding. Careful study backed by systematic experiments is what this series of blog posts aims for. If it helps even a few people, that is reward enough.

 

The next installment will cover the SSE instruction set, the XMM registers, and memory read/write modes, and will close with experimental data for all of these methods.

 

Note: because this is an experiment, the code simply assumes that the buffers are aligned (and that the quad-word count is a multiple of 8). Handling arbitrary alignment would require extra code and would distract from the main topic.
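If you want to make the alignment assumption explicit rather than rely on whatever malloc returns, one possible approach is to allocate the buffers on a cache-line boundary. The sketch below uses C11 `aligned_alloc`; the helper names `alloc_cache_aligned` and `is_aligned` and the 64-byte line size are assumptions for illustration, not part of the original experiment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache-line size; 64 bytes is typical on x86-64. */
#define CACHE_LINE 64

/* Hypothetical helper: allocate at least `bytes` bytes on a
 * cache-line boundary. aligned_alloc expects the size to be a
 * multiple of the alignment, so the size is rounded up first. */
static void *alloc_cache_aligned(size_t bytes)
{
    size_t rounded = (bytes + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}

/* Hypothetical helper: check that p is a multiple of `alignment`,
 * which must be a non-zero power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A buffer obtained this way is released with `free`, just like one from malloc.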

 

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Read the CPU time-stamp counter. */
    #if defined(__i386__)
    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned long long int x;
        __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
        return x;
    }
    #elif defined(__x86_64__)
    static __inline__ unsigned long long rdtsc(void)
    {
        unsigned hi, lo;
        __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
    }
    #endif

    /* The copy routines are defined in the top-level asm blocks below.
       x86-64 System V ABI: to in %rdi, from in %rsi, count in %rdx. */
    void m_b_64(void *to, const void *from, unsigned long qword_cnt);
    void m_c(void *to, const void *from, unsigned long qword_cnt);
    void m_c_2(void *to, const void *from, unsigned long qword_cnt);

    /* Complex instruction: rep movsq copies %rcx quad words. */
    __asm__(".text\n"
            ".type m_b_64, @function\n"
            "m_b_64: push %rbp\n"
            "mov %rsp, %rbp\n"
            "mov %rdx, %rcx\n"
            "rep movsq\n"
            "leaveq\n"
            "retq\n");

    /* Simple instructions: one quad word per iteration. */
    __asm__(".text\n"
            ".type m_c, @function\n"
            "m_c: push %rbp\n"
            "mov %rsp, %rbp\n"
            "cp: movq (%rsi), %rax\n"
            "movq %rax, (%rdi)\n"
            "addq $8, %rsi\n"
            "addq $8, %rdi\n"
            "dec %rdx\n"
            "jnz cp\n"
            "leaveq\n"
            "retq\n");

    /* Simple instructions, unrolled 8x, with prefetch and NT stores. */
    __asm__(".text\n"
            ".type m_c_2, @function\n"
            "m_c_2: push %rbp\n"
            "mov %rsp, %rbp\n"
            "push %rbx\n"               // %rbx and %r12 are callee-saved,
            "push %r12\n"               // so save them before use
            "co: prefetcht0 1024(%rsi)\n"
            "movq (%rsi), %rax\n"
            "movq 8(%rsi), %rbx\n"
            "movq 16(%rsi), %rcx\n"
            "movq 24(%rsi), %r8\n"
            "movq 32(%rsi), %r9\n"
            "movq 40(%rsi), %r10\n"
            "movq 48(%rsi), %r11\n"
            "movq 56(%rsi), %r12\n"
            "movnti %rax, (%rdi)\n"
            "movnti %rbx, 8(%rdi)\n"
            "movnti %rcx, 16(%rdi)\n"
            "movnti %r8, 24(%rdi)\n"
            "movnti %r9, 32(%rdi)\n"
            "movnti %r10, 40(%rdi)\n"
            "movnti %r11, 48(%rdi)\n"
            "movnti %r12, 56(%rdi)\n"
            "addq $64, %rsi\n"
            "addq $64, %rdi\n"
            "subq $8, %rdx\n"           // process 8 quad words at a time
            "jnz co\n"
            "sfence\n"                  // order the non-temporal stores
            "pop %r12\n"
            "pop %rbx\n"
            "leaveq\n"
            "retq\n");

    int main(void)
    {
        int bytes_cnt = 32 * 1024 * 1024;   // 32 M bytes
        int word_cnt = bytes_cnt / 2;       // 16 M words
        int dword_cnt = word_cnt / 2;       // 8 M double words
        int qdword_cnt = dword_cnt / 2;     // 4 M quad words
        char *from = (char *)malloc(bytes_cnt);
        char *to = (char *)malloc(bytes_cnt);
        memset(from, 0x1, bytes_cnt);
        memset(to, 0x0, bytes_cnt);
        unsigned long long start;
        unsigned long long end;
        int i;
        for (i = 0; i < 10; ++i)
        {
            start = rdtsc();
            m_b_64(to, from, qdword_cnt);
            end = rdtsc();
            printf("m_b_64 use time:\t%llu\n", end - start);
        }
        for (i = 0; i < 10; ++i)
        {
            start = rdtsc();
            m_c(to, from, qdword_cnt);
            end = rdtsc();
            printf("m_c use time:\t%llu\n", end - start);
        }
        for (i = 0; i < 10; ++i)
        {
            start = rdtsc();
            m_c_2(to, from, qdword_cnt);
            end = rdtsc();
            printf("m_c_2 use time:\t%llu\n", end - start);
        }
        /********* use to make sure cpy is ok **********
        int sum = 0;
        for (i = 0; i < bytes_cnt; ++i)
            sum += to[i];
        printf("sum: %d\n", sum);
        ************************************************/
        free(from);
        free(to);
        return 0;
    }
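The benchmark prints raw time-stamp-counter deltas. To compare them against a memory-bandwidth figure, they have to be converted to bytes per cycle or bytes per second. The sketch below shows one way to do that arithmetic; the helper names are hypothetical, and `tsc_hz` (the time-stamp-counter frequency) is a value the reader must supply for their own machine, since the original program does not measure it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bytes copied per time-stamp-counter tick. */
static double bytes_per_cycle(uint64_t bytes, uint64_t cycles)
{
    assert(cycles > 0);
    return (double)bytes / (double)cycles;
}

/* Hypothetical helper: copy rate in GiB/s.
 * tsc_hz is an assumed, externally supplied TSC frequency in Hz. */
static double gib_per_second(uint64_t bytes, uint64_t cycles, double tsc_hz)
{
    double seconds;

    assert(cycles > 0 && tsc_hz > 0.0);
    seconds = (double)cycles / tsc_hz;
    return ((double)bytes / (1024.0 * 1024.0 * 1024.0)) / seconds;
}
```

For example, copying the 32 MB buffer in 16 * 1024 * 1024 ticks works out to 2 bytes per tick. Note that every copied byte is read once and written once, so the traffic on the memory bus is roughly twice the number of bytes copied.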
