Find a faster memory copy method

Source: Internet
Author: User

I found this article:

Faster memory copy than memcpy

A tool called xmemcpy is used for memory copying. It is claimed that, for copies of up to 120 bytes, xmemcpy is 10 times faster than the memcpy provided by glibc, and there is experimental data to back this up.

This is quite surprising. memcpy has always been considered very efficient. Compared with the byte-by-byte copy of functions such as strcpy, memcpy copies word by word. A word is 4 bytes (32-bit) or 8 bytes (64-bit); the CPU accesses a byte and a word in the same way, each within a single instruction and memory cycle. Obviously, word-based copying is more efficient.

So what does xmemcpy rely on to be "10 times faster" than memcpy?
Looking at the implementation of xmemcpy, the secret turns out to be: "for small memory copies, assigning directly with the equal sign is much faster than memcpy".
This is even more confusing. Isn't a memory copy just moving one part of memory to another part? Beyond copying word by word, where is the room for improvement?

I wrote a piece of code:


#include <stdio.h>
#define TESTSIZE 128
struct node {
    char buf[TESTSIZE];
};
void main()
{
    char src[TESTSIZE] = {0};
    char dst[TESTSIZE];
    *(struct node *)dst = *(struct node *)src;
}


Then disassemble:

......
00000000004004a8 <main>:
  4004a8: 55                      push   %rbp
  4004a9: 48 89 e5                mov    %rsp,%rbp
  4004ac: 48 81 ec 00 01 00 00    sub    $0x100,%rsp
  4004b3: 48 8d 7d 80             lea    0xffffffffffffff80(%rbp),%rdi
  4004b7: ba 80 00 00 00          mov    $0x80,%edx
  4004bc: be 00 00 00 00          mov    $0x0,%esi
  4004c1: e8 1a ff ff ff          callq  4003e0 <memset@plt>
  4004c6: 48 8b 45 80             mov    0xffffffffffffff80(%rbp),%rax
  4004ca: 48 89 85 00 ff ff ff    mov    %rax,0xffffffffffffff00(%rbp)
  4004d1: 48 8b 45 88             mov    0xffffffffffffff88(%rbp),%rax
  ......
  400564: 48 89 85 70 ff ff ff    mov    %rax,0xffffffffffffff70(%rbp)
  40056b: 48 8b 45 f8             mov    0xfffffffffffffff8(%rbp),%rax
  40056f: 48 89 85 78 ff ff ff    mov    %rax,0xffffffffffffff78(%rbp)
  400576: c9                      leaveq
  400577: c3                      retq
  400578: 90                      nop
  ......

Then disassemble libc and look at the implementation of memcpy for comparison:

......
0006b400 <memcpy>:
   6b400: 8b 4c 24 0c             mov    0xc(%esp),%ecx
   6b404: 89 f8                   mov    %edi,%eax
   6b406: 8b 7c 24 04             mov    0x4(%esp),%edi
   6b40a: 89 f2                   mov    %esi,%edx
   6b40c: 8b 74 24 08             mov    0x8(%esp),%esi
   6b410: fc                      cld
   6b411: d1 e9                   shr    %ecx
   6b413: 73 01                   jae    6b416 <memcpy+0x16>
   6b415: a4                      movsb  %ds:(%esi),%es:(%edi)
   6b416: d1 e9                   shr    %ecx
   6b418: 73 02                   jae    6b41c <memcpy+0x1c>
   6b41a: 66 a5                   movsw  %ds:(%esi),%es:(%edi)
   6b41c: f3 a5                   repz movsl %ds:(%esi),%es:(%edi)
   6b41e: 89 c7                   mov    %eax,%edi
   6b420: 89 d6                   mov    %edx,%esi
   6b422: 8b 44 24 04             mov    0x4(%esp),%eax
   6b426: c3                      ret
   6b427: 90                      nop
   ......

Both copy word by word. The difference is that the equal-sign assignment is translated by the compiler into a straight run of mov instructions, while memcpy is a loop. "Equal-sign assignment" beats memcpy not in how each word is copied, but in the program flow.
(In addition, testing shows that the assignment is only compiled into this run of consecutive movs when the length is no more than 128 bytes and a multiple of the machine word size; otherwise it is compiled into a call to memcpy. This is, of course, up to the compiler.)
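To see this threshold for yourself, a small test like the sketch below can be compiled and inspected, for example with gcc -S. This is my own illustration, not code from the article, and the struct names are made up; with the compilers I have tried, the small assignment tends to become a run of mov instructions while the large one is usually lowered to a call to memcpy, but the exact cutoff is compiler- and option-dependent.

struct small_block { char buf[64]; };    /* hypothetical small struct */
struct big_block   { char buf[4096]; };  /* hypothetical large struct */

/* The small assignment is typically emitted as straight-line movs,
   while the large one is typically turned into a memcpy call. */
void copy_small(struct small_block *dst, const struct small_block *src)
{
    *dst = *src;
}

void copy_big(struct big_block *dst, const struct big_block *src)
{
    *dst = *src;
}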

Why is a run of consecutive mov instructions faster than the same movs inside a loop, when both copy a machine word at a time?
In the loop version, after each mov the program must: 1. check whether the copy is complete; 2. jump back to continue copying.
That means the CPU executes these two extra actions for every single word copied.
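As an illustration of the difference (my own sketch, not code from the article, assuming 8-byte words), compare a word-by-word copy loop with a fixed-size, fully unrolled copy:

#include <stdint.h>
#include <stddef.h>

/* Word-by-word loop: every iteration pays for a compare and a
   conditional jump in addition to the load/store pair. */
void copy_loop(int64_t *dst, const int64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}

/* Fixed-size, fully unrolled copy of four 8-byte words:
   nothing but loads and stores, no compare, no jump. */
void copy_unrolled32(int64_t *dst, const int64_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}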

Beyond the extra compare and jump instructions themselves, the impact on the CPU's processing flow must not be overlooked either. The CPU splits the execution of an instruction into several stages, forming an instruction pipeline; this lets it complete roughly one instruction per clock cycle and so raises its throughput.
The pipeline can only stream instructions along a single path. Whenever there is a branch (a compare plus a jump), the pipeline cannot simply keep flowing.
To reduce the impact of branches on the pipeline, the CPU applies some branch prediction strategy. But branch prediction can fail, and when it does, the penalty is even greater than not predicting at all.

Loops therefore carry overhead. When efficiency matters, we often need to unroll the loop (for example, copying several words per iteration) so that the compares and jumps do not eat up a large share of CPU time.
This trades space for time.
GCC has a compilation option, -funroll-loops, that unrolls loops automatically.
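As a hand-written example of this space-for-time idea, a copy loop can be partially unrolled so that the compare and jump are paid once per four words instead of once per word. The sketch below is my own (again assuming 8-byte words); -funroll-loops asks GCC to perform roughly this kind of transformation automatically where it judges it profitable.

#include <stdint.h>
#include <stddef.h>

/* Partially unrolled word copy: four words per iteration, so the
   compare-and-jump overhead is paid once per four words; the short
   tail loop handles the 0-3 leftover words. */
void copy_unroll4(int64_t *dst, const int64_t *src, size_t nwords)
{
    size_t i = 0;
    for (; i + 4 <= nwords; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < nwords; i++)
        dst[i] = src[i];
}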

However, loop unrolling should be done in moderation; more is not always better (even leaving aside the wasted space). The CPU's speed depends on the cache: on a cache miss, the CPU wastes a great deal of time waiting for memory (memory is generally an order of magnitude slower than the CPU). A short loop body is more cache-friendly, because a small piece of code that runs repeatedly is easily kept in the cache by the hardware; that is the benefit of code locality. Excessive unrolling destroys this locality. This is why xmemcpy, as mentioned at the beginning, only targets copies of up to 120 bytes: for larger copies, expanding everything into consecutive mov instructions is no longer efficient.



To sum up, "equal-sign assignment" is faster than memcpy because it omits the CPU's compare-and-jump processing and removes the branch's impact on the CPU pipeline, and all of this is achieved by unrolling the memory-copy loop to a moderate degree.
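The same idea can be packaged as a small-copy helper. The sketch below is my own illustration in the spirit of what the article describes, not the actual xmemcpy source; it dispatches a few fixed small sizes to struct assignments (which the compiler can unroll into straight-line movs) and falls back to memcpy for everything else.

#include <string.h>

/* Fixed-size blocks whose assignment the compiler can usually turn
   into straight-line mov instructions. */
struct blk8   { char b[8];   };
struct blk16  { char b[16];  };
struct blk32  { char b[32];  };
struct blk64  { char b[64];  };
struct blk128 { char b[128]; };

/* Assumes dst and src point to suitably aligned buffers of at least n bytes. */
void small_copy(void *dst, const void *src, size_t n)
{
    switch (n) {
    case 8:   *(struct blk8 *)dst   = *(const struct blk8 *)src;   break;
    case 16:  *(struct blk16 *)dst  = *(const struct blk16 *)src;  break;
    case 32:  *(struct blk32 *)dst  = *(const struct blk32 *)src;  break;
    case 64:  *(struct blk64 *)dst  = *(const struct blk64 *)src;  break;
    case 128: *(struct blk128 *)dst = *(const struct blk128 *)src; break;
    default:  memcpy(dst, src, n);  break;   /* any other size: plain memcpy */
    }
}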

 

Some people say that strcpy is faster than memcpy.

http://wenda.tianya.cn/wenda/thread?tid=444a79853b705fe2

memcpy vs. strcpy: which is faster?
Asked anonymously



Strangely, strcpy turns out to be much faster than memcpy. In VC, strcpy mainly uses the general-purpose registers eax and edx, while memcpy copies machine words using edi and esi and may also use SSE2 acceleration. I would have thought memcpy should be faster, since it can move data in batches while strcpy must check for the terminating NUL of the string. But the actual test result is that memcpy is much slower. Today let's test which one is faster.
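For anyone who wants to repeat such a test, here is a minimal timing sketch (my own, not from the original thread). Note that an optimizing compiler may remove or fold these loops, so compile without optimization or add volatile accesses for a fairer comparison; results also depend heavily on the libc version and the CPU.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   120        /* small copy, in the range discussed above */
#define ITERATIONS 10000000L

int main(void)
{
    static char src[BUF_SIZE];
    static char dst[BUF_SIZE];

    memset(src, 'a', BUF_SIZE - 1);   /* make src a valid C string for strcpy */
    src[BUF_SIZE - 1] = '\0';

    clock_t t0 = clock();
    for (long i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_SIZE);

    clock_t t1 = clock();
    for (long i = 0; i < ITERATIONS; i++)
        strcpy(dst, src);

    clock_t t2 = clock();
    printf("memcpy: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("strcpy: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}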
