Faster memory copy than memcpy

Source: Internet
Author: User

There is a tool called xmemcpy for memory copying. It is claimed that, for copies of up to 120 bytes, xmemcpy is 10 times faster than the memcpy provided by glibc, and there is experimental data to back this up.

This is quite surprising. memcpy has always been considered very efficient. Compared with the byte-by-byte copy of functions such as strcpy, memcpy copies data word by word, based on the machine word length. A word is 4 bytes (32-bit) or 8 bytes (64-bit). For the CPU, accessing a word costs the same as accessing a byte: each completes in a single instruction and memory cycle. Obviously, copying by word is more efficient.

So what does xmemcpy rely on to be "10 times faster" than memcpy?
Looking at the implementation of xmemcpy, the secret of its speed turns out to be this: "for copying small blocks of memory, direct assignment with the equal sign is much faster than memcpy."
This is even more confusing. Isn't a memory copy just moving one region of memory to another? Is there still room to improve on copying word by word?

Let's write a bit of code:

#include <stdio.h>

#define TESTSIZE 128

struct node {
    char buf[TESTSIZE];
};

int main(void)
{
    char src[TESTSIZE] = {0};
    char dst[TESTSIZE];
    *(struct node *)dst = *(struct node *)src;
    return 0;
}
Then disassemble:

......
00000000004004a8 <main>:
  4004a8: 55                      push   %rbp
  4004a9: 48 89 e5                mov    %rsp,%rbp
  4004ac: 48 81 ec 00 01 00 00    sub    $0x100,%rsp
  4004b3: 48 8d 7d 80             lea    0xffffffffffffff80(%rbp),%rdi
  4004b7: ba 80 00 00 00          mov    $0x80,%edx
  4004bc: be 00 00 00 00          mov    $0x0,%esi
  4004c1: e8 1a ff ff ff          callq  4003e0 <memset@plt>
  4004c6: 48 8b 45 80             mov    0xffffffffffffff80(%rbp),%rax
  4004ca: 48 89 85 00 ff ff ff    mov    %rax,0xffffffffffffff00(%rbp)
  4004d1: 48 8b 45 88             mov    0xffffffffffffff88(%rbp),%rax
......
  400564: 48 89 85 70 ff ff ff    mov    %rax,0xffffffffffffff70(%rbp)
  40056b: 48 8b 45 f8             mov    0xfffffffffffffff8(%rbp),%rax
  40056f: 48 89 85 78 ff ff ff    mov    %rax,0xffffffffffffff78(%rbp)
  400576: c9                      leaveq
  400577: c3                      retq
  400578: 90                      nop
......

Then we disassemble libc and find the implementation of memcpy for comparison:

......
0006b400 <memcpy>:
   6b400: 8b 4c 24 0c             mov    0xc(%esp),%ecx
   6b404: 89 f8                   mov    %edi,%eax
   6b406: 8b 7c 24 04             mov    0x4(%esp),%edi
   6b40a: 89 f2                   mov    %esi,%edx
   6b40c: 8b 74 24 08             mov    0x8(%esp),%esi
   6b410: fc                      cld
   6b411: d1 e9                   shr    %ecx
   6b413: 73 01                   jae    6b416 <memcpy+0x16>
   6b415: a4                      movsb  %ds:(%esi),%es:(%edi)
   6b416: d1 e9                   shr    %ecx
   6b418: 73 02                   jae    6b41c <memcpy+0x1c>
   6b41a: 66 a5                   movsw  %ds:(%esi),%es:(%edi)
   6b41c: f3 a5                   repz movsl %ds:(%esi),%es:(%edi)
   6b41e: 89 c7                   mov    %eax,%edi
   6b420: 89 d6                   mov    %edx,%esi
   6b422: 8b 44 24 04             mov    0x4(%esp),%eax
   6b426: c3                      ret
   6b427: 90                      nop
......

Both copy by word. However, the equal-sign assignment is translated by the compiler into a straight sequence of mov instructions, while memcpy is a loop. "Equal-sign assignment" is faster than memcpy not because of the copy method, but because of the program flow.
(In addition, testing shows that the assignment is only compiled into consecutive movs when the length is less than or equal to 128 bytes and a multiple of the machine word length; otherwise it is compiled into a call to memcpy. This is, of course, determined by the compiler.)

Why are consecutive mov instructions faster than a loop of mov instructions, when both copy by machine word?
In the loop form, after each mov the CPU must: 1. test whether the copy is complete; 2. jump back to continue copying.
For every single word copied, the CPU has to execute these two extra actions.

Beyond the cost of the test and jump instructions themselves, the impact on the CPU's processing flow is not negligible either. The CPU divides instruction execution into several stages, forming an instruction pipeline; this lets it complete roughly one instruction per clock cycle, greatly improving throughput.
The pipeline can only run smoothly along a single instruction path. When there is a branch (test + jump), the pipeline stalls.
To reduce the impact of branches on the pipeline, the CPU may adopt some branch prediction strategy. But branch prediction can fail, and when it fails the penalty is even greater than not predicting at all.

So loops carry overhead. Where efficiency matters, we often unroll the loop (for example, copying n bytes per iteration, as in this case) so that the test and jump do not eat up so much CPU time. This trades space for time. GCC has a compilation option that unrolls loops automatically (for example, -funroll-loops).
However, loop unrolling should be done in moderation; more is not always better (even ignoring the waste of space). Fast execution by the CPU depends on the cache: on a cache miss, the CPU wastes a great deal of time waiting for memory (memory is generally an order of magnitude slower than the CPU). A short loop body is more cache-friendly, because a small piece of repeatedly executed code is easily kept in the cache by the hardware; this is the benefit of code locality. Excessive unrolling destroys that locality. This is why xmemcpy only claims the speedup for copies within 120 bytes: for larger copies, expanding everything into consecutive mov instructions is no longer efficient.

To sum up, "equal-sign assignment" is faster than memcpy because it omits the CPU's test-and-jump processing and eliminates the impact of branches on the CPU pipeline. All of this is achieved by moderately unrolling the memory-copy loop.

Transferred from: http://hi.baidu.com/_kouu/blog/item/c9761112ff2eca0b5baf5342.html
