Find a faster memory copy method

Source: Internet
Author: User

I found this article:

Faster memory copy than memcpy

A tool called xmemcpy is used for memory copying. It is claimed that, for copies of up to 120 bytes, xmemcpy is 10 times faster than the memcpy provided by glibc, and there is experimental data to back this up.

This is quite surprising. memcpy has always been considered very efficient. Compared with the byte-by-byte copy of functions such as strcpy, memcpy copies word by word. A word is 4 bytes (32-bit) or 8 bytes (64-bit); the CPU accesses a byte and a word in the same way, each within a single instruction and memory cycle. Obviously, word-based copying is more efficient.

So what does xmemcpy rely on to be "10 times faster" than memcpy?
Looking at the implementation of xmemcpy, the secret turns out to be: "for small memory copies, assigning directly with the equal sign is much faster than memcpy".
This is even more confusing. Isn't a memory copy just moving one part of memory to another part? Beyond copying word by word, where is the room for improvement?

I wrote a piece of code:


#include <stdio.h>
#define TESTSIZE 128
struct node {
    char buf[TESTSIZE];
};
void main()
{
    char src[TESTSIZE] = {0};
    char dst[TESTSIZE];
    *(struct node *)dst = *(struct node *)src;
}


Then disassemble:

......
00000000004004a8 <main>:
  4004a8: 55                      push   %rbp
  4004a9: 48 89 e5                mov    %rsp,%rbp
  4004ac: 48 81 ec 00 01 00 00    sub    $0x100,%rsp
  4004b3: 48 8d 7d 80             lea    0xffffffffffffff80(%rbp),%rdi
  4004b7: ba 80 00 00 00          mov    $0x80,%edx
  4004bc: be 00 00 00 00          mov    $0x0,%esi
  4004c1: e8 1a ff ff ff          callq  4003e0 <memset@plt>
  4004c6: 48 8b 45 80             mov    0xffffffffffffff80(%rbp),%rax
  4004ca: 48 89 85 00 ff ff ff    mov    %rax,0xffffffffffffff00(%rbp)
  4004d1: 48 8b 45 88             mov    0xffffffffffffff88(%rbp),%rax
  ......
  400564: 48 89 85 70 ff ff ff    mov    %rax,0xffffffffffffff70(%rbp)
  40056b: 48 8b 45 f8             mov    0xfffffffffffffff8(%rbp),%rax
  40056f: 48 89 85 78 ff ff ff    mov    %rax,0xffffffffffffff78(%rbp)
  400576: c9                      leaveq
  400577: c3                      retq
  400578: 90                      nop
  ......

Then disassemble libc and look at the implementation of memcpy for comparison:

......
0006b400 <memcpy>:
   6b400: 8b 4c 24 0c             mov    0xc(%esp),%ecx
   6b404: 89 f8                   mov    %edi,%eax
   6b406: 8b 7c 24 04             mov    0x4(%esp),%edi
   6b40a: 89 f2                   mov    %esi,%edx
   6b40c: 8b 74 24 08             mov    0x8(%esp),%esi
   6b410: fc                      cld
   6b411: d1 e9                   shr    %ecx
   6b413: 73 01                   jae    6b416 <memcpy+0x16>
   6b415: a4                      movsb  %ds:(%esi),%es:(%edi)
   6b416: d1 e9                   shr    %ecx
   6b418: 73 02                   jae    6b41c <memcpy+0x1c>
   6b41a: 66 a5                   movsw  %ds:(%esi),%es:(%edi)
   6b41c: f3 a5                   repz movsl %ds:(%esi),%es:(%edi)
   6b41e: 89 c7                   mov    %eax,%edi
   6b420: 89 d6                   mov    %edx,%esi
   6b422: 8b 44 24 04             mov    0x4(%esp),%eax
   6b426: c3                      ret
   6b427: 90                      nop
   ......

Both copy word by word. The difference is that the equal-sign assignment is translated by the compiler into a straight run of mov instructions, while memcpy is a loop. "Equal-sign assignment" beats memcpy not in how each word is copied, but in the program flow.
(In addition, testing shows that the assignment is only compiled into this run of consecutive movs when the length is no more than 128 bytes and a multiple of the machine word size; otherwise it is compiled into a call to memcpy. This is, of course, up to the compiler.)
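To see this threshold for yourself, a small test like the sketch below can be compiled and inspected, for example with gcc -S. This is my own illustration, not code from the article, and the struct names are made up; with the compilers I have tried, the small assignment tends to become a run of mov instructions while the large one is usually lowered to a call to memcpy, but the exact cutoff is compiler- and option-dependent.

struct small_block { char buf[64]; };    /* hypothetical small struct */
struct big_block   { char buf[4096]; };  /* hypothetical large struct */

/* The small assignment is typically emitted as straight-line movs,
   while the large one is typically turned into a memcpy call. */
void copy_small(struct small_block *dst, const struct small_block *src)
{
    *dst = *src;
}

void copy_big(struct big_block *dst, const struct big_block *src)
{
    *dst = *src;
}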

Why is a run of consecutive mov instructions faster than the same movs inside a loop, when both copy a machine word at a time?
In the loop version, after each mov the program must: 1. check whether the copy is complete; 2. jump back to continue copying.
That means the CPU executes these two extra actions for every single word copied.
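As an illustration of the difference (my own sketch, not code from the article, assuming 8-byte words), compare a word-by-word copy loop with a fixed-size, fully unrolled copy:

#include <stdint.h>
#include <stddef.h>

/* Word-by-word loop: every iteration pays for a compare and a
   conditional jump in addition to the load/store pair. */
void copy_loop(int64_t *dst, const int64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}

/* Fixed-size, fully unrolled copy of four 8-byte words:
   nothing but loads and stores, no compare, no jump. */
void copy_unrolled32(int64_t *dst, const int64_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}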

Beyond the extra compare and jump instructions themselves, the impact on the CPU's processing flow must not be overlooked either. The CPU splits the execution of an instruction into several stages, forming an instruction pipeline; this lets it complete roughly one instruction per clock cycle and so raises its throughput.
The pipeline can only stream instructions along a single path. Whenever there is a branch (a compare plus a jump), the pipeline cannot simply keep flowing.
To reduce the impact of branches on the pipeline, the CPU applies some branch prediction strategy. But branch prediction can fail, and when it does, the penalty is even greater than not predicting at all.

Loops therefore carry overhead. When efficiency matters, we often need to unroll the loop (for example, copying several words per iteration) so that the compares and jumps do not eat up a large share of CPU time.
This trades space for time.
GCC has a compilation option, -funroll-loops, that unrolls loops automatically.
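As a hand-written example of this space-for-time idea, a copy loop can be partially unrolled so that the compare and jump are paid once per four words instead of once per word. The sketch below is my own (again assuming 8-byte words); -funroll-loops asks GCC to perform roughly this kind of transformation automatically where it judges it profitable.

#include <stdint.h>
#include <stddef.h>

/* Partially unrolled word copy: four words per iteration, so the
   compare-and-jump overhead is paid once per four words; the short
   tail loop handles the 0-3 leftover words. */
void copy_unroll4(int64_t *dst, const int64_t *src, size_t nwords)
{
    size_t i = 0;
    for (; i + 4 <= nwords; i += 4) {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < nwords; i++)
        dst[i] = src[i];
}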

However, loop unrolling should be done in moderation; more is not always better (even leaving aside the wasted space). The CPU's speed depends on the cache: on a cache miss, the CPU wastes a great deal of time waiting for memory (memory is generally an order of magnitude slower than the CPU). A short loop body is more cache-friendly, because a small piece of code that runs repeatedly is easily kept in the cache by the hardware; that is the benefit of code locality. Excessive unrolling destroys this locality. This is why xmemcpy, as mentioned at the beginning, only targets copies of up to 120 bytes: for larger copies, expanding everything into consecutive mov instructions is no longer efficient.



To sum up, "equal-sign assignment" is faster than memcpy because it omits the CPU's compare-and-jump processing and removes the branch's impact on the CPU pipeline, and all of this is achieved by unrolling the memory-copy loop to a moderate degree.
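The same idea can be packaged as a small-copy helper. The sketch below is my own illustration in the spirit of what the article describes, not the actual xmemcpy source; it dispatches a few fixed small sizes to struct assignments (which the compiler can unroll into straight-line movs) and falls back to memcpy for everything else.

#include <string.h>

/* Fixed-size blocks whose assignment the compiler can usually turn
   into straight-line mov instructions. */
struct blk8   { char b[8];   };
struct blk16  { char b[16];  };
struct blk32  { char b[32];  };
struct blk64  { char b[64];  };
struct blk128 { char b[128]; };

/* Assumes dst and src point to suitably aligned buffers of at least n bytes. */
void small_copy(void *dst, const void *src, size_t n)
{
    switch (n) {
    case 8:   *(struct blk8 *)dst   = *(const struct blk8 *)src;   break;
    case 16:  *(struct blk16 *)dst  = *(const struct blk16 *)src;  break;
    case 32:  *(struct blk32 *)dst  = *(const struct blk32 *)src;  break;
    case 64:  *(struct blk64 *)dst  = *(const struct blk64 *)src;  break;
    case 128: *(struct blk128 *)dst = *(const struct blk128 *)src; break;
    default:  memcpy(dst, src, n);  break;   /* any other size: plain memcpy */
    }
}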

 

Some people say that strcpy is faster than memcpy.

http://wenda.tianya.cn/wenda/thread?tid=444a79853b705fe2

memcpy vs. strcpy: which is faster?
Asked anonymously



Strangely, strcpy turns out to be much faster than memcpy. In VC, strcpy mainly uses the general-purpose registers eax and edx, while memcpy copies machine words using edi and esi and may also use SSE2 acceleration. I would have thought memcpy should be faster, since it can move data in batches while strcpy must check for the terminating NUL of the string. But the actual test result is that memcpy is much slower. Today let's test which one is faster.
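For anyone who wants to repeat such a test, here is a minimal timing sketch (my own, not from the original thread). Note that an optimizing compiler may remove or fold these loops, so compile without optimization or add volatile accesses for a fairer comparison; results also depend heavily on the libc version and the CPU.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   120        /* small copy, in the range discussed above */
#define ITERATIONS 10000000L

int main(void)
{
    static char src[BUF_SIZE];
    static char dst[BUF_SIZE];

    memset(src, 'a', BUF_SIZE - 1);   /* make src a valid C string for strcpy */
    src[BUF_SIZE - 1] = '\0';

    clock_t t0 = clock();
    for (long i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_SIZE);

    clock_t t1 = clock();
    for (long i = 0; i < ITERATIONS; i++)
        strcpy(dst, src);

    clock_t t2 = clock();
    printf("memcpy: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("strcpy: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}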
