Write a faster memcpy.

Source: Internet
Author: User
Tags: Windows 7, x64

Writing code can be deeply uncomfortable when a long-held faith breaks down, much like losing a religion. Years ago I read "Optimization of memcpy by VC" and "Efficiency geek 2: copying data in C/C++, optimisation", and came away firmly believing that it is difficult to write a memcpy faster than the one in the C runtime library. Two recent things, however, have made me doubt this belief.

The first was reading the lz4 source code. lz4 may be the fastest memory compression algorithm available at present; in some benchmarks it beats snappy (I will analyze the lz4 implementation in a separate article later). Studying its code, I noticed that, unlike most code, its memory copy does not call memcpy. Instead it uses a macro that casts the pointers to uint64_t and copies by word-sized assignment. I believe this is one place where it gains speed. The copy code is below; note that the implementation deliberately does not care that the last word may overrun the requested length.

    /* ZBYTE_TO_UINT64(p) reinterprets the byte pointer p as a uint64_t lvalue,
       e.g. #define ZBYTE_TO_UINT64(p) (*(uint64_t *)(p)) */
    #define ZEN_TEST_FAST_COPY(dst, src, sz) {                             \
        uint8_t *_cpy_dst = (uint8_t *)(dst);                              \
        uint8_t *_cpy_src = (uint8_t *)(src);                              \
        size_t _cpy_size = (sz);                                           \
        do {                                                               \
            ZBYTE_TO_UINT64(_cpy_dst) = ZBYTE_TO_UINT64(_cpy_src);         \
            _cpy_dst += sizeof(uint64_t);                                  \
            _cpy_src += sizeof(uint64_t);                                  \
        } while (_cpy_size > sizeof(uint64_t) &&                           \
                 (_cpy_size -= sizeof(uint64_t)));                         \
    }

 

The second was reading the article "Which memcpy is faster?", which found that the memcpy in the Linux standard library performed quite badly. That was a small revolution for my belief. One problem with the original code in that article is that it copies the non-word-sized remainder first. If the src and dst addresses passed to memcpy are both aligned, that order clearly does not help speed. The improved code is as follows:

    void *ZEN_OS::fast_memcpy(void *dst, const void *src, size_t sz)
    {
        void *r = dst;
        /* first copy the part that is a whole number of uint64_t words */
        size_t n = sz & ~(sizeof(uint64_t) - 1);
        uint64_t *src_u64 = (uint64_t *)src;
        uint64_t *dst_u64 = (uint64_t *)dst;
        while (n) {
            *dst_u64++ = *src_u64++;
            n -= sizeof(uint64_t);
        }
        /* then copy the remaining tail bytes one at a time */
        n = sz & (sizeof(uint64_t) - 1);
        uint8_t *src_u8 = (uint8_t *)src_u64;
        uint8_t *dst_u8 = (uint8_t *)dst_u64;
        while (n--)
            *dst_u8++ = *src_u8++;
        return r;
    }

 

There is also a similar function in the article's code. The only difference is that it copies two uint64_t values in each loop iteration. For convenience, call it fast_memcpy2.

        while (n) {
            *dst_u64++ = *src_u64++;
            *dst_u64++ = *src_u64++;
            n -= 2 * sizeof(uint64_t);
        }

 

To see how these actually perform, I tested them myself. The tests copied blocks of 8, 16, ..., 64 KB, 1 MB, and 4 MB. Data was collected with a high-precision timer in three environments: Linux 64-bit (GCC 4.3, -O3), Windows 7 x64 (Visual C++ 2010, Release), and Windows 7 Win32 (Release).

The first group of tests used byte-aligned data. Rather than paste the raw numbers, I normalized the slowest result to 100% and plotted everything else as a ratio to it, so a lower line is better:

[Figure: Linux 64-bit (GCC 4.3, -O3), copy speed comparison with byte-aligned data]

[Figure: Windows 7 x64, Visual C++ 2010 Release, copy speed comparison with byte-aligned data]

From these comparisons it is clear that, when the data is aligned, memcpy is a good choice at essentially any size.

However, there is another common case for memory copies: the addresses are not aligned. As a compression algorithm, lz4 likely faces unaligned copies all the time, so I also tested that situation.

[Figure: Linux 64-bit (GCC 4.3, -O3), copy speed comparison with unaligned data]

[Figure: Windows 7 x64, Visual C++ 2010 Release, copy speed comparison with unaligned data, including the macro copy]

When the data is not aligned and the copied length is under 256 bytes, the 8-byte assignment approach is slightly faster than memcpy, which may be why lz4 adopts it. For larger unaligned copies, memcpy is the better choice. Since most of the copies inside lz4 should be smaller than 256 bytes, it can in theory gain some advantage from the macro copy.

Conclusion:

In the case of byte alignment, memcpy is almost always the best choice.

Some methods are faster than memcpy under specific conditions. However, if you do not know which to choose, it is better to select memcpy by default. The same applies to memset.

When the data length reaches 1 MB or 4 MB, Windows performs better than Linux, especially on the 64-bit platform. (Of course, the Linux test environment was a virtual machine; did that have an impact?) My guess is that the difference comes down to instruction-level optimization in the respective compilers, with Visual C++ doing better here than GCC.

Trying to write a function faster than the runtime library's memcpy is itself a slightly foolish undertaking: your opponent has compiler optimization, code-generation optimization, instruction-level optimization, and more on its side. There is probably something wrong in the "Which memcpy is faster?" article; of course, the author may have had his own context (an unoptimized build?), which he never made clear.

References and background:

"Optimization of memcpy by VC": how the compiler optimizes copies of various lengths.

The author of "Efficiency geek 2: copying data in C/C++, optimisation" evaluated some geek methods; the first part on memset is particularly detailed, and some of the geek tricks are worth a look. "Memcpy of various versions (underlying optimization)" explains some low-level data copy optimizations. The memcpy discussed in "Optimizing memcpy improves speed" is not the C runtime library's memcpy, but that article also explains some ways to accelerate data copying.

"Which memcpy is faster?" is the article that misled me. I suspect the behavior it describes comes from some special scenario, or perhaps the author and I simply mean different things by "the C library memcpy".

The article "C/C++ tip: How to copy memory quickly" also examines this problem, and its conclusion matches ours.

[The author of this article is Yandu hentan. In the spirit of free sharing, you may repost this article for non-profit purposes; please attach the blog link: http://www.cnblogs.com/fullsail/.]

 
