Reprint: Memory copy memcpy() vs. vmsplice() performance

Source: Internet
Author: User
Tags: sendfile, shebang

Overview: memcpy() vs. vmsplice() memory copy performance

In the previous article, "Inter-process Big Data Copy Method Survey", I introduced and compared three ways for process A to read a file and then copy the data to process B. The test results showed that when data has to move between memory and disk, the splice approach performs best, because it avoids copying the data back and forth between kernel buffers and user buffers. However, that comparison was restricted to a special case that involves disk I/O and in which the process cannot modify the data, so it is rarely seen in practice and its significance is more theoretical than practical.

The scenario explored in this article, by contrast, comes up all the time in real programming:

Process A has a large chunk of data in memory that it needs to pass to process B.

To solve the data transfer problem in this scenario, this article compares two methods:

    1. Request a block of shared memory; process A memcpy()s its data in, and process B memcpy()s it out.
    2. Create a pipe; process A vmsplice()s its data to the pipe's write end, and process B vmsplice()s it from the read end into its own user space.

In practice we almost always use the first method, because memcpy feels fast enough. But which one is actually faster? We still need test data to answer that.

Test method

The memcpy method

For the memcpy method, we measure the time spent copying the data from process A's user space into shared memory plus the time spent copying it from shared memory into process B's user space.

Someone may ask: why not simply place the variable in shared memory, so that once A has modified it, B can use it directly with no copying at all?

The answer is that we want process A to hold its own copy of the data and process B to hold another; the two copies must not interfere with each other, and most of the time they are allowed to be inconsistent. Only when necessary is the data actually copied across.
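
The copy code below assumes that the shared buffer (shared_memory) and the mutex (lock) already exist; that setup is not shown in this excerpt. As a rough sketch only (the names SHM_SIZE, lock_ptr and setup_shared() are illustrative assumptions, not the author's code), the segment and a process-shared mutex could be created like this before forking A and B:

/* Sketch: one possible setup for the shared buffer and the process-shared
 * mutex used by the copy code below. Not the author's actual code. */
#include <pthread.h>
#include <sys/mman.h>

#define SHM_SIZE (1UL << 30)           /* room for the 1 GB test payload */

static void *shared_memory;            /* referenced by the copy code */
static pthread_mutex_t *lock_ptr;      /* the copy code uses a mutex named lock */

static int setup_shared(void)
{
    /* Anonymous shared mapping: visible to children created by fork(). */
    shared_memory = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared_memory == MAP_FAILED)
        return -1;

    /* The mutex itself must also live in shared memory and be marked
     * process-shared, otherwise it cannot synchronize two processes. */
    lock_ptr = mmap(NULL, sizeof(*lock_ptr), PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (lock_ptr == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(lock_ptr, &attr);
    pthread_mutexattr_destroy(&attr);
    return 0;                          /* fork() processes A and B after this */
}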

The code is as follows:

sendfile()

if (mode == 1) {
    void *shm = shared_memory;
    void *src = usrmem;
    int size = BUF_SIZE, total = 0;
    while (total < len) {
        size = len - total > BUF_SIZE ? BUF_SIZE : len - total;
        memcpy(shm, src, size);
        shm += size;
        src += size;
        total += size;
        //printf("mode[%d] sendfile() total=%d size=%d\n", mode, total, size);
    }
    pthread_mutex_unlock(&lock);   /* the lock was taken before this block */
    //cal_md5(shared_memory, len, md5);
}

usrmem is the buffer in process A's user space; the data has already been filled in before sendfile() runs. The mutex is held while A copies into shared memory, blocking process B from reading the shared memory, and is released once the copy is done.

getfile()

if (mode == 1) {
    void *shm = shared_memory;
    void *dst = usrmem;
    pthread_mutex_lock(&lock);
    int size = BUF_SIZE, total = 0;
    while (total < len) {
        size = len - total > BUF_SIZE ? BUF_SIZE : len - total;
        memcpy(dst, shm, size);
        shm += size;
        dst += size;
        total += size;
        //printf("mode[%d] getfile() total=%d size=%d\n", mode, total, size);
    }
    pthread_mutex_unlock(&lock);
}

Reading is also done under the lock, so that process A cannot write to the data and corrupt it mid-read; the lock is released when the read finishes.

The vmsplice method

vmsplice() is one of the splice family of functions. It maps user-space memory into kernel space, using a pipe as the carrier, so that a user process can operate directly on data in kernel buffers.

The splice function family includes:

long splice(int fd_in, off_t *off_in, int fd_out, off_t *off_out, size_t len, unsigned int flags);

long tee(int fd_in, int fd_out, size_t len, unsigned int flags);

long vmsplice(int fd, const struct iovec *iov, unsigned long nr_segs, unsigned int flags);

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

As you can see, this family covers every combination of mapping among user space, kernel space, and file descriptors. See the man pages for how to use each function.
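
To make the mechanism concrete, here is a small self-contained demo (a sketch written for this reprint, not part of the author's test program): it vmsplices a user buffer into a pipe and then reads it back within a single process.

/* Minimal vmsplice demo (sketch): push a user-space buffer into a pipe
 * with vmsplice(), then drain the pipe back into another user buffer. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    int pipefd[2];
    char src[4096], dst[4096];

    memset(src, 'A', sizeof(src));
    if (pipe(pipefd) == -1) { perror("pipe"); return 1; }

    /* Map/copy the user buffer into the pipe's kernel buffer. */
    struct iovec iov = { .iov_base = src, .iov_len = sizeof(src) };
    ssize_t n = vmsplice(pipefd[1], &iov, 1, 0);
    if (n == -1) { perror("vmsplice"); return 1; }

    /* Read the data back out of the pipe. */
    if (read(pipefd[0], dst, sizeof(dst)) != n) { perror("read"); return 1; }

    printf("round-tripped %zd bytes, identical: %s\n",
           n, memcmp(src, dst, n) == 0 ? "yes" : "no");
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}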

sendfile()

else if (mode == 2) {
    char *p = usrmem;
    size_t send;
    size_t left = len;
    long ret;
    struct iovec iov;
    long nr_segs = 1;
    int flags = 0x1;
    while (left > 0) {
        send = left > PIPE_BUF ? PIPE_BUF : left;
        iov.iov_base = p;
        iov.iov_len = send;
        ret = vmsplice(pipefd[1], &iov, nr_segs, flags);
        if (ret == -1) {
            perror("vmsplice failed");
            return;
        }
        left -= ret;
        p += ret;
        //printf("mode[%d] sendfile() left=%d ret=%d\n", mode, left, ret);
    }
}

To operate on user space, vmsplice relies on a struct iovec object. It has two members: iov_base, a pointer into user space, and iov_len, the size of that region. Because of the pipe's size limit (4096 bytes by default on Linux), the contents of the struct iovec have to be updated again and again while vmsplicing.

getfile()

else if (mode == 2) {
    char *p = usrmem;
    size_t get;
    size_t left = len;
    long ret;
    struct iovec iov;
    long nr_segs = 1;
    int flags = 0x1;
    while (left > 0) {
        get = left > PIPE_BUF ? PIPE_BUF : left;
        iov.iov_base = p;
        iov.iov_len = get;
        ret = vmsplice(pipefd[0], &iov, nr_segs, flags);
        if (ret == -1) {
            perror("vmsplice failed");
            return;
        }
        left -= ret;
        p += ret;
        //printf("mode[%d] getfile() left=%d ret=%d\n", mode, left, ret);
    }
}

getfile() only needs to pass the pipe's read end as the first argument. Because vmsplice is really a mapping operation rather than a copy, it does not care which of the two arguments is the source and which is the destination.

Comparison method

We compare the CPU time consumed by the four routines above. To verify that the data is copied completely, the MD5 of the user-space buffers of processes A and B is computed before the copy begins and after it ends. Compile the test program:

gcc zmd5.c memcpy_test.c -o memcpy_test
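
The figures reported below are CPU time per step. The timing code itself is not shown in this reprint; a minimal sketch of the kind of measurement involved (using clock(), as an assumption about the approach rather than the author's actual code) looks like this:

/* Sketch: timing a copy step with clock(). This times a plain memcpy of a
 * 256 MB buffer; the real test times the sendfile()/getfile() routines. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t n = 256 * 1024 * 1024;          /* 256 MB dummy payload */
    char *a = malloc(n), *b = malloc(n);
    if (!a || !b) { perror("malloc"); return 1; }
    memset(a, 1, n);                       /* touch the pages first */

    clock_t t0 = clock();                  /* CPU time used so far */
    memcpy(b, a, n);                       /* the step being measured */
    clock_t t1 = clock();

    printf("memcpy() CPU time: %.6fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    free(a);
    free(b);
    return 0;
}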

The raw test data is a 1 GB file. Each method is run 20 times; the time of the sendfile() step and the time of the getfile() step are recorded, their sum is the time for the whole copy, and the averages are compared. A script in the code folder, run_memcpy_test.sh, drives the runs:

#!/bin/zsh
for ((i=1;i<=20;i++)); do
    echo "#" $i >> ret_mem_1
    ./memcpy_test bigfile 1 >> ret_mem_1
done
for ((i=1;i<=20;i++)); do
    echo "#" $i >> ret_mem_2
    ./memcpy_test bigfile 2 >> ret_mem_2
done

Why not use the K-best method instead of the average?

Because once the test data was in, every value fell within a reasonable range, but the minimum values were clearly below the average and, practically speaking, not representative.

Test results

The calculated results are as follows:

In the sendfile() step, memcpy is faster than vmsplice; in the getfile() step, vmsplice is faster than memcpy.

In total time, memcpy is slightly better than vmsplice, leading by about 2.6%.

Conclusion

On the numbers, memcpy beats vmsplice, but only when memory is plentiful. In our test, even the 1 GB of data was placed directly into shared memory in one piece, which would not be acceptable in real programming; the copy would have to be done in fragments, which inevitably adds locking overhead. vmsplice, on the other hand, is constrained by the pipe size, so its data is already transferred in segments; as a result, vmsplice uses about a third less memory than the memcpy approach.

So my conclusion is that in this pure memory-copy scenario, memcpy and vmsplice are basically equivalent; neither clearly outperforms the other.

The splice family is better suited to scenarios that involve copying data to or from disk. If you are hesitating over whether to replace existing memcpy code with vmsplice, my advice is: no need. memcpy really is fast enough.

If you want to reprint, please indicate the source, thank you!

One more thing

Updated 1.12

After finishing the content above, I found I could not let it go. Lying in bed, I kept wondering: what if, instead of an inter-process data copy, the splice mechanism were used to completely replace the memcpy() function itself? So I wrapped it up as mymemcpy(), with exactly the same signature and parameters as memcpy. The code is as follows:

void *mymemcpy(void *dest, const void *src, size_t n)
{
    void *p = (void *)src, *q = dest;
    size_t send;
    size_t left = n;
    long ret;
    struct iovec iov[2];
    long nr_segs = 1;
    int flags = 0x1;
    pipe(pipefd);
    while (left > 0) {
        send = left > PIPE_BUF ? PIPE_BUF : left;
        iov[0].iov_base = p;
        iov[0].iov_len = send;
        ret = vmsplice(pipefd[1], &iov[0], nr_segs, flags);   /* user -> pipe */
        if (ret == -1) {
            perror("vmsplice failed");
            return NULL;
        }
        p += ret;
        iov[1].iov_base = q;
        iov[1].iov_len = send;
        ret = vmsplice(pipefd[0], &iov[1], nr_segs, flags);   /* pipe -> user */
        if (ret == -1) {
            perror("vmsplice failed");
            return NULL;
        }
        q += ret;
        left -= ret;
        //printf("mymemcpy() left=%d ret=%d\n", left, ret);
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return dest;
}

The data is mapped from src into the pipe and then from the pipe into dest, which amounts to doing the memcpy operation twice.
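
A quick way to sanity-check the wrapper (a sketch for this reprint; it assumes the mymemcpy() definition above, together with its int pipefd[2] global, is pasted into the same file or otherwise linked in, so the names here are purely illustrative):

/* check_mymemcpy.c (sketch): verifies that mymemcpy() reproduces the source
 * buffer. Assumes mymemcpy() and its global int pipefd[2] are available. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void *mymemcpy(void *dest, const void *src, size_t n);

int main(void)
{
    size_t n = 16 * 1024 * 1024;            /* 16 MB test buffer */
    char *src = malloc(n), *dst = malloc(n);
    if (!src || !dst) { perror("malloc"); return 1; }

    for (size_t i = 0; i < n; i++)          /* fill with a simple pattern */
        src[i] = (char)(i & 0xff);

    if (mymemcpy(dst, src, n) == NULL) {
        fprintf(stderr, "mymemcpy failed\n");
        return 1;
    }

    printf("buffers identical: %s\n", memcmp(src, dst, n) == 0 ? "yes" : "no");
    free(src);
    free(dst);
    return 0;
}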

OK, time to compile:

gcc zmd5.c mymemcpy.c -o mymemcpy 

Generate a file of random content:

dd if=/dev/urandom of=random bs=1M count=1000

Then run the other test script, run_mymemcpy_test.sh:

#!/bin/zsh
for ((i=1;i<=20;i++)); do
    echo "#" $i >> ret_mymem_1
    ./mymemcpy random >> ret_mymem_1
done

Extract the useful parts of the test data:

grep time

The results are listed below. Because they are very stable, I did not bother with statistics; just look at the raw data:

memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.150000s    mymemcpy() CPU time: 0.230000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.250000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.250000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.250000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.250000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.250000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.250000s
memcpy() CPU time: 0.140000s    mymemcpy() CPU time: 0.240000s

You can see that the time consumed by mymemcpy(), the vmsplice-based wrapper (0.24 to 0.25 s), is close to twice that of memcpy() (a stable 0.14 s). This also suggests that memcpy does not copy from user space into kernel space and back, but copies directly between user-space buffers. So you can keep using memcpy with confidence.

Test code and raw data

memcpy vs. vmsplice test code

mymemcpy vs. memcpy test code

Raw test results: the ret_mem_1, ret_mem_2, and ret_mymem_1 files (GitHub address)
