Thoughts on memcpy and the flexible use of type casting in high-concurrency servers

Source: Internet
Author: User

In network development, memory conversion of the following kind is often encountered:

#define PACKAGE_PARSE_ERROR -1
#define PACKAGE_PARSE_OK     0

int parse_package( int* a, int* b, int* c, int* d, char* buf, int buf_len )
{
        if( !buf || buf_len < 16 ){
                return PACKAGE_PARSE_ERROR;
        }
        memcpy( a, buf, 4 );
        memcpy( b, buf + 4, 4 );
        memcpy( c, buf + 8, 4 );
        memcpy( d, buf + 12, 4 );
        return PACKAGE_PARSE_OK;
}

This is a typical call in the unpacking step of network processing; the packing step is the reverse process.
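For completeness, the reverse (packing) direction might look like the following minimal sketch; the function name build_package and the 16-byte layout are assumptions for illustration, not part of the original code:

int build_package( char* buf, int buf_len, int a, int b, int c, int d )
{
        if( !buf || buf_len < 16 ){
                return PACKAGE_PARSE_ERROR;
        }
        /* copy each 4-byte field into the wire buffer, mirroring parse_package */
        memcpy( buf,      &a, 4 );
        memcpy( buf + 4,  &b, 4 );
        memcpy( buf + 8,  &c, 4 );
        memcpy( buf + 12, &d, 4 );
        return PACKAGE_PARSE_OK;
}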

Calls like these memcpy-based ones can be replaced by plain integer casts, and the efficiency at least doubles.

To illustrate the problem, we will give a simple example:

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>

int main()
{
        int s;
        char buffer[4];

        memcpy( &s, buffer, 4 );
        s = *(int*)(buffer);

        return 0;
}

The memcpy call and the cast on the line after it have the same effect: the first copies the bytes through memcpy, the second reads them through a pointer cast. For ease of comparison, let's look at the generated assembly:

        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        leaq    -16(%rbp), %rcx
        leaq    -4(%rbp), %rax
        movl    $4, %edx
        movq    %rcx, %rsi
        movq    %rax, %rdi
        call    memcpy
        leaq    -16(%rbp), %rax
        movl    (%rax), %eax
        movl    %eax, -4(%rbp)
        movl    $0, %eax
        leave

The code shows that the memcpy copy takes six instructions (five to set up the arguments plus the call to memcpy), while the cast takes only three, i.e. half as many instructions.

And that is not the whole story: the call to memcpy is a function call, which inevitably brings its own stack frame setup and teardown, so the cast is at least twice as efficient as the memory copy.

Let's take a look at the memcpy function implementation of glibc:

void *memcpy (void *dstpp, const void *srcpp, size_t len)
{
  unsigned long int dstp = (long int) dstpp;
  unsigned long int srcp = (long int) srcpp;

  if (len >= OP_T_THRES)
    {
      len -= (-dstp) % OPSIZ;
      BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);
      PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);
      WORD_COPY_FWD (dstp, srcp, len, len);
    }

  BYTE_COPY_FWD (dstp, srcp, len);

  return dstpp;
}

BYTE_COPY_FWD, PAGE_COPY_FWD_MAYBE and WORD_COPY_FWD are three copy strategies, selected by comparing len with OP_T_THRES. OP_T_THRES is typically 8 or 16. For copies where len is smaller than OP_T_THRES, glibc falls back to byte-by-byte copying: it walks the buffer one byte at a time, and every byte has to go through a memory-register-memory round trip, so the number of CPU instructions is at least double that of a single word-sized move.
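To make the per-byte cost concrete, a byte-wise forward copy of the kind BYTE_COPY_FWD expands to is roughly equivalent to the following sketch (an illustration of the idea, not the actual glibc macro):

/* Rough illustration of a byte-by-byte forward copy: every iteration
 * performs a load (memory -> register) and a store (register -> memory),
 * so copying 4 bytes costs at least 8 memory accesses plus loop overhead,
 * versus a single 4-byte load/store pair for an integer cast. */
static void byte_copy_fwd( char* dst, const char* src, size_t len )
{
        while( len-- > 0 ){
                *dst++ = *src++;
        }
}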

From the above analysis we can see that the cast saves a great deal of computation, at least doubling the efficiency. Don't underestimate this improvement: with tens of thousands of concurrent requests per second, each going through an unpacking and a packing step, this kind of change brings a considerable performance gain.

The unpacking function shown at the beginning can easily be rewritten with casts. Both versions are listed below:

int parse_package( int* a, int* b, int* c, int* d, char* buf, int buf_len )
{
        if( !buf || buf_len < 16 ){
                return PACKAGE_PARSE_ERROR;
        }
        memcpy( a, buf, 4 );
        memcpy( b, buf + 4, 4 );
        memcpy( c, buf + 8, 4 );
        memcpy( d, buf + 12, 4 );
        return PACKAGE_PARSE_OK;
}

int parse_package2( int* a, int* b, int* c, int* d, char* buf, int buf_len )
{
        int* ibuf;
        if( !buf || buf_len < 16 ){
                return PACKAGE_PARSE_ERROR;
        }
        ibuf = (int*)buf;
        *a = ibuf[0];
        *b = ibuf[1];
        *c = ibuf[2];
        *d = ibuf[3];
        return PACKAGE_PARSE_OK;
}
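Before comparing the generated assembly, here is a minimal usage sketch; the zero-filled buffer stands in for data that would normally come from the network:

char buf[ 16 ] = { 0 };   /* pretend this was filled by a recv() call */
int a, b, c, d;

if( parse_package2( &a, &b, &c, &d, buf, 16 ) == PACKAGE_PARSE_OK ){
        /* a, b, c, d now hold the four 4-byte fields of the packet */
}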

Parse_package assembly code:

parse_package:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $48, %rsp
        movq    %rdi, -8(%rbp)
        movq    %rsi, -16(%rbp)
        movq    %rdx, -24(%rbp)
        movq    %rcx, -32(%rbp)
        movq    %r8, -40(%rbp)
        movl    %r9d, -44(%rbp)
        cmpq    $0, -40(%rbp)
        je      .L2
        cmpl    $15, -44(%rbp)
        jg      .L3
.L2:
        movl    $-1, %eax
        jmp     .L4
.L3:
        movq    -40(%rbp), %rcx
        movq    -8(%rbp), %rax
        movl    $4, %edx
        movq    %rcx, %rsi
        movq    %rax, %rdi
        call    memcpy
        movq    -40(%rbp), %rax
        leaq    4(%rax), %rcx
        movq    -16(%rbp), %rax
        movl    $4, %edx
        movq    %rcx, %rsi
        movq    %rax, %rdi
        call    memcpy
        movq    -40(%rbp), %rax
        leaq    8(%rax), %rcx
        movq    -24(%rbp), %rax
        movl    $4, %edx
        movq    %rcx, %rsi
        movq    %rax, %rdi
        call    memcpy
        movq    -40(%rbp), %rax
        leaq    12(%rax), %rcx
        movq    -32(%rbp), %rax
        movl    $4, %edx
        movq    %rcx, %rsi
        movq    %rax, %rdi
        call    memcpy
        movl    $0, %eax

The .L3 block is the part we care about: it is where the four output values are assigned.

For each field, five instructions load the arguments and a sixth issues the call to memcpy. Inside memcpy itself, the prologue takes at least 3 more instructions and the copy and epilogue at least 4, so even without counting the return, each field costs more than 6 + 3 + 4 = 13 instructions.

Parse_package2 assembly code:

parse_package2:
.LFB1:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movq    %rdi, -24(%rbp)
        movq    %rsi, -32(%rbp)
        movq    %rdx, -40(%rbp)
        movq    %rcx, -48(%rbp)
        movq    %r8, -56(%rbp)
        movl    %r9d, -60(%rbp)
        cmpq    $0, -56(%rbp)
        je      .L7
        cmpl    $15, -60(%rbp)
        jg      .L8
.L7:
        movl    $-1, %eax
        jmp     .L9
.L8:
        movq    -56(%rbp), %rax
        movq    %rax, -8(%rbp)
        movq    -8(%rbp), %rax
        movl    (%rax), %edx
        movq    -24(%rbp), %rax
        movl    %edx, (%rax)
        movq    -8(%rbp), %rax
        addq    $4, %rax
        movl    (%rax), %edx
        movq    -32(%rbp), %rax
        movl    %edx, (%rax)
        movq    -8(%rbp), %rax
        addq    $8, %rax
        movl    (%rax), %edx
        movq    -40(%rbp), %rax
        movl    %edx, (%rax)
        movq    -8(%rbp), %rax
        addq    $12, %rax
        movl    (%rax), %edx
        movq    -48(%rbp), %rax
        movl    %edx, (%rax)
        movl    $0, %eax

The .L8 block is the corresponding main part. Each field is assigned in only 4 instructions (5 where an extra addq advances the buffer pointer): load the buffer pointer, read the value, load the output pointer, and store it.

In this example the cast version (parse_package2) executes far fewer instructions per field than the memcpy version (parse_package), more than 13 versus 4 or 5, so the performance can improve by at least a factor of two.

Therefore, in development we should minimize memory copying and use casts instead. On 64-bit servers we can even use an 8-byte long, as shown below:

long lv;
char buffer[ 8 ];

memcpy( &lv, buffer, 8 );
lv = *(long*)(buffer);

In this way, you can better use the multi-byte instructions of the CPU to improve performance.
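As an illustrative sketch of that idea, two adjacent 4-byte fields can be fetched with a single 8-byte load and then split. The helper name and field layout here are assumptions, and the split assumes a little-endian machine with the fields stored in native byte order:

#include <stdint.h>

/* Hypothetical helper: read two 4-byte fields stored back to back in buf
 * with one 8-byte load instead of two separate 4-byte copies.
 * Assumes a little-endian machine and native byte order in the buffer. */
static void read_two_fields( const char* buf, int32_t* a, int32_t* b )
{
        uint64_t v = *(const uint64_t*)buf;   /* one 8-byte load */
        *a = (int32_t)( v & 0xffffffffu );    /* low 4 bytes  -> first field  */
        *b = (int32_t)( v >> 32 );            /* high 4 bytes -> second field */
}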
