In network programming, code like the following often appears when raw buffer memory has to be converted into typed values:
    #include <string.h>

    #define PACKAGE_PARSE_ERROR -1
    #define PACKAGE_PARSE_OK 0

    /* Unpack four 4-byte integers from the start of buf. */
    int parse_package( int* a, int* b, int* c, int* d, char* buf, int buf_len )
    {
        if( !buf || buf_len < 16 ){
            return PACKAGE_PARSE_ERROR;
        }
        memcpy( a, buf, 4 );
        memcpy( b, buf + 4, 4 );
        memcpy( c, buf + 8, 4 );
        memcpy( d, buf + 12, 4 );
        return PACKAGE_PARSE_OK;
    }
This function is called while unpacking a network packet; packing is the reverse process.
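The reverse (packing) direction is not shown in the original text. A minimal sketch of what it might look like, assuming the same 16-byte layout and a 4-byte int, is given here; the name build_package and the PACKAGE_BUILD_* constants are hypothetical and chosen only to mirror parse_package.

```c
#include <assert.h>
#include <string.h>

#define PACKAGE_BUILD_ERROR -1
#define PACKAGE_BUILD_OK     0

/* Hypothetical counterpart to parse_package (not from the original text):
 * writes four 4-byte integers into buf in the order they are read back. */
int build_package(int a, int b, int c, int d, char *buf, int buf_len)
{
    /* Assumption carried over from the article: int is 4 bytes wide. */
    assert(sizeof(int) == 4);

    if (!buf || buf_len < 16) {
        return PACKAGE_BUILD_ERROR;
    }
    memcpy(buf,      &a, 4);
    memcpy(buf + 4,  &b, 4);
    memcpy(buf + 8,  &c, 4);
    memcpy(buf + 12, &d, 4);
    return PACKAGE_BUILD_OK;
}
```

A buffer filled by build_package(1, 2, 3, 4, buf, 16) can be passed straight to parse_package on the same machine to get the four values back.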
Copies like these can be replaced with a cast to an integer pointer. As the unoptimized assembly later in this article shows, doing so at least halves the number of instructions executed.
To illustrate the problem, we will give a simple example:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main()
    {
        int s;
        char buffer[4];

        memcpy(&s, buffer, 4 );    /* memory copy */
        s = *(int*)(buffer);       /* cast */
        return 0;
    }
The memcpy call and the assignment s = *(int*)(buffer) have the same effect: the first copies memory, the second uses a cast. To make them easier to compare, let's look at the assembly code:
    pushq %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq %rsp, %rbp
    .cfi_def_cfa_register 6
    subq $16, %rsp
    leaq -16(%rbp), %rcx
    leaq -4(%rbp), %rax
    movl $4, %edx
    movq %rcx, %rsi
    movq %rax, %rdi
    call memcpy
    leaq -16(%rbp), %rax
    movl (%rax), %eax
    movl %eax, -4(%rbp)
    movl $0, %eax
    leave
The assembly shows that the memory copy takes 6 instructions (from leaq -16(%rbp), %rcx through call memcpy), while the cast takes 3 (the leaq and the two movl instructions that follow the call), half as many.
The real gap is larger than that, because call memcpy is a function call, which inevitably brings its own stack setup and teardown. In this unoptimized build, the cast is therefore at least twice as efficient as the memory copy.
Let's take a look at the memcpy function implementation of glibc:
    void *memcpy (void *dstpp, const void *srcpp, size_t len )
    {
        unsigned long int dstp = (long int) dstpp;
        unsigned long int srcp = (long int) srcpp;

        if (len >= OP_T_THRES) {
            len -= (-dstp) % OPSIZ;
            BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);
            PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);
            WORD_COPY_FWD (dstp, srcp, len, len);
        }
        BYTE_COPY_FWD (dstp, srcp, len);
        return dstpp;
    }
The three macro calls inside the if block (BYTE_COPY_FWD, PAGE_COPY_FWD_MAYBE and WORD_COPY_FWD) are three copy strategies, and they only run when len is at least OP_T_THRES. OP_T_THRES is generally 8 or 16. For a copy where len is smaller than OP_T_THRES, glibc copies byte by byte: it traverses every byte, and each byte goes through a "memory-register-memory" round trip, so the number of CPU instructions is roughly twice the number of bytes copied.
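To make the byte-by-byte path concrete, here is a minimal sketch of the kind of loop it amounts to. This is a simplified stand-in, not glibc's actual BYTE_COPY_FWD macro, which may be implemented differently; the function name byte_copy_fwd is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a forward byte copy (not the real glibc macro):
 * copies len bytes one at a time, front to back. */
void *byte_copy_fwd(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (len > 0) {
        unsigned char byte = *s++;   /* memory -> register */
        *d++ = byte;                 /* register -> memory */
        len--;
    }
    return dst;
}
```

Each iteration performs one load and one store, which is where the estimate of about two instructions per byte comes from, before counting the loop overhead.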
From the analysis above, we can see that the cast saves a lot of computation and at least doubles efficiency. Don't underestimate this improvement: when a server handles tens of thousands of concurrent requests per second, and every request involves unpacking and packing, this change can bring a considerable performance gain.
The unpacking function from the beginning of the article can be converted to casts in a matter of seconds. The two versions are listed below:
    int parse_package( int* a, int* b, int* c, int* d, char* buf, int buf_len )
    {
        if( !buf || buf_len < 16 ){
            return PACKAGE_PARSE_ERROR;
        }
        memcpy( a, buf, 4 );
        memcpy( b, buf + 4, 4 );
        memcpy( c, buf + 8, 4 );
        memcpy( d, buf + 12, 4 );
        return PACKAGE_PARSE_OK;
    }
    int parse_package2( int* a, int* b, int* c, int* d, char* buf, int buf_len )
    {
        int* ibuf;

        if( !buf || buf_len < 16 ){
            return PACKAGE_PARSE_ERROR;
        }
        ibuf = (int*)buf;    /* explicit cast: view the buffer as an int array */
        *a = ibuf[0];
        *b = ibuf[1];
        *c = ibuf[2];
        *d = ibuf[3];
        return PACKAGE_PARSE_OK;
    }
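The cast in parse_package2 reads buf through an int pointer, which is only safe when buf is suitably aligned for int; on some CPUs a misaligned load faults or runs slowly. One possible way to keep the fast path without relying on the caller, sketched under the assumption of a 4-byte int, is to check the alignment and fall back to memcpy otherwise. The name parse_package_checked is hypothetical, and the sketch does not address the C aliasing rules, which additionally require that the bytes may legitimately be accessed as int.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PACKAGE_PARSE_ERROR -1
#define PACKAGE_PARSE_OK     0

/* Hypothetical variant of parse_package2 (not from the original text).
 * Uses the cast when buf is aligned for int, memcpy otherwise. */
int parse_package_checked(int *a, int *b, int *c, int *d,
                          char *buf, int buf_len)
{
    /* Assumption carried over from the article: int is 4 bytes wide. */
    assert(sizeof(int) == 4);

    if (!buf || buf_len < 16) {
        return PACKAGE_PARSE_ERROR;
    }

    if ((uintptr_t)buf % _Alignof(int) == 0) {
        /* Aligned: read the four fields directly through an int pointer. */
        const int *ibuf = (const int *)buf;
        *a = ibuf[0];
        *b = ibuf[1];
        *c = ibuf[2];
        *d = ibuf[3];
    } else {
        /* Misaligned: copy byte-wise so no unaligned int load is issued. */
        memcpy(a, buf,      4);
        memcpy(b, buf + 4,  4);
        memcpy(c, buf + 8,  4);
        memcpy(d, buf + 12, 4);
    }
    return PACKAGE_PARSE_OK;
}
```

The extra branch costs one comparison per packet; whether that is worth it depends on whether the receive buffers in a given program are always allocated with int alignment.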
Assembly code for parse_package:
    parse_package:
    .LFB0:
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq %rsp, %rbp
        .cfi_def_cfa_register 6
        subq $48, %rsp
        movq %rdi, -8(%rbp)
        movq %rsi, -16(%rbp)
        movq %rdx, -24(%rbp)
        movq %rcx, -32(%rbp)
        movq %r8, -40(%rbp)
        movl %r9d, -44(%rbp)
        cmpq $0, -40(%rbp)
        je .L2
        cmpl $15, -44(%rbp)
        jg .L3
    .L2:
        movl $-1, %eax
        jmp .L4
    .L3:
        movq -40(%rbp), %rcx
        movq -8(%rbp), %rax
        movl $4, %edx
        movq %rcx, %rsi
        movq %rax, %rdi
        call memcpy
        movq -40(%rbp), %rax
        leaq 4(%rax), %rcx
        movq -16(%rbp), %rax
        movl $4, %edx
        movq %rcx, %rsi
        movq %rax, %rdi
        call memcpy
        movq -40(%rbp), %rax
        leaq 8(%rax), %rcx
        movq -24(%rbp), %rax
        movl $4, %edx
        movq %rcx, %rsi
        movq %rax, %rdi
        call memcpy
        movq -40(%rbp), %rax
        leaq 12(%rax), %rcx
        movq -32(%rbp), %rax
        movl $4, %edx
        movq %rcx, %rsi
        movq %rax, %rdi
        call memcpy
        movl $0, %eax
The .L3 block is the main body. Take the assignment to a as an example:
The five instructions from movq -40(%rbp), %rcx through movq %rax, %rdi set up the arguments for memcpy, and together with call memcpy that makes 6 instructions. Inside memcpy, entering the function takes at least 3 instructions and leaving it takes at least 4, so even without counting the return instruction the total is more than 6 + 3 + 4 = 13.
Assembly code for parse_package2:
    parse_package2:
    .LFB1:
        .cfi_startproc
        pushq %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq %rsp, %rbp
        .cfi_def_cfa_register 6
        movq %rdi, -24(%rbp)
        movq %rsi, -32(%rbp)
        movq %rdx, -40(%rbp)
        movq %rcx, -48(%rbp)
        movq %r8, -56(%rbp)
        movl %r9d, -60(%rbp)
        cmpq $0, -56(%rbp)
        je .L7
        cmpl $15, -60(%rbp)
        jg .L8
    .L7:
        movl $-1, %eax
        jmp .L9
    .L8:
        movq -56(%rbp), %rax
        movq %rax, -8(%rbp)
        movq -8(%rbp), %rax
        movl (%rax), %edx
        movq -24(%rbp), %rax
        movl %edx, (%rax)
        movq -8(%rbp), %rax
        addq $4, %rax
        movl (%rax), %edx
        movq -32(%rbp), %rax
        movl %edx, (%rax)
        movq -8(%rbp), %rax
        addq $8, %rax
        movl (%rax), %edx
        movq -40(%rbp), %rax
        movl %edx, (%rax)
        movq -8(%rbp), %rax
        addq $12, %rax
        movl (%rax), %edx
        movq -48(%rbp), %rax
        movl %edx, (%rax)
        movl $0, %eax
The .L8 block is the main body. Take the assignment to a as an example:
It is the four instructions from movq -8(%rbp), %rax through the first movl %edx, (%rax), 4 in total.
In this example, the cast version (parse_package2) needs 4 instructions per field where the memory-copy version (parse_package) needs more than 13, so in this unoptimized build performance can be improved by at least a factor of two.
Therefore, we should minimize this kind of memory copy in our code and use casts instead. On 64-bit servers, we can even use an 8-byte long, as shown below:
    long lv;
    char buffer[ 8 ];

    memcpy( &lv, buffer, 8 );    /* memory copy */
    lv = *(long*)(buffer);       /* cast */
In this way, you can better use the multi-byte instructions of the CPU to improve performance.
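As an illustration of the same idea applied to the 16-byte packet header, the sketch below moves it as two 64-bit words instead of four 32-bit ones. It uses uint64_t rather than long, because long is not 8 bytes on every 64-bit platform. The helper name copy16_wide is hypothetical, and the code assumes both pointers are 8-byte aligned and refer to storage that may be accessed as uint64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original text): copies a 16-byte
 * block as two 64-bit words. Both pointers must be 8-byte aligned. */
void copy16_wide(void *dst, const void *src)
{
    uint64_t *d = dst;
    const uint64_t *s = src;

    /* Assumption stated above: callers pass 8-byte aligned pointers. */
    assert((uintptr_t)dst % sizeof(uint64_t) == 0);
    assert((uintptr_t)src % sizeof(uint64_t) == 0);

    d[0] = s[0];   /* bytes 0..7  */
    d[1] = s[1];   /* bytes 8..15 */
}
```

With this layout the four int fields travel in two loads and two stores, which is the kind of saving the 8-byte cast is meant to provide.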