In complex low-level network programs, memory copy, string comparison, and search operations can easily become performance bottlenecks. Although the compiler's built-in functions are optimized in a general way, they are constrained by instruction-set compatibility and fall far short of making full use of the hardware. Optimizing for a specific hardware platform can greatly improve the performance of such operations. Below I take the memory copy operation on the P4 (Pentium 4) platform as an example and, based on an example in an optimization document published by AMD, briefly introduce how to optimize memory bandwidth usage with specific instruction sets. Because of hardware limitations we did not reach the 300% memcpy performance improvement cited in the AMD document, but our tests still showed a significant improvement of 175%-200% (these figures may vary from machine to machine).
Optimizing Memory Bandwidth, from AMD
According to Moore's Law, CPU computing speed doubles roughly every 18 months, but the speed of memory and external storage (hard disks) cannot keep pace. The resulting mismatch between fast CPUs and relatively slow memory and peripherals has become the bottleneck of many programs. How can we make the most of the existing hardware? Optimizing the algorithm is the main approach. For memory copy operations, understanding and using the cache properly is critical. To pursue performance we sacrifice compatibility, so the discussion and code below all target P4 and later CPUs. Although AMD chips are implemented differently, their instruction sets and overall structure are much the same.
First, let's look at the simplest assembly implementation of memcpy:
Reference:
;
; Flier Lu (flier@nsfocus.com)
;
; nasmw.exe -f win32 fastmemcpy.asm -o fastmemcpy.obj
;
; extern "C" {
; extern void fast_memcpy1(void *dst, const void *src, size_t size);
; }
;
cpu p4

segment .text use32

global _fast_memcpy1

; After the two pushes below, [esp+8] holds the return address and the
; cdecl arguments follow in declaration order: dst, src, size.
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy1:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len]

  rep movsb

  pop edi
  pop esi
  ret
|
For portability, I use NASM-format assembly code. NASM is an excellent open-source assembler that supports a variety of platforms and object file formats and is widely used in open-source projects. Using it avoids both VC's inline assembly and the troublesome Unix-style AT&T syntax used by GCC.
The cpu p4 directive at the start of the code selects the P4 instruction set, because much of the later optimization work relies on P4 instructions and related features. The segment .text use32 directive that follows places the code in a 32-bit code segment. The global directive then exports the label _fast_memcpy1 as a global symbol so that C++ code can reach it after linking against the OBJ file. Finally, %define defines several macros for accessing the function parameters.
On the C++ side, you only need to declare the fast_memcpy1 function prototype and link against the .OBJ file that NASM produces. On the NASM command line, the -f option specifies the output object format, here Microsoft's 32-bit COFF format, and the -o option specifies the output file name.
The code above is very simple and well suited to quickly copying small blocks of memory. In fact, when the VC compiler handles small copies, it automatically replaces the memcpy call with inline rep movsb where appropriate, which eliminates the function call and stack operations and optimizes both code size and performance.
However, on the 32-bit x86 architecture there is no need to copy one byte at a time; replacing movsb with movsd is the natural choice.
Reference:
global _fast_memcpy2

; cdecl argument order: dst, src, size
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy2:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len]
  shr ecx, 2 ; convert to DWORD count

  rep movsd

  pop edi
  pop esi
  ret
|
To keep the examples simple, I assume that the lengths of the source and destination memory blocks are integer multiples of 64 bytes and that both blocks are aligned on 4 KB page boundaries. The former ensures that no single instruction accesses data across a cache line; the latter ensures that the test results are not affected by cross-page operations. I will explain why I make these assumptions in detail when analyzing the cache later.
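To make the size assumption concrete, here is a minimal C++ sketch of what the shr ecx, 2 / rep movsd pair does. The names copy_dwords and uncopied_tail are hypothetical and exist only for illustration; the point is that any bytes beyond a multiple of 4 are silently left uncopied, which is why the listings rely on the block size being a suitable multiple.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical C++ mirror of _fast_memcpy2: copies size / 4 DWORDs and
// returns the number of bytes that were actually copied.
std::size_t copy_dwords(void* dst, const void* src, std::size_t size) {
    assert(dst != nullptr && src != nullptr);
    std::size_t count = size >> 2;  // same as "shr ecx, 2"
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t word;
        std::memcpy(&word, s + i * 4, 4);  // one 4-byte load
        std::memcpy(d + i * 4, &word, 4);  // one 4-byte store
    }
    return count * 4;
}

// Hypothetical helper: number of trailing bytes that copy_dwords skips.
std::size_t uncopied_tail(std::size_t size) {
    return size & 3;
}
```

With a 10-byte block, only the first 8 bytes are copied; with a 64-byte block nothing is left over.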
However, most modern CPUs use long instruction pipelines, and several simple instructions executing in parallel are more efficient than one complex instruction. The AMD document therefore gives the following optimization:
Reference:
global _fast_memcpy3

; cdecl argument order: dst, src, size
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy3:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len]
  shr ecx, 2 ; convert to DWORD count

.copyloop:
  mov eax, dword [esi]
  mov dword [edi], eax

  add esi, 4
  add edi, 4

  dec ecx
  jnz .copyloop

  pop edi
  pop esi
  ret
|
The loop at the .copyloop label does exactly the same work as the rep movsd instruction, but because it consists of several instructions, the CPU pipeline can in theory process them in parallel. The AMD document states that this improves performance by 1.5%, but in my tests the effect was not noticeable. By comparison, the difference between the two approaches was quite noticeable when moving from the 486 to the Pentium architecture; I remember that in Delphi 3 or 4 this optimization alone greatly improved string-processing performance. Today's mainstream CPUs use microcode to emulate the CISC instruction set on a RISC-style core, so the effect is no longer noticeable.
Next, we can improve performance by unrolling the loop: processing more data per iteration and reducing the number of iterations.
Reference:
global _fast_memcpy4

; cdecl argument order: dst, src, size
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy4:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len]
  shr ecx, 4 ; convert to 16-byte size count

; edx is used as the second scratch register instead of ebx, because
; ebx is callee-saved under cdecl and is not pushed here.
.copyloop:
  mov eax, dword [esi]
  mov dword [edi], eax

  mov edx, dword [esi+4]
  mov dword [edi+4], edx

  mov eax, dword [esi+8]
  mov dword [edi+8], eax

  mov edx, dword [esi+12]
  mov dword [edi+12], edx

  add esi, 16
  add edi, 16

  dec ecx
  jnz .copyloop

  pop edi
  pop esi
  ret
|
However, the AMD document reports that this change actually reduces performance by 1.5%. Its explanation is that the memory reads and the memory writes have to be grouped so that the CPU can handle each kind in one go. The grouped version below is 3% faster than _fast_memcpy3.
Reference:
global _fast_memcpy5

; cdecl argument order: dst, src, size
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy5:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len]
  shr ecx, 4 ; convert to 16-byte size count

; edx replaces ebx as scratch register (ebx is callee-saved under cdecl)
.copyloop:
  mov eax, dword [esi]
  mov edx, dword [esi+4]
  mov dword [edi], eax
  mov dword [edi+4], edx

  mov eax, dword [esi+8]
  mov edx, dword [esi+12]
  mov dword [edi+8], eax
  mov dword [edi+12], edx

  add esi, 16
  add edi, 16

  dec ecx
  jnz .copyloop

  pop edi
  pop esi
  ret
|
Unfortunately, I saw no difference in actual tests on the P4. Haha, the P4 and AMD pipelines seem to be implemented slightly differently.
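For readers who prefer C++, the grouping idea can be sketched roughly as follows. The function name copy_grouped is hypothetical, and an optimizing compiler is free to reorder or vectorize these loads and stores, so this only illustrates the access pattern of _fast_memcpy5 and makes no claim about the resulting speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the grouped access pattern: two loads, then two stores, twice
// per iteration. Assumes size (in bytes) is a multiple of 16.
void copy_grouped(std::uint32_t* dst, const std::uint32_t* src, std::size_t size) {
    assert(size % 16 == 0);
    const std::size_t dwords = size / 4;
    for (std::size_t i = 0; i < dwords; i += 4) {
        // First group: both reads before both writes.
        std::uint32_t a = src[i];
        std::uint32_t b = src[i + 1];
        dst[i] = a;
        dst[i + 1] = b;
        // Second group: same pattern for the next 8 bytes.
        a = src[i + 2];
        b = src[i + 3];
        dst[i + 2] = a;
        dst[i + 3] = b;
    }
}
```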
Why not unroll even further? x86 has only a few general-purpose registers, but now we have MMX, haha, which gives us plenty of registers to work with. With the MMX registers, one group of loads and stores can handle 64 bytes of data, which gives a 7% performance improvement over _fast_memcpy5.
Reference:
global _fast_memcpy6

; cdecl argument order: dst, src, size
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy6:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len] ; number of QWORDs (8 bytes), assumes len / CACHEBLOCK is an integer
  shr ecx, 3

  lea esi, [esi+ecx*8] ; end of source
  lea edi, [edi+ecx*8] ; end of destination
  neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.copyloop:
  movq mm0, qword [esi+ecx*8]
  movq mm1, qword [esi+ecx*8+8]
  movq mm2, qword [esi+ecx*8+16]
  movq mm3, qword [esi+ecx*8+24]
  movq mm4, qword [esi+ecx*8+32]
  movq mm5, qword [esi+ecx*8+40]
  movq mm6, qword [esi+ecx*8+48]
  movq mm7, qword [esi+ecx*8+56]

  movq qword [edi+ecx*8], mm0
  movq qword [edi+ecx*8+8], mm1
  movq qword [edi+ecx*8+16], mm2
  movq qword [edi+ecx*8+24], mm3
  movq qword [edi+ecx*8+32], mm4
  movq qword [edi+ecx*8+40], mm5
  movq qword [edi+ecx*8+48], mm6
  movq qword [edi+ecx*8+56], mm7

  add ecx, 8
  jnz .copyloop

  emms

  pop edi
  pop esi

  ret
|
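The listing above uses one register as both pointer offset and loop counter: both pointers are moved to the end of their buffers, and a negative index counts up toward zero. A minimal C++ sketch of that trick follows; the name copy_negative_index is hypothetical, and unlike the assembly it moves one QWORD per iteration and tolerates a zero count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copies `count` 64-bit words using a single negative index that serves as
// both the offset into the buffers and the loop counter.
void copy_negative_index(std::uint64_t* dst, const std::uint64_t* src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    const std::uint64_t* src_end = src + count;  // like "lea esi, [esi+ecx*8]"
    std::uint64_t* dst_end = dst + count;        // like "lea edi, [edi+ecx*8]"
    std::ptrdiff_t i = -static_cast<std::ptrdiff_t>(count);  // like "neg ecx"
    while (i != 0) {
        dst_end[i] = src_end[i];  // a negative index reaches back into the buffer
        ++i;                      // reaching zero ends the loop, no separate compare
    }
}
```

The benefit in the assembly version is that a single add updates the position in both buffers and sets the zero flag for the loop branch.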
At this point the conventional optimization techniques are more or less exhausted, and more drastic measures are needed.
Let's go back and look at the cache structure of the P4 architecture.
The IA-32 Intel Architecture Software Developer's Manual, Volume 3: System Programming Guide
Chapter 10 of Intel's System Programming Guide describes memory cache control in the IA-32 architecture. Because of the huge gap between CPU speed and memory speed, CPU vendors speed up access to frequently used data with multiple levels of cache, both built into the CPU and external to it. Typically there are L1, L2, and L3 caches between the CPU and memory (plus several TLB caches, which are not covered here). Each level differs from the next in speed by about an order of magnitude, and their capacities differ greatly too (it really comes down to $, heh). The L1 cache is further divided into an instruction cache and a data cache for different purposes. In P4 and Xeon processors, the L1 instruction cache is replaced by the trace cache built into the NetBurst microarchitecture. The L1 data cache and the L2 cache are packaged inside the CPU; depending on the CPU grade, their sizes are 8-16 K and -K, while the L3 cache is implemented only in Xeon processors, is also packaged inside the CPU, and is about -M.
You can view the cache information of your machine with CPU information tools such as cpuinfo. For example, my system is:
P4 1.7 GHz, 8 KB L1 code cache, 12 KB L1 data cache, 256 KB L2 cache.
A cache is made up of a number of rows (slots, or lines). Each line holds the data of one contiguous range of memory addresses, and the cache manager controls how data is loaded and how reads and writes hit. I won't go into the principles here; if you are interested, consult the Intel manual. What you do need to know is that each line was 32 bytes before the P4 and is 64 bytes on the P4, and that operations on a cache line are all-or-nothing: even if you read only one byte, the entire 64-byte line has to be loaded. The optimizations that follow are largely based on these principles.
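The whole-line rule is easy to turn into arithmetic. The following C++ sketch, with the hypothetical helpers cache_lines_touched and crosses_line, counts how many lines an access touches, assuming the 64-byte P4 line size mentioned above (pass 32 for earlier CPUs).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

const std::size_t kCacheLine = 64;  // P4 line size per the text

// Number of cache lines that must be loaded to access `len` bytes
// starting at address `addr`.
std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t len,
                                std::size_t line = kCacheLine) {
    assert(line != 0);
    if (len == 0) return 0;
    std::uintptr_t first = addr / line;             // line holding the first byte
    std::uintptr_t last = (addr + len - 1) / line;  // line holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}

// True when a single access of `len` bytes at `addr` straddles two or more lines.
bool crosses_line(std::uintptr_t addr, std::size_t len,
                  std::size_t line = kCacheLine) {
    return cache_lines_touched(addr, len, line) > 1;
}
```

Reading a single byte still costs one full line, and an 8-byte access at offset 60 straddles two lines, which is what the 64-byte size and alignment assumptions in the listings avoid.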
The P4 supports six cache operating modes, which I won't describe one by one here. What matters for our optimization is how the cache behaves on memory writes. In the most common WT (write-through) mode, a write updates the cache and memory at the same time; in WB (write-back) mode, a write goes only to the cache, and the slow memory write is deferred. The two modes differ significantly in performance when handling memory variables that are accessed very frequently (millions of times per second). For example, writing a driver module that programs the MTRRs to force WB mode has produced good results when optimizing a Linux NIC driver. That does not help with memory copy, though: what we need is to skip the cache entirely, whether for lookup, loading, or writing.
Fortunately, the P4 provides the movntq instruction, which uses WC (write-combining) mode to bypass the cache and write directly to memory. Our memory writes are pure writes, and the written data will not be used again for some time, so both WT and WB modes would perform redundant cache operations. The optimized code is as follows:
Reference:
global _fast_memcpy7

; cdecl argument order: dst, src, size
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy7:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len] ; number of QWORDs (8 bytes), assumes len / CACHEBLOCK is an integer
  shr ecx, 3

  lea esi, [esi+ecx*8] ; end of source
  lea edi, [edi+ecx*8] ; end of destination
  neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.copyloop:
  movq mm0, qword [esi+ecx*8]
  movq mm1, qword [esi+ecx*8+8]
  movq mm2, qword [esi+ecx*8+16]
  movq mm3, qword [esi+ecx*8+24]
  movq mm4, qword [esi+ecx*8+32]
  movq mm5, qword [esi+ecx*8+40]
  movq mm6, qword [esi+ecx*8+48]
  movq mm7, qword [esi+ecx*8+56]

  movntq qword [edi+ecx*8], mm0
  movntq qword [edi+ecx*8+8], mm1
  movntq qword [edi+ecx*8+16], mm2
  movntq qword [edi+ecx*8+24], mm3
  movntq qword [edi+ecx*8+32], mm4
  movntq qword [edi+ecx*8+40], mm5
  movntq qword [edi+ecx*8+48], mm6
  movntq qword [edi+ecx*8+56], mm7

  add ecx, 8
  jnz .copyloop

  sfence ; flush write buffer
  emms

  pop edi
  pop esi

  ret
|
All the movq instructions that write to memory are changed to movntq, and after the copy completes, the sfence instruction is executed to flush the write buffers, because the contents of the cache may now be stale. This eliminates both the cache line loads that would otherwise accompany the memory writes and the cache's own bookkeeping, greatly reducing redundant memory traffic. According to AMD, this improves performance by 60%; I also measured a significant improvement of around 50%.
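The same idea can be tried from C++ through compiler intrinsics. The sketch below deliberately swaps in the 128-bit SSE2 streaming store _mm_stream_si128 in place of the MMX movntq used in the listing, and falls back to plain memcpy when SSE2 is not available. The function name copy_streaming and the macro HAVE_SSE2_STREAM are hypothetical; the sketch assumes the size is a multiple of 64 bytes and the destination is 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

// Copies `size` bytes with non-temporal (cache-bypassing) stores.
// Assumes size % 64 == 0 and a 16-byte-aligned destination.
void copy_streaming(void* dst, const void* src, std::size_t size) {
    assert(size % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
#ifdef HAVE_SSE2_STREAM
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    const std::size_t vectors = size / 16;
    for (std::size_t i = 0; i < vectors; i += 4) {
        // Group the four loads, then the four streaming stores (64 bytes).
        __m128i r0 = _mm_loadu_si128(s + i);
        __m128i r1 = _mm_loadu_si128(s + i + 1);
        __m128i r2 = _mm_loadu_si128(s + i + 2);
        __m128i r3 = _mm_loadu_si128(s + i + 3);
        _mm_stream_si128(d + i, r0);
        _mm_stream_si128(d + i + 1, r1);
        _mm_stream_si128(d + i + 2, r2);
        _mm_stream_si128(d + i + 3, r3);
    }
    _mm_sfence();  // order the write-combined stores before later accesses
#else
    std::memcpy(dst, src, size);  // portable fallback without cache bypass
#endif
}
```

Whether this beats the library memcpy depends on the machine and the buffer size, so it should be measured and not assumed.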
For instructions such as movntq and sfence, refer to Intel's instruction set manuals:
The IA-32 Intel Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M
The IA-32 Intel Architecture Software Developer's Manual, Volume 2B: Instruction Set Reference, N-Z
With the memory writes optimized, performance can be improved further by optimizing the memory reads. The CPU already prefetches automatically when reading data, but when working through a contiguous memory region, explicitly asking the CPU to prefetch can still improve performance noticeably.
Reference:
global _fast_memcpy8

; cdecl argument order: dst, src, size
%define param esp+8+4
%define dst param+0
%define src param+4
%define len param+8

_fast_memcpy8:
  push esi
  push edi

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len] ; number of QWORDs (8 bytes), assumes len / CACHEBLOCK is an integer
  shr ecx, 3

  lea esi, [esi+ecx*8] ; end of source
  lea edi, [edi+ecx*8] ; end of destination
  neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.writeloop:
  prefetchnta [esi+ecx*8+512] ; fetch ahead by 512 bytes

  movq mm0, qword [esi+ecx*8]
  movq mm1, qword [esi+ecx*8+8]
  movq mm2, qword [esi+ecx*8+16]
  movq mm3, qword [esi+ecx*8+24]
  movq mm4, qword [esi+ecx*8+32]
  movq mm5, qword [esi+ecx*8+40]
  movq mm6, qword [esi+ecx*8+48]
  movq mm7, qword [esi+ecx*8+56]

  movntq qword [edi+ecx*8], mm0
  movntq qword [edi+ecx*8+8], mm1
  movntq qword [edi+ecx*8+16], mm2
  movntq qword [edi+ecx*8+24], mm3
  movntq qword [edi+ecx*8+32], mm4
  movntq qword [edi+ecx*8+40], mm5
  movntq qword [edi+ecx*8+48], mm6
  movntq qword [edi+ecx*8+56], mm7

  add ecx, 8
  jnz .writeloop

  sfence ; flush write buffer
  emms

  pop edi
  pop esi

  ret
|
A single prefetchnta instruction is added, hinting the CPU to prefetch the 64-byte cache line located 512 bytes ahead while the current memory reads are being processed. This improves performance by about another 10%.
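In C++ the same hint is available through the _mm_prefetch intrinsic with _MM_HINT_NTA. The following is only a sketch under assumptions: the name copy_with_prefetch and the macro HAVE_PREFETCHNTA are hypothetical, the line size and prefetch distance are taken from the text, and unlike the assembly listing it skips the hint near the end of the buffer so that no address past the source is formed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_PREFETCHNTA 1
#endif

const std::size_t kLine = 64;    // cache line size on the P4
const std::size_t kAhead = 512;  // prefetch distance used in the listing

// Copies `size` bytes one cache line at a time while hinting the line
// 512 bytes ahead. Assumes size is a multiple of 64.
void copy_with_prefetch(void* dst, const void* src, std::size_t size) {
    assert(size % kLine == 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t off = 0; off < size; off += kLine) {
#ifdef HAVE_PREFETCHNTA
        // Only a hint: the CPU may ignore it, and correctness never depends on it.
        if (off + kAhead < size) {
            _mm_prefetch(reinterpret_cast<const char*>(s + off + kAhead), _MM_HINT_NTA);
        }
#endif
        std::memcpy(d + off, s + off, kLine);  // copy the current line
    }
}
```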
Finally, because prefetchnta is only a hint that the CPU is free to ignore, we can use explicit memory reads to force the memory about to be processed into the cache. This can bring a further performance improvement of about 60%; my tests did not show that much, but the gain was still noticeable.
Reference:
global _fast_memcpy9

; Three registers are pushed here, so the arguments start at esp+12+4.
; cdecl argument order: dst, src, size
%define param esp+12+4
%define dst param+0
%define src param+4
%define len param+8

%define CACHEBLOCK 400h

_fast_memcpy9:
  push esi
  push edi
  push ebx

  mov esi, [src] ; source array
  mov edi, [dst] ; destination array
  mov ecx, [len] ; number of QWORDs (8 bytes), assumes len / CACHEBLOCK is an integer
  shr ecx, 3

  lea esi, [esi+ecx*8] ; end of source
  lea edi, [edi+ecx*8] ; end of destination
  neg ecx ; use a negative offset as a combo pointer-and-loop-counter

.mainloop:
  mov eax, CACHEBLOCK / 16 ; note: .prefetchloop is unrolled 2x
  add ecx, CACHEBLOCK ; move up to end of block

.prefetchloop:
  mov ebx, [esi+ecx*8-64] ; read one address in this cache line...
  mov ebx, [esi+ecx*8-128] ; ... and one in the previous line
  sub ecx, 16 ; 16 QWORDs = 2 64-byte cache lines
  dec eax
  jnz .prefetchloop

  mov eax, CACHEBLOCK / 8

.writeloop:
  prefetchnta [esi+ecx*8+512] ; fetch ahead by 512 bytes

  movq mm0, qword [esi+ecx*8]
  movq mm1, qword [esi+ecx*8+8]
  movq mm2, qword [esi+ecx*8+16]
  movq mm3, qword [esi+ecx*8+24]
  movq mm4, qword [esi+ecx*8+32]
  movq mm5, qword [esi+ecx*8+40]
  movq mm6, qword [esi+ecx*8+48]
  movq mm7, qword [esi+ecx*8+56]

  movntq qword [edi+ecx*8], mm0
  movntq qword [edi+ecx*8+8], mm1
  movntq qword [edi+ecx*8+16], mm2
  movntq qword [edi+ecx*8+24], mm3
  movntq qword [edi+ecx*8+32], mm4
  movntq qword [edi+ecx*8+40], mm5
  movntq qword [edi+ecx*8+48], mm6
  movntq qword [edi+ecx*8+56], mm7

  add ecx, 8
  dec eax
  jnz .writeloop

  or ecx, ecx ; assumes integer number of cacheblocks
  jnz .mainloop

  sfence ; flush write buffer
  emms

  pop ebx
  pop edi
  pop esi

  ret
|
This completes the optimization of a full memory copy function. By understanding and exploiting the cache, we have outdone ourselves once more and finally reached a satisfying result. (The claimed performance improvement is 300%; the measured figure is 175%-200%, which is still quite good.)
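The two-pass structure of _fast_memcpy9 can be outlined in C++ as follows. This is a sketch under assumptions, not a tuned implementation: copy_block_prefetch is a hypothetical name, the block size of 8 KB comes from CACHEBLOCK being 400h QWORDs, and the copy pass uses plain memcpy where the listing uses MMX loads and streaming stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const std::size_t kLineBytes = 64;          // P4 cache line size
const std::size_t kBlockBytes = 0x400 * 8;  // CACHEBLOCK: 400h QWORDs = 8 KB

// Block prefetch: first touch every cache line of a block so it is loaded,
// then copy that block. Assumes size is a multiple of the block size.
void copy_block_prefetch(void* dst, const void* src, std::size_t size) {
    assert(size % kBlockBytes == 0);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t base = 0; base < size; base += kBlockBytes) {
        // Pass 1: one real read per line, walking backwards like .prefetchloop.
        // The volatile access keeps the compiler from dropping the reads.
        const volatile unsigned char* touch = s + base;
        for (std::size_t off = kBlockBytes; off != 0; off -= kLineBytes) {
            unsigned char sink = touch[off - kLineBytes];
            (void)sink;
        }
        // Pass 2: copy the block that is now expected to be in the cache.
        std::memcpy(d + base, s + base, kBlockBytes);
    }
}
```

Unlike a prefetch hint, the reads in the first pass cannot be ignored by the CPU, which is the point of the technique.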
Note the following when writing test code:
First, timing precision: a high-resolution hardware counter is required to avoid measurement error. I recommend using the rdtsc instruction and converting the cycle count to time using the CPU clock frequency. The clock frequency can be calculated dynamically with a high-resolution timer; I was lazy and simply read it from the registry.
The code is as follows:
Reference:
#ifdef WIN32
#include <windows.h>
typedef unsigned __int64 uint64_t;
#else
#include <stdint.h>
#endif

// Assumed definition: MEGA is used below but is not defined in the listing.
const uint64_t MEGA = 1000000;

// Windows only: reads the CPU clock frequency (in Hz) from the registry.
bool GetPentiumClockEstimateFromRegistry(uint64_t& frequency)
{
  HKEY hKey;

  frequency = 0;

  LONG rc = ::RegOpenKeyEx(HKEY_LOCAL_MACHINE, "Hardware\\Description\\System\\CentralProcessor\\0", 0, KEY_READ, &hKey);

  if (rc == ERROR_SUCCESS)
  {
    DWORD cbBuffer = sizeof(DWORD);
    DWORD freq_mhz;

    rc = ::RegQueryValueEx(hKey, "~MHz", NULL, NULL, (LPBYTE)(&freq_mhz), &cbBuffer);

    if (rc == ERROR_SUCCESS)
      frequency = freq_mhz * MEGA;

    RegCloseKey(hKey);
  }

  return frequency > 0;
}

// Stores the 64-bit time stamp counter (EDX:EAX after rdtsc) in timeStamp.
void GetTimeStamp(uint64_t& timeStamp)
{
#ifdef WIN32
  __asm
  {
    push edx
    push ecx
    mov ecx, timeStamp
    //_emit 0Fh // RDTSC
    //_emit 31h
    rdtsc
    mov [ecx], eax
    mov [ecx+4], edx
    pop ecx
    pop edx
  }
#else
  // The "=A" constraint maps to EDX:EAX on 32-bit x86 only.
  __asm__ __volatile__ ("rdtsc" : "=A" (timeStamp));
#endif
}
|
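Turning two time stamps into elapsed time is a single division by the clock frequency. The sketch below shows that conversion, plus a portable timer based on std::chrono::steady_clock as a stand-in for rdtsc on machines where the inline assembly above is not usable. Both ticks_to_seconds and time_seconds are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Converts a time stamp counter delta to seconds, given the clock
// frequency in Hz (for example the value read from the registry).
double ticks_to_seconds(std::uint64_t start, std::uint64_t end,
                        std::uint64_t frequency_hz) {
    assert(frequency_hz > 0 && end >= start);
    return static_cast<double>(end - start) / static_cast<double>(frequency_hz);
}

// Portable alternative (an assumption, not part of the original test code):
// times a callable with the standard monotonic clock.
template <typename Fn>
double time_seconds(Fn&& fn) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

On a 1.7 GHz CPU, for instance, a delta of 850,000,000 ticks corresponds to half a second.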
Second, the size of the buffers used in the copy test. If the buffers are too small, the first copy loads all the data of both buffers into the L2 cache, and you measure a value an order of magnitude higher than for ordinary memory operations. For example, my L2 cache is 256 KB; if both test buffers fit inside it, the measured speed is 10 times the normal memory copy speed no matter how many times the loop runs. The buffers therefore need to be made fairly large.
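A small harness along these lines might look like the sketch below. The helper names test_buffer_bytes and measure_copy_mbps are hypothetical, the factor of 8 times the L2 size is only an assumed rule of thumb, and std::memcpy stands in for whichever copy routine is being measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumed rule of thumb: make each buffer several times larger than the
// L2 cache so that the copy cannot be served from cache.
std::size_t test_buffer_bytes(std::size_t l2_bytes, std::size_t factor = 8) {
    return l2_bytes * factor;
}

// Copies `bytes` bytes `repeats` times and returns the throughput in MB/s
// (0.0 if the timer resolution is too coarse to measure the run).
double measure_copy_mbps(std::size_t bytes, int repeats) {
    assert(bytes > 0 && repeats > 0);
    std::vector<unsigned char> src(bytes, 0x5A);
    std::vector<unsigned char> dst(bytes, 0);
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        std::memcpy(dst.data(), src.data(), bytes);  // routine under test
    }
    auto t1 = std::chrono::steady_clock::now();
    // Read the result so the compiler cannot discard the copies.
    volatile unsigned char sink = dst[bytes - 1];
    (void)sink;
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    if (seconds <= 0.0) return 0.0;
    return static_cast<double>(bytes) * repeats / (1024.0 * 1024.0) / seconds;
}
```

With a 256 KB L2 cache this rule gives 2 MB per buffer; comparing the result against a run with buffers that fit in L2 shows how much the cache inflates the figure.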