The effect of write combining in modern CPUs on program performance


For modern CPUs, the performance bottleneck is memory access: CPU speeds are often at least two orders of magnitude faster than main memory. CPUs therefore introduced the L1 and L2 caches, and higher-end CPUs add an L3 cache as well.
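As a quick aside, on Linux with glibc you can query the sizes of these cache levels at run time. The following minimal sketch assumes the glibc-specific _SC_LEVEL* sysconf names are available; they are not part of POSIX and may report 0 or -1 on other platforms:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* _SC_LEVEL* are glibc extensions, not POSIX; a value of 0 or -1
       means the platform does not report that cache level. */
    printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("Cache line:    %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}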

Obviously, this technique raises the next question: if the memory a CPU needs while executing an instruction is not in the cache, the CPU must fetch it from main memory over the memory bus. What happens while the data is on its way back (a single main-memory access costs roughly the time of executing hundreds or even thousands of instructions)? The answer is that the CPU keeps executing other eligible instructions. For example, given the instruction sequence instruction 1, instruction 2, instruction 3, ..., if instruction 1 needs to access main memory, then until the data returns the CPU continues with subsequent "independent instructions" that have no logical dependence on instruction 1. The CPU generally infers this independence from the memory references between instructions; details can be found in each CPU's documentation. This is one of the root causes of out-of-order instruction execution in CPUs.
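To make the idea concrete, here is a hypothetical sketch (the array big and the variables c and d are made up for illustration): if the load of big[i] misses the cache, the CPU can still execute the line that does not depend on its result while the load is in flight, but the dependent line must wait:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical illustration of instruction-level independence; the
   array size and all names are invented for this sketch. */
enum { N = 1 << 24 };
static int big[N];

int main(void)
{
    int i = rand() % N;   /* an index the compiler cannot predict */
    int c = 3, d = 4;
    int b = big[i];       /* "instruction 1": this load may miss the
                             cache and stall on main memory */
    int x = c + d;        /* independent of b: the CPU can execute this
                             while the load is still in flight */
    int y = b * 2;        /* depends on b: must wait for the load */
    printf("%d %d\n", x, y);
    return 0;
}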

The above is the CPU's performance remedy for read latency. Writing data is a little more complicated:

When the CPU executes a store instruction, it first tries to write the data to the L1 cache, the level closest to the CPU; if the L1 misses, it moves to the next cache level. The L1 cache runs at roughly the same speed as the CPU, while the L2 cache is noticeably slower, on the order of 20-30 times slower than the CPU; the L2 can miss too, costing still more cycles on the way to main memory. In fact, after an L1 miss the CPU brings in an extra set of buffers known as write-combining (merge write) store buffers, and the technique is called write combining. While the request for ownership of the target L2 cache line is still outstanding, the CPU writes the pending data into a write-combining buffer, which is typically 64 bytes, the size of one cache line. These buffers allow the CPU to continue executing other instructions while buffer data is being written or read, which softens the performance impact of a cache miss on stores.
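The 64-byte granularity here is simply the cache-line size of typical x86 CPUs. As a minimal sketch (assuming C11 and a 64-byte line, which is common but not universal), this is how a program can place data so that it occupies exactly one such line:

#include <stdalign.h>
#include <stdio.h>

/* A 64-byte, cache-line-aligned block: the same granularity the
   write-combining buffers operate on (64 bytes is typical for x86,
   but not guaranteed on every CPU). */
struct cache_line {
    alignas(64) unsigned char bytes[64];
};

int main(void)
{
    struct cache_line line;
    printf("size = %zu, address %% 64 = %zu\n",
           sizeof line, (size_t)&line % 64);
    return 0;
}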

These buffers become very interesting when subsequent stores need to modify the same cache line: later writes can be merged into the buffer before its contents are committed to the L2 cache. Each 64-byte buffer maintains a 64-bit field with one bit per byte; when the buffer is transferred to the outer cache, those bits indicate which bytes hold valid data. And of course, if the program reads data that has already been written into the buffer, the buffer is snooped before the cache is read, so the read still sees the buffered values.
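A toy software model may help to picture this (purely illustrative, not how the hardware is implemented): 64 data bytes paired with a 64-bit mask, where writing byte i sets bit i so that the eventual flush knows which bytes are valid:

#include <stdint.h>
#include <stdio.h>

/* Toy model of one write-combining buffer: 64 data bytes plus a
   64-bit valid mask, one bit per byte. Purely illustrative. */
struct wc_buffer {
    uint8_t  data[64];
    uint64_t valid;   /* bit i set => data[i] holds a pending write */
};

static void wc_write(struct wc_buffer *b, unsigned offset, uint8_t byte)
{
    b->data[offset] = byte;              /* merge the new byte...   */
    b->valid |= (uint64_t)1 << offset;   /* ...and mark it valid    */
}

int main(void)
{
    struct wc_buffer b = {{0}, 0};
    wc_write(&b, 0, 0xAA);   /* two stores to the same line merge   */
    wc_write(&b, 7, 0xBB);   /* into one buffer before the flush    */
    printf("valid mask = %#llx\n", (unsigned long long)b.valid);
    return 0;
}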

Through the steps above, the data in the buffer is transferred to the outer cache (L2) at some point in time. If we can fill a buffer as fully as possible before it is transferred, the transfer bus at every level is used more efficiently, which improves program performance.

Let's look at a concrete example. The basic logic of the following test code is visible from the code itself:

#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#include <limits.h>

static const int iterations = INT_MAX;
static const int items = 1 << 24;
static int mask;

static int arraya[1 << 24];
static int arrayb[1 << 24];
static int arrayc[1 << 24];
static int arrayd[1 << 24];
static int arraye[1 << 24];
static int arrayf[1 << 24];
static int arrayg[1 << 24];
static int arrayh[1 << 24];

double run_one_case_for_8(void)
{
    double start_time;
    double end_time;
    struct timeval start;
    struct timeval end;
    int i = iterations;

    gettimeofday(&start, NULL);
    while (--i != 0)
    {
        int slot = i & mask;
        int value = i;
        /* eight distinct memory locations written per iteration */
        arraya[slot] = value;
        arrayb[slot] = value;
        arrayc[slot] = value;
        arrayd[slot] = value;
        arraye[slot] = value;
        arrayf[slot] = value;
        arrayg[slot] = value;
        arrayh[slot] = value;
    }
    gettimeofday(&end, NULL);

    start_time = (double) start.tv_sec + (double) start.tv_usec / 1000000.0;
    end_time = (double) end.tv_sec + (double) end.tv_usec / 1000000.0;
    return end_time - start_time;
}

double run_two_case_for_4(void)
{
    double start_time;
    double end_time;
    struct timeval start;
    struct timeval end;
    int i = iterations;

    gettimeofday(&start, NULL);
    while (--i != 0)
    {
        int slot = i & mask;
        int value = i;
        /* only four distinct memory locations written per iteration */
        arraya[slot] = value;
        arrayb[slot] = value;
        arrayc[slot] = value;
        arrayd[slot] = value;
    }
    i = iterations;
    while (--i != 0)
    {
        int slot = i & mask;
        int value = i;
        arrayg[slot] = value;
        arraye[slot] = value;
        arrayf[slot] = value;
        arrayh[slot] = value;
    }
    gettimeofday(&end, NULL);

    start_time = (double) start.tv_sec + (double) start.tv_usec / 1000000.0;
    end_time = (double) end.tv_sec + (double) end.tv_usec / 1000000.0;
    return end_time - start_time;
}

int main(void)
{
    mask = items - 1;
    int i;

    printf("Test begin---->\n");
    for (i = 0; i < 3; i++)
    {
        printf("%d, run_one_case_for_8: %lf\n", i, run_one_case_for_8());
        printf("%d, run_two_case_for_4: %lf\n", i, run_two_case_for_4());
    }
    printf("Test end\n");
    return 0;
}
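To try it yourself, a plausible way to build and run the test (the file name wc_test.c is an assumption; compiling without optimization keeps the compiler from coalescing or eliminating the stores and distorting the measurement):

gcc -O0 -o wc_test wc_test.c
./wc_test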

I believe many people would assume that run_two_case_for_4 must take longer than run_one_case_for_8, since the former executes at least one extra loop's worth of counter updates. But that is not the case. Here is a run:

Test environment: Fedora, 64-bit; 4GB DDR3 memory; CPU: Intel® Core™ i7-3610QM @ 2.30GHz.

The result is astonishing: the two differ in performance by roughly a factor of two, with run_one_case_for_8 taking about twice as long.

Principle: the write-combining buffers mentioned above sit close to the CPU and hold only 64 bytes each, so they are very small and presumably expensive. Their number is also limited; on my CPU there are 4 of them. The exact count depends on the CPU model, but Intel CPUs can only use 4 of these buffers at a time.

As a result, the run_one_case_for_8 function writes to 8 different memory locations in each pass of its loop: once 4 of them have claimed all the write-combining buffers, the CPU has to wait for those buffers to be flushed to the L2 cache, so it is forced to stall. The run_two_case_for_4 function, by contrast, writes to only 4 different locations per loop pass, which uses the write-combining buffers well and greatly reduces the number of buffer-induced stalls; the same holds, of course, whenever each loop writes to at most 4 distinct locations. Even though it performs one extra loop's worth of counter updates (you might object that the counter i is also written, but in fact i is kept in a register), the performance gap remains large.

As this example shows, these low-level CPU features are not transparent to programmers: a small change in a program can produce a significant performance gain. For store-intensive programs, this feature is well worth taking into account.

I hope this article is of some help to you, and that it serves as a reference for colleagues working on performance optimization.
