CPU Cache and Memory Alignment

Source: Internet
Author: User
Tags: int, size, prefetch

Transferred from: http://blog.csdn.net/zhang_shuai_2011/article/details/38119657

The original text reads as follows:

I. Cache
When it comes to the cache, a few properties matter:
1) Cache hierarchy
Caches are organized in levels, usually L1, L2, L3 (L stands for "level"). Typically L1 and L2 are integrated into the CPU (on-chip cache) while L3 sits outside it (off-chip cache), though this is not absolute and differs between CPUs. Registers belong in this picture too: a register is not a cache, but keeping data in registers likewise improves performance.
2) Cache size
The cache capacity determines how much code and data can reside in the cache at once; only what fits avoids eviction, and only then is there room to optimize. If a program's hotspot alone already fills the entire cache, then optimizing from the cache perspective is pointless. The goal of optimization is to fit the program's hot code and data into the cache as far as possible. In practice it is hard to write a program whose single code path fills a whole cache; that much code with that much logic would be extraordinarily complex (basically impossible, at least I have never seen it).
3) Cache line size
The CPU loads data from memory one cache line at a time, and writes it back one cache line at a time. Data that is read and data that is written are therefore best kept in separate cache lines; otherwise they interfere with each other.
4) Cache associativity
In a fully associative cache, any memory block can be mapped to any cache line. In an N-way set-associative cache, the structure resembles a hash table: each set holds N lines, so N is effectively the length of the conflict chain, and once more than N blocks map to the same set, a line must be evicted.
5) Cache type
There is the i-cache (instruction cache), the d-cache (data cache), and the TLB (the MMU's cache), each potentially with its own L1, L2, and so on. Some caches separate instructions from data, while others are unified.
II. Code-level optimization
1) Byte alignment
To understand byte alignment, you first need to understand how system memory is organized. The smallest memory unit is the byte; bytes are grouped into words. In the 8086 era of 16-bit machines, 1 word = 2 bytes = 16 bits; on the 80386 and later 32-bit systems, 1 word = 4 bytes. Most machine instructions operate on words, such as adding two words. That is, a 32-bit CPU has 32-bit registers and its instructions operate on 32-bit words, while a 16-bit CPU has 16-bit registers and its move, add, subtract, and other instructions operate on 16-bit words. Because of this, memory is also addressed by word: on a 16-bit system, even an access to a single low byte performs a 16-bit memory access, with the A0 address line then selecting the low or high byte; this process constitutes one memory read (or write) cycle. Reading a 32-bit number on a 16-bit system therefore takes two memory read cycles (the low 16 bits, then the high 16 bits). By the same reasoning, a 32-bit CPU addresses memory in units of 4 bytes.
For efficiency, the compiler therefore places variables at start addresses that are a multiple of 2 on 16-bit systems and a multiple of 4 on 32-bit systems. An example illustrates the problem. Consider the memory layout of a 16-bit system, where dark blue marks the memory occupied by a 2-byte variable. [figure from the original post] If the variable's starting physical address is 0000H, accessing it takes only one memory read or write. But if its start address is 0001H, the CPU spends two read cycles on each access: to get the variable's low 8 bits it addresses the word starting at 0000H and extracts that word's high byte; it then addresses the word at 0002H and extracts that word's low byte, which is the variable's high 8 bits; finally it assembles the two. The value is still read correctly, but it takes two read cycles, so the efficiency is half that of the aligned case.
__attribute__((aligned(n))) declares the variable it modifies to be n-byte aligned.
The details of byte alignment depend on the compiler implementation, but in general three rules hold:
1) the start address of a variable (or struct) is divisible by the size of its widest base-type member;
2) the offset of each struct member from the start of the struct is an integer multiple of that member's size; where necessary the compiler inserts padding bytes between members (internal padding);
3) the total size of the struct is an integer multiple of the size of its widest base-type member; where necessary the compiler appends padding bytes after the last member (trailing padding).
__attribute__((packed)) tells the compiler to cancel the struct's optimized alignment during compilation and pack members to the number of bytes they actually occupy; it is GCC-specific syntax. Whether it takes effect depends on the compiler, not the operating system, and by default GCC does not pack structs. For example:
struct B { char b; int a; short c; };  /* default 4-byte alignment */
The members total 7 bytes, yet sizeof(struct B) is 12.
Next we tell the compiler to replace the default with an alignment we specify. Note that aligned(n) by itself can only raise a struct's alignment; to lower it below the default it must be combined with packed:
struct C { char b; int a; short c; } __attribute__((packed, aligned(2)));
Again the members total 7 bytes, but the packed size rounds up to the 2-byte alignment:
struct D { char b; int a; short c; } __attribute__((packed));
So sizeof(struct C) is 8, while sizeof(struct D) is 7.
2) Cache-line alignment
Data that straddles two cache lines costs two loads or two stores. Aligning a data structure to a cache line can therefore reduce reads and writes. On the other hand, cache-line-aligning the start address of every data structure can waste memory (especially in arrays of contiguously allocated structures), so space and time must be weighed against each other.
For ordinary code, aligning to memory boundaries also helps by reducing the amount of data exchanged between the cache and memory. The key point is that the cache itself is divided into many cache lines of a fixed length: an L1 cache line is typically 32 or 64 bytes, and L2 lines are larger, e.g. 64 or 128 bytes. Whenever a program accesses a variable in its address space that is not in the cache, the data must first be transferred into the cache from memory.
For example, take a variable int x occupying 4 bytes with start address 0x1234567f, i.e. the memory range 0x1234567f~0x12345682. If the cache line is 32 bytes long, memory is exchanged with the cache only in line-sized chunks whose start addresses are multiples of 0x20. Here 0x1234567f falls in the line 0x12345660~0x1234567f, while 0x12345680~0x12345682 falls in the line 0x12345680~0x1234569f, which means that loading this single 4-byte integer into the cache requires transferring two cache lines of data. Had the start address of x been aligned to a multiple of 4, say 0x1234567c~0x1234567f, it would necessarily sit within one cache line, and each access to x would load only one line. This is also why malloc() returns addresses that are at least 8-byte aligned: it gives most programs better performance.
1. Alignment via __attribute__((aligned(cache_line))):
struct syn_str { int s_variable; } __attribute__((aligned(cache_line)));
2. Computing aligned addresses

Background:
int a;
int size = 8;  /* 1000 in binary */
To round a down to a multiple of size: written in binary, the result must have its last three bits equal to 0, which is achieved by ANDing a with the following number:
11111111 11111111 11111111 11111000
That number is simply ~(size-1), and can be called the alignment mask for size, size_mask.
Rounding a down or up to a multiple of size:
#define alignment_down(a, size)  ((a) & ~((size) - 1))
#define alignment_up(a, size)    (((a) + (size) - 1) & ~((size) - 1))
Note the upper-bound computation: is it enough to just add 8 when a is not a multiple? No; if a is already a multiple of 8, adding 8 would overshoot. Hence (size-1) is added to a first, and the sum is then ANDed with the alignment mask.
For example:
a=0, size=8: alignment_down(a, size) = 0, alignment_up(a, size) = 0.
a=6, size=8: alignment_down(a, size) = 0, alignment_up(a, size) = 8.
a=8, size=8: alignment_down(a, size) = 8, alignment_up(a, size) = 8.
a=14, size=8: alignment_down(a, size) = 8, alignment_up(a, size) = 16.
Note: size must be a power of two, i.e. 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 ...

Implementation example:
struct syn_str { int s_variable; };
void *p = malloc(sizeof(struct syn_str) + cache_line);
struct syn_str *align_p = (struct syn_str *)(((uintptr_t)p + (cache_line - 1)) & ~(uintptr_t)(cache_line - 1));
3) Branch prediction
Code is laid out sequentially in memory. For branching code, if the code immediately following a branch statement is the more likely to execute, jumps are reduced; and since CPUs generally prefetch instructions, the prefetch hit rate rises. Branch-prediction hints use macros such as likely/unlikely, which generally require compiler support; this is static branch prediction. Many CPUs also cache the outcomes of executed branch instructions inside the CPU (a cache of branch results), which makes static prediction less meaningful: if a branch is meaningful at all, both sides will be taken sometimes, so the gain from static prediction in such cases is small, while likely/unlikely clutters the code and hurts readability. For these reasons the technique is generally not recommended.
if (likely(value)) is equivalent to if (value)
if (unlikely(value)) is equivalent to if (value)
That is, from the point of view of reading and understanding the code, likely() and unlikely() are identical!

The two macros are defined in the kernel as follows:
#define likely(x)   __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
__builtin_expect() is provided by GCC (version >= 2.96) so that the programmer can pass branch-direction information to the compiler, letting it optimize the code layout and reduce the performance cost of instruction jumps.
__builtin_expect((x), 1) indicates that x is more likely to be true;
__builtin_expect((x), 0) indicates that x is more likely to be false.
That is, with likely() the statement after the if is the more likely path, and with unlikely() the statement after the else is the more likely path.
For example, in the following code the author believes prev != next is the more likely case:
if (likely(prev != next)) {
    next->timestamp = now;
    ...
} else {
    ...
}
The compiler then keeps the more likely code on the fall-through path during code generation, reducing the performance cost of instruction jumps.
Two more examples deepen the point.
First example: example1.c
int testfun(int x)
{
    if (__builtin_expect(x, 0)) {
        /* ^^^ we tell the compiler the "else" block is more probable */
        x = 5;
        x = x * x;
    } else {
        x = 6;
    }
    return x;
}
Here we expect x to be 0 most of the time.
After compiling, objdump shows the assembly; on my 2.4-kernel machine the result is:
# gcc -O2 -c example1.c
# objdump -d example1.o
Disassembly of section .text:

00000000 <testfun>:
   0:	55                	push   %ebp
   1:	89 e5             	mov    %esp,%ebp
   3:	8b 45 08          	mov    0x8(%ebp),%eax
   6:	85 c0             	test   %eax,%eax
   8:	75 07             	jne    11 <testfun+0x11>
   a:	b8 06 00 00 00    	mov    $0x6,%eax
   f:	c9                	leave
  10:	c3                	ret
  11:	b8 19 00 00 00    	mov    $0x19,%eax
  16:	eb f7             	jmp    f <testfun+0xf>

As you can see, the compiler emits jne (jump if not equal), and the else-block code follows immediately:
   8:	75 07             	jne    11 <testfun+0x11>
   a:	b8 06 00 00 00    	mov    $0x6,%eax

Second example: example2.c
int testfun(int x)
{
    if (__builtin_expect(x, 1)) {
        /* ^^^ we tell the compiler the "if" block is more probable */
        x = 5;
        x = x * x;
    } else {
        x = 6;
    }
    return x;
}
Here we expect x to be nonzero most of the time.
After compiling, objdump shows the assembly; on my 2.4-kernel machine the result is:
# gcc -O2 -c example2.c
# objdump -d example2.o

Disassembly of section .text:

00000000 <testfun>:
   0:	55                	push   %ebp
   1:	89 e5             	mov    %esp,%ebp
   3:	8b 45 08          	mov    0x8(%ebp),%eax
   6:	85 c0             	test   %eax,%eax
   8:	74 07             	je     11 <testfun+0x11>
   a:	b8 19 00 00 00    	mov    $0x19,%eax
   f:	c9                	leave
  10:	c3                	ret
  11:	b8 06 00 00 00    	mov    $0x6,%eax
  16:	eb f7             	jmp    f <testfun+0xf>

This time the compiler emits je (jump if equal), and the if-block code follows immediately:
   8:	74 07             	je     11 <testfun+0x11>
   a:	b8 19 00 00 00    	mov    $0x19,%eax

