This article is part of the series "Understanding Computer Architecture from a Java Perspective". As is well known, the CPU is the brain of the computer and is responsible for executing a program's instructions, while memory holds data, including the program's own data. It is equally well known that memory is much slower than the CPU. Thirty years ago, however, CPU frequency and memory bus frequency were at the same level, and accessing memory was only slightly slower than accessing a CPU register. Because memory development has been constrained by technology and cost, fetching a single piece of data from memory now takes more than 200 CPU cycles, whereas accessing a CPU register typically takes 1 CPU cycle.
CPU Cache
A web browser caches previously viewed data locally to speed up browsing; traditional and NoSQL databases often keep a cache in memory to speed up queries and reduce (slow) disk I/O. In the same way, memory is too far from the CPU, so CPU designers added a cache (the CPU cache) to the CPU. If you operate on the same batch of data many times, putting that data in a cache closer to the CPU gives the program a large speed boost. For example, in a counting loop, keeping the counter variable in the cache means the loop does not have to go to memory for it on every iteration.
[Figure: simple diagram of the CPU cache]
With the rise of multicore processors, the CPU cache has been divided into three levels: L1, L2 and L3. The lower the level number, the closer the cache is to the CPU, the faster it is, and the smaller its capacity. L1 is closest to the CPU; it has the smallest capacity (for example 32K) and is the fastest. Each core has its own L1 cache (strictly speaking, two per core: L1d for data and L1i for instructions). The L2 cache is larger (for example 256K) and slower, and generally each core has its own L2 cache. The L3 cache is the largest of the three levels (for example 12MB) and also the slowest; the cores in the same CPU socket share one L3 cache.
From CPU to | Approximate CPU cycles | Approximate time (ns)
Register | 1 cycle |
L1 Cache | ~3-4 cycles | ~0.5-1 ns
L2 Cache | ~10-20 cycles | ~3-7 ns
L3 Cache | ~40-45 cycles | ~15 ns
Cross-socket transfer | | ns
Memory | ~120-240 cycles | ~60-120 ns
Interested readers can run cat /proc/cpuinfo on Linux, or lscpu on Ubuntu, to see the caches on their own machine. More detail is available through the following commands:
Shell Code
$ cat /sys/devices/system/cpu/cpu0/cache/index0/size
32K
$ cat /sys/devices/system/cpu/cpu0/cache/index0/type
Data
$ cat /sys/devices/system/cpu/cpu0/cache/index0/level
1
$ cat /sys/devices/system/cpu/cpu3/cache/index3/level
3
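The same sysfs files can also be read from Java. The following is only a minimal sketch: the class name CacheInfo and its helper methods are made up for illustration, and it assumes a Linux system that exposes the level, type and size files used in the commands above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the original article.
public class CacheInfo {
    // Reads one attribute file (for example "size") and trims the trailing newline.
    public static String read(Path file) {
        try {
            return new String(Files.readAllBytes(file)).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns one line per indexN directory found under cacheDir, sorted by name.
    public static List<String> describe(Path cacheDir) {
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(cacheDir, "index*")) {
            for (Path dir : dirs) {
                lines.add(dir.getFileName() + ": L" + read(dir.resolve("level"))
                        + " " + read(dir.resolve("type"))
                        + " " + read(dir.resolve("size")));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) {
        // Same directory as in the shell commands above (cpu0 only).
        Path cacheDir = Paths.get("/sys/devices/system/cpu/cpu0/cache");
        if (!Files.isDirectory(cacheDir)) {
            System.out.println("No cache information at " + cacheDir);
            return;
        }
        describe(cacheDir).forEach(System.out::println);
    }
}
```

On a machine without that directory (for example a non-Linux system) the program only prints a notice and exits.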
As with a database cache, a data fetch first looks in the fastest cache; if that misses (a cache miss), the lookup moves down a level, and only when the data is not found even in the L3 cache does it go to memory. The further down the lookup has to go before it hits, the longer it takes to get the data.
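To get a feel for how misses add up, the lookup can be modelled as an expected cost per access. This is only an illustrative sketch: the latencies are single values picked from within the ranges in the table above, and the hit rates are hypothetical numbers chosen for the example, not measurements.

```java
// Hypothetical cost model, not part of the original article.
public class LookupCost {
    // Expected cycles per access when the levels are tried in order.
    // latency[k] is the total cost of getting data from level k;
    // hitRate[k] is the chance the data is found at level k, given that all
    // faster levels missed. The last level (memory) should have hit rate 1.0.
    public static double expectedCycles(double[] latency, double[] hitRate) {
        double expected = 0.0;
        double reach = 1.0; // probability that the lookup gets as far as this level
        for (int k = 0; k < latency.length; k++) {
            expected += reach * hitRate[k] * latency[k];
            reach *= 1.0 - hitRate[k];
        }
        return expected;
    }

    public static void main(String[] args) {
        // L1, L2, L3, memory: values taken from within the table's ranges.
        double[] latency = {4, 15, 42, 200};
        // Assumed hit rates, for illustration only.
        double[] hitRate = {0.90, 0.80, 0.70, 1.0};
        System.out.println("expected cycles per access = " + expectedCycles(latency, hitRate));
        // If every cache level misses, each access pays the full memory latency.
        double[] allMiss = {0.0, 0.0, 0.0, 1.0};
        System.out.println("all misses                 = " + expectedCycles(latency, allMiss));
    }
}
```

Even with a 90% L1 hit rate in this model, the rare trips to memory account for a noticeable share of the average cost.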
Cache Lines
For efficiency, the cache does not simply store individual pieces of data one at a time. It is made up of cache lines, and a typical line is 64 bytes. Readers can check coherency_line_size with the shell command below to find out how large the cache line is on their machine.
Shell Code
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
The CPU accesses the cache with the cache line as the smallest unit. To keep things simple, I will not go into cache associativity here. A Java long is 8 bytes, so one cache line holds 8 long variables. So when you access a long array and one long is loaded into the cache, the other 7 in the same line are loaded at no extra cost, and you can traverse the array very quickly.
Experiment and Analysis
In Java programming, ignoring the CPU cache leads to inefficient programs. For example, the following program uses a two-dimensional long array. When it runs on my 32-bit laptop, the array is laid out in memory as follows:
On a 32-bit machine, a Java array object header takes 16 bytes (see the link for details). Adding one row of 62 longs (496 bytes) gives 512 bytes in total, so the rows of this two-dimensional array are laid out one after another in memory.
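The cache-line and row-size arithmetic above can be checked with a few lines of code. This is a sketch under stated assumptions: a 64-byte cache line, the 16-byte array header quoted above, rows packed back to back, and reads that start on a cache-line boundary. The class and method names are made up for illustration.

```java
// Hypothetical helper, not part of the original article.
public class CacheLineMath {
    static final int LINE_BYTES = 64;   // typical cache line size from the text
    static final int HEADER_BYTES = 16; // array header size quoted for the 32-bit JVM

    // How many elements of the given size fit in one cache line.
    public static int elementsPerLine(int elementBytes) {
        return LINE_BYTES / elementBytes;
    }

    // Bytes occupied by one row: the array header plus the long elements.
    public static int rowBytes(int longsPerRow) {
        return HEADER_BYTES + longsPerRow * Long.BYTES;
    }

    // Distinct cache lines touched by `count` reads that start on a line
    // boundary and advance by `strideBytes` each time.
    public static long linesTouched(long count, long strideBytes) {
        long lines = 0;
        long last = -1;
        for (long i = 0; i < count; i++) {
            long line = (i * strideBytes) / LINE_BYTES;
            if (line != last) {
                lines++;
                last = line;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        int row = rowBytes(62);
        System.out.println("longs per cache line: " + elementsPerLine(Long.BYTES));
        System.out.println("bytes per row:        " + row);
        System.out.println("cache lines per row:  " + row / LINE_BYTES);
        // 1024 reads along a row (8-byte steps) versus down a column (one row per step).
        System.out.println("lines touched, row-wise:    " + linesTouched(1024, Long.BYTES));
        System.out.println("lines touched, column-wise: " + linesTouched(1024, row));
    }
}
```

Under these assumptions, stepping along a row touches a new cache line only once every 8 reads, while stepping down a column lands on a different cache line with every read.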
Java code
public class L1CacheMiss {
    private static final int RUNS = 10;
    private static final int DIMENSION_1 = 1024 * 1024;
    private static final int DIMENSION_2 = 62;

    private static long[][] longs;

    public static void main(String[] args) throws Exception {
        Thread.sleep(10000); // wait 10 seconds before starting
        longs = new long[DIMENSION_1][];
        for (int i = 0; i < DIMENSION_1; i++) {
            longs[i] = new long[DIMENSION_2];
            for (int j = 0; j < DIMENSION_2; j++) {
                longs[i][j] = 0L;
            }
        }
        System.out.println("Starting ....");
        // Only the traversal below is timed.
        final long start = System.nanoTime();
        long sum = 0L;
        for (int r = 0; r < RUNS; r++) { // exactly one of the two loop nests below is active
//          for (int j = 0; j < DIMENSION_2; j++) { // column-first traversal (disabled)
//              for (int i = 0; i < DIMENSION_1; i++) {
//                  sum += longs[i][j];
//              }
//          }
            // Row-first traversal: consecutive reads are adjacent in memory.
            for (int i = 0; i < DIMENSION_1; i++) {
                for (int j = 0; j < DIMENSION_2; j++) {
                    sum += longs[i][j];
                }
            }
        }
        System.out.println("duration = " + (System.nanoTime() - start));
    }
}
Compile and run it; the result is as follows:
Shell Code
$ java L1CacheMiss
Starting ....
duration = 1460583903
Next, we uncomment the first loop nest (j in the outer loop, i in the inner loop) and comment out the second one (i in the outer loop, j in the inner loop), then compile and run again. The result is even worse than we expected:
Shell Code
$ java L1CacheMiss
Starting ....
duration = 22332686898
The first program took only 1.4 seconds, yet merely swapping the loop order makes it run for 22 seconds. From the previous section we know that when longs[i][j] is loaded, longs[i][j+1] is very likely loaded into the cache as well, so an immediate access to longs[i][j+1] hits the L1 cache. Accessing longs[i+1][j] is a different matter: it will very likely cause a cache miss, which makes the program inefficient. Below we use perf to verify this, running the fast program first.
Shell Code
$ perf stat -e L1-dcache-load-misses java L1CacheMiss
Starting ....
duration = 1463011588
Performance counter stats for 'java L1CacheMiss':
164,625,965 L1-dcache-load-misses
13.273572184 seconds time elapsed
That is 164,625,965 L1 cache misses in total. Now look at the slow program:
Shell Code
$ perf stat -e L1-dcache-load-misses java L1CacheMiss
Starting ....
duration = 21095062165
Performance counter stats for 'java L1CacheMiss':
1,421,402,322 L1-dcache-load-misses
32.894789436 seconds time elapsed
This time there were 1,421,402,322 L1-dcache-load-misses, which is why it is so much slower.
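The two perf runs can be put side by side with a little arithmetic. The sketch below only divides the figures quoted above; note that perf counts misses for the whole JVM process, including start-up and array initialization, so these ratios are a rough comparison rather than an exact per-read miss rate. The class name is made up for illustration.

```java
// Hypothetical helper, not part of the original article.
public class MissRatio {
    // How many times larger the slow figure is than the fast one.
    public static double ratio(long slow, long fast) {
        return (double) slow / fast;
    }

    // Number of array reads performed by the timed loops.
    public static long timedReads(int runs, int dimension1, int dimension2) {
        return (long) runs * dimension1 * dimension2;
    }

    public static void main(String[] args) {
        // Figures copied from the two perf runs above.
        long fastMisses = 164_625_965L;
        long slowMisses = 1_421_402_322L;
        long fastNanos = 1_463_011_588L;
        long slowNanos = 21_095_062_165L;

        System.out.printf("timed reads per program: %,d%n", timedReads(10, 1024 * 1024, 62));
        System.out.printf("L1 miss ratio (slow/fast): %.2f%n", ratio(slowMisses, fastMisses));
        System.out.printf("duration ratio (slow/fast): %.2f%n", ratio(slowNanos, fastNanos));
    }
}
```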
The example above is only one kind of cache miss, the kind that occurs once the L1 cache is full. In fact, cache misses have the following three causes:
1. The data is being accessed for the first time, so it is not in the cache yet. This kind of cache miss can be addressed with prefetching.
2. Cache conflicts, which need to be resolved with padding.
3. The cache is full, as in my example. In general we need to shrink the set of data being operated on and, as far as possible, access data in its physical order.
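The advice in point 3 about accessing data in its physical order can be tried out with a scaled-down program that times both traversal orders in one run. This is a rough sketch, not a rigorous benchmark: the class name, the array size and the fill values are chosen for illustration, each order is timed only once without JIT warm-up, and the timings will differ from machine to machine.

```java
// Hypothetical comparison, not part of the original article.
public class TraversalOrder {
    // Builds a rows x cols array where element [i][j] holds i + j.
    public static long[][] fill(int rows, int cols) {
        long[][] a = new long[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                a[i][j] = i + j;
            }
        }
        return a;
    }

    // Walks each row from start to end: consecutive reads are adjacent in memory.
    public static long sumRowFirst(long[][] a) {
        long sum = 0L;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

    // Walks down each column: every read jumps to a different row.
    public static long sumColumnFirst(long[][] a) {
        long sum = 0L;
        int cols = a[0].length;
        for (int j = 0; j < cols; j++) {
            for (int i = 0; i < a.length; i++) {
                sum += a[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] a = fill(128 * 1024, 62); // smaller than the article's array

        long t0 = System.nanoTime();
        long rowSum = sumRowFirst(a);
        long t1 = System.nanoTime();
        long colSum = sumColumnFirst(a);
        long t2 = System.nanoTime();

        System.out.println("row-first:    sum=" + rowSum + " ns=" + (t1 - t0));
        System.out.println("column-first: sum=" + colSum + " ns=" + (t2 - t1));
    }
}
```

Both methods compute the same sum; only the order of the reads differs, which is what the timing difference reflects.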
For details, see this article.
The next article covers another CPU cache pitfall: false sharing.
Understanding CPU Cache from Java Perspective (CPU caches)