Writing Java also requires understanding the CPU cache


It is generally assumed that you need to understand the CPU when writing C/C++, while writers of high-level languages (Java/C#/Python, ...) have no need to understand anything that low-level. I thought so at first too, until I came across LMAX's Disruptor and Martin's blog posts, and realized that Java is not a language in which you can ignore the CPU either. After a period of reading, I would like to summarize what I took away from it. This article mainly discusses the effect of the CPU cache on Java programming; it does not cover the mechanism or implementation of the CPU cache itself.

The cache structure of a modern CPU is generally divided into three levels: L1, L2, and L3.

The closer a cache is to the CPU, the faster and the smaller it is.

L1 is closest to the CPU; it has the smallest capacity and the highest speed. Each core has its own L1 cache (strictly speaking, each core has two L1 caches: an L1d cache for data and an L1i cache for instructions);

The L2 cache is larger, for example 256 KB, and slower. In general, each core also has its own separate L2 cache;

The L3 cache is the largest of the three levels, for example 12 MB, and also the slowest. All cores in the same CPU socket share one L3 cache.

When the CPU needs data, it looks for it first in L1, then in L2, then in L3. If the data is in none of the three cache levels, it is fetched from memory. The longer the lookup path, the longer the access takes. So if some data will be accessed very frequently, you want it to stay in the L1 cache, where access is fastest. The following table shows the approximate latency from the CPU to each cache level and to memory:

                        Approx. CPU cycles    Approx. time
Register                1 cycle
L1 cache                ~3-4 cycles           ~0.5-1 ns
L2 cache                ~10-20 cycles         ~3-7 ns
L3 cache                ~40-45 cycles         ~15 ns
Cross-socket transfer                         ~20 ns
Memory                  ~120-240 cycles       ~60-120 ns

On Windows, you can view the CPU's cache information with CPU-Z.

Under Linux, you can view the same information from the command line.
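For example (the exact commands shown in a screenshot are standard ones), lscpu summarizes the cache sizes, and sysfs exposes the details of each level:

# summary of all cache levels
lscpu | grep -i cache
# details of one cache (index0 is typically the L1 data cache)
cat /sys/devices/system/cpu/cpu0/cache/index0/size
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size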

With that overview of the CPU in place, let's look at the cache line. A cache is made up of cache lines, and a typical cache line is 64 bytes (the "64-byte line size"). A cache is not used one byte at a time but one cache line at a time; in other words, the cache line is the smallest unit in which the CPU accesses the cache.

This means that a program can run into performance problems if it does not use cache lines well, as the following program shows:

public class L1CacheMiss {
    private static final int RUNS = 10;
    private static final int DIMENSION_1 = 1024 * 1024;
    private static final int DIMENSION_2 = 6;

    private static long[][] longs;

    public static void main(String[] args) throws Exception {
        Thread.sleep(10000);
        longs = new long[DIMENSION_1][];
        for (int i = 0; i < DIMENSION_1; i++) {
            longs[i] = new long[DIMENSION_2];
            for (int j = 0; j < DIMENSION_2; j++) {
                longs[i][j] = 0L;
            }
        }
        System.out.println("starting....");

        long sum = 0L;
        for (int r = 0; r < RUNS; r++) {
            final long start = System.nanoTime();

            // slow: column-major traversal
            // for (int j = 0; j < DIMENSION_2; j++) {
            //     for (int i = 0; i < DIMENSION_1; i++) {
            //         sum += longs[i][j];
            //     }
            // }

            // fast: row-major traversal
            for (int i = 0; i < DIMENSION_1; i++) {
                for (int j = 0; j < DIMENSION_2; j++) {
                    sum += longs[i][j];
                }
            }

            System.out.println(System.nanoTime() - start);
        }
    }
}

Take the Xeon E3 CPU I use, together with a 64-bit operating system and a 64-bit JVM, as an example, and assume each row of the array is stored contiguously (row-major order).

On a 64-bit system, the object header of a Java array occupies 16 bytes (unconfirmed), and a long occupies 8 bytes. So 16 + 8 * 6 = 64 bytes, exactly the length of one cache line.
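If you want to verify the header size on your own JVM rather than leave it "(unconfirmed)", one option is OpenJDK's JOL tool. A minimal sketch, assuming the org.openjdk.jol:jol-core dependency is on the classpath (the class name ArrayLayout is made up for this example):

import org.openjdk.jol.info.ClassLayout;

public class ArrayLayout {
    public static void main(String[] args) {
        // Prints header size, element offsets, and total instance size
        // of a long[6] as laid out by the running JVM.
        System.out.println(ClassLayout.parseInstance(new long[6]).toPrintable());
    }
}

On a typical 64-bit HotSpot JVM with compressed oops, this reports a 16-byte array header (a 12-byte object header plus a 4-byte length field), consistent with the 16 + 8 * 6 = 64 calculation above.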

In the fast version (the uncommented loop, with i in the outer loop), each time the inner loop starts, the block fetched from memory covers all of longs[i][0] through longs[i][5] (exactly 64 bytes). Every access in the inner loop therefore hits the L1 cache, and the traversal is very fast.

If the fast loop is replaced with the slow, commented-out one (with j in the outer loop), there are many more cache misses. Each fetch from memory still brings in one row's block of data (for example, all of longs[i][0] through longs[i][5]), but the next iteration of the loop touches the same column of a different row (longs[1][0] follows longs[0][0]), so longs[0][1] through longs[0][5] are never reused before the line is evicted. The gap in running time is substantial (the measured results, in microseconds, are omitted here).

Ultimately, we would all like our data to be served from the L1 cache, but reality often refuses to cooperate, so cache misses are something we constantly need to work to avoid.

In general, cache misses occur in three scenarios:

1. The data is accessed for the first time and is not yet in the cache. This kind of miss can be reduced by prefetching.

2. Cache conflicts: when threads on different cores write to different variables that happen to share a cache line (false sharing), each write invalidates the other core's copy of the line. This is addressed by padding; see the sketch after this list.

3. The cache is full. In general, the fix is to reduce the size of the working set and to access data in its physical order as much as possible.
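As an illustration of scenario 2, here is a minimal, hypothetical sketch of cache-line padding (the names FalseSharingDemo and PaddedLong are made up for this example, and field layout is ultimately up to the JVM, so the padding is a heuristic rather than a guarantee):

public class FalseSharingDemo {

    // One counter per thread; the p1..p7 longs pad the object so that
    // two counters do not end up on the same 64-byte cache line.
    static final class PaddedLong {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7;
    }

    private static final PaddedLong[] counters = { new PaddedLong(), new PaddedLong() };

    public static void main(String[] args) throws Exception {
        Thread[] threads = new Thread[2];
        for (int t = 0; t < threads.length; t++) {
            final int idx = t;
            threads[t] = new Thread(() -> {
                for (long i = 0; i < 50_000_000L; i++) {
                    counters[idx].value++; // each thread writes only its own counter
                }
            });
        }
        long start = System.nanoTime();
        for (Thread th : threads) th.start();
        for (Thread th : threads) th.join();
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}

With the padding fields present, the two counters land on different 64-byte lines and the threads stop invalidating each other's caches; removing p1..p7 typically makes the loop measurably slower. Since Java 8, the JDK's @Contended annotation (sun.misc.Contended, enabled for user code with -XX:-RestrictContended) achieves the same effect.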

Reference:

http://mechanitis.blogspot.hk/2011/07/dissecting-disruptor-why-its-so-fast_22.html

http://coderplay.iteye.com/blog/1485760

http://en.wikipedia.org/wiki/CPU_cache

Transferred from: http://www.cnblogs.com/techyc/p/3607085.html
