Some things about Java and the CPU

Source: Internet
Author: User

In fact, someone writing Java seems to have little to do with the CPU. At most it comes up when we try to keep the CPU fully busy and size the thread count, as mentioned before, but that formula is only a reference; in different scenarios we have to measure and adapt in practice. And when the CPU is fully occupied, we also think about how to make it less busy. Well, this article will talk about other things. Perhaps you do not need to pay attention to the CPU when writing Java, because satisfying the business comes first. But if you want to reach the framework level, providing shared data caches and the like for a framework, there are bound to be plenty of data-contention problems in between. Of course, Java provides many classes in the concurrent package that you can use, but you must understand the details before using them; otherwise you may misuse them, or not need them at all. This article will not focus on those details, because, as the clickbait title says, we want to talk about the CPU, haha.


Java seems to have little to do with the CPU, so let's talk about where they actually meet:

1. When we encounter a shared variable, we usually use volatile to guarantee consistent reads, that is, absolute visibility. Visibility here means that each time the data is used, the CPU does not rely on any locally cached content but fetches the data from main memory once, and this holds across multiple CPUs; in other words, the CPU and memory are synchronized at that moment. On a volatile write the CPU issues an assembly instruction similar to lock addl $0x0,(%esp) on the bus; adding zero does nothing by itself, but once the locked instruction completes, the store has been flushed and other CPUs' cached copies invalidated, so subsequent reads of this variable from other threads see the new value. That is the absolute visibility it can achieve. However, it cannot achieve consistent read-modify-write operations; that is, volatile cannot make the i++ operation atomic under multi-threaded concurrency, because i++ decomposes into:

int tmp = i;
tmp = tmp + 1;
i = tmp;

These are three separate steps. From this you can also see why i++ can do other things first and only then add 1 to itself: the old value has already been copied into another variable.
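A minimal sketch of the consequence (class and variable names here are illustrative, not from the article): two threads each increment a volatile counter 100,000 times, but because i++ is the three steps above, increments from the two threads interleave and the final value usually falls short of 200,000.

public class VolatileIncrement {
    static volatile int counter = 0; // visible to all threads, but ++ is not atomic

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++; // read, add, write back: three steps that can interleave
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // usually prints less than 200000
    }
}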


2. When we need consistency under multi-threaded concurrency, we need a locking mechanism. Currently the Atomic* classes can basically meet these requirements; internally they are built on methods of the Unsafe class, and by repeatedly comparing against the absolutely visible data they ensure that every update is based on the latest value. Next we will continue with other CPU topics.
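As a sketch of the idea (the real implementation goes through Unsafe and is more involved), the following hand-written loop behaves like AtomicInteger.incrementAndGet(): read the current value with a volatile read, attempt a compare-and-swap, and if another thread changed the value in between, retry.

import java.util.concurrent.atomic.AtomicInteger;

public class CasIncrement {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger();

        // the library call:
        counter.incrementAndGet();

        // an equivalent hand-written CAS loop:
        int current;
        do {
            current = counter.get();                            // volatile read: absolute visibility
        } while (!counter.compareAndSet(current, current + 1)); // retry if someone beat us to it

        System.out.println(counter.get()); // 2
    }
}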


3. Earlier, in order to keep the CPU fully busy, we simply ignored memory and CPU latency, but it cannot be ignored forever, so let's talk briefly about latency. Current CPUs generally have three levels of cache, and latencies differ from generation to generation, so the numbers can only be described roughly. An L1 cache access currently takes about 1-2 ns, L2 cache about a few ns up to 10 ns, L3 cache about 30-50 ns, and a main-memory access about 70 ns or more (computers develop quickly; these values come from some CPUs and serve only as a reference range). Although these latencies are tiny, down at the nanosecond level, once your program is broken into instruction-level operations there are a great many CPU-memory interactions, and a deviation that large on each interaction adds up.
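A rough way to feel the difference (a sketch only, with illustrative names; JMH would be the proper benchmarking tool): walk an int[] once sequentially, where the hardware prefetcher keeps the next cache line ready, and once along a randomly shuffled index chain, where almost every load misses the caches and pays the main-memory latency.

import java.util.Random;

public class LatencyDemo {
    static final int SIZE = 1 << 24; // 16M ints (64 MB), far larger than L3; may need -Xmx512m

    public static void main(String[] args) {
        int[] next = new int[SIZE];
        for (int i = 0; i < SIZE; i++) next[i] = (i + 1) % SIZE; // sequential chain
        System.out.println("sequential walk: ~" + walk(next) + " ns/access");

        shuffle(next);                                            // random chain
        System.out.println("random walk:     ~" + walk(next) + " ns/access");
    }

    static long walk(int[] next) {
        long start = System.nanoTime();
        int p = 0;
        for (int i = 0; i < SIZE; i++) p = next[p]; // each load depends on the previous one
        long elapsed = System.nanoTime() - start;
        if (p == -1) System.out.println();          // keep the loop from being eliminated
        return elapsed / SIZE;                      // rough ns per access
    }

    // Shuffle the indices and link them into one ring, so following next[]
    // jumps around memory unpredictably.
    static void shuffle(int[] next) {
        int n = next.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        Random rnd = new Random(42);
        for (int i = n - 1; i > 0; i--) {           // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (int i = 0; i < n; i++) next[perm[i]] = perm[(i + 1) % n];
    }
}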


4. Back to the volatile mentioned earlier: since every read fetches the data from memory, it simply gives up the cache, so in single-threaded use it naturally becomes slower; sometimes we have to accept that. Some read/write operations require even stronger consistency, up to synchronizing an entire data block. We can only reduce the granularity of the lock to some extent, but we cannot eliminate locking completely; even at the CPU level there are restrictions at the instruction level, as point 5 below explains.
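As an aside on reducing lock granularity, here is a minimal lock-striping sketch (illustrative names, not code from the article): instead of one lock guarding a whole table of counters, each stripe gets its own lock, so threads touching different stripes no longer contend with each other.

import java.util.concurrent.locks.ReentrantLock;

public class StripedCounter {
    private static final int STRIPES = 16;
    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];
    private final long[] counts = new long[STRIPES];

    public StripedCounter() {
        for (int i = 0; i < STRIPES; i++) locks[i] = new ReentrantLock();
    }

    public void increment(int key) {
        int s = (key & 0x7fffffff) % STRIPES; // pick the stripe for this key
        locks[s].lock();
        try {
            counts[s]++;                      // only this stripe is locked
        } finally {
            locks[s].unlock();
        }
    }

    public long total() {                     // coarse read: lock each stripe in turn
        long sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            locks[i].lock();
            try { sum += counts[i]; } finally { locks[i].unlock(); }
        }
        return sum;
    }
}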


5. The ordering operations at the CPU level are generally called barriers, divided into read barriers and write barriers, and they are generally triggered at a single point. When a program sends many instructions to the CPU, some of them need not be executed in program order while others must be, as long as the outcome is eventually consistent. In this reordering, the JIT rearranges instructions at run time and the CPU reorders at the instruction level as well, mainly to optimize the instruction stream so the program runs faster.
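A minimal sketch of what a barrier buys you (the classic flag-and-data demo, with illustrative names; run with -ea to enable the assert): without the barriers implied by volatile, the two plain writes in writer() may be reordered by the JIT or the CPU, so reader() could observe ready == true while value is still 0.

public class ReorderingDemo {
    static int value = 0;
    static volatile boolean ready = false; // remove volatile and the assert may fail

    static void writer() {
        value = 42;   // ordinary store
        ready = true; // volatile store: acts as a write barrier, value cannot sink below it
    }

    static void reader() {
        if (ready) {            // volatile load: acts as a read barrier
            assert value == 42; // guaranteed only because ready is volatile
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread w = new Thread(ReorderingDemo::writer);
        Thread r = new Thread(ReorderingDemo::reader);
        r.start(); w.start();
        w.join(); r.join();
    }
}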


6. At the CPU level, memory is operated on in cache lines. A so-called cache line is a contiguous piece of memory read in one go; its size is generally tied to the CPU model and architecture. Many current CPUs read 64 bytes of contiguous memory at a time, while 32 bytes was common in earlier days. This is why traversing a two-dimensional array row by row is fast, while column-based traversal is slow. However, this is not the whole story; the next point describes the opposite situation.
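A sketch of the traversal effect (illustrative names): row-by-row walks memory sequentially, so each 64-byte cache line fetched serves 16 consecutive ints, while column-by-column jumps a whole row ahead on every step and misses far more often.

public class TraversalOrder {
    public static void main(String[] args) {
        int n = 4096;
        int[][] m = new int[n][n];
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += m[i][j];           // row-major: cache friendly
        long rowNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                sum += m[i][j];           // column-major: roughly one miss per access
        long colNs = System.nanoTime() - t0;

        System.out.println("row: " + rowNs + " ns, col: " + colNs + " ns (" + sum + ")");
    }
}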


7. What about when the CPU modifies data? Reads are easy: a piece of data can be read concurrently by multiple threads on multiple CPUs. A write to a data block is different: the cache line holding it moves between states such as exclusive, modified, and invalid, and once one CPU modifies the data, the copies in other CPUs' caches naturally become invalid. Under multiple CPUs, when multiple threads modify the same data block, data has to be copied between the CPUs over the bus (QPI). Of course, two CPUs cannot commit conflicting modifications to the same data, but coming back to the cache line makes the problem quite troublesome: if the data sits in the same array, adjacent elements share a cache line, so the cross-CPU QPI traffic becomes very frequent. Sometimes this problem occurs even when the array holds objects, for example:

class InputInteger {
    private int value;
    public InputInteger(int i) {
        this.value = i;
    }
}

InputInteger[] integers = new InputInteger[SIZE];
for (int i = 0; i < SIZE; i++) {
    integers[i] = new InputInteger(i);
}

Here you can see that integers holds only references; the objects themselves are, in theory, independent and need not be stored contiguously. However, when Java allocates object memory it usually allocates consecutively in the Eden region, so if no other thread allocates during the for loop, these objects end up stored side by side, and they tend to stay together even when the GC copies them to the old generation. So simply wrapping the value in an object does not reliably defeat the cache line: an int is 4 bytes, and in 64-bit mode the object is 24 bytes (including 4 bytes of padding), or 16 bytes with pointer compression enabled. That means one cache line can still hold 3-4 of these objects. To keep CPU caching effective without triggering all that QPI traffic, you cannot count on the objects staying separated, because the memory copying during GC is likely to put them back together. The most reliable way, although it wastes memory, is to pad the object itself up to 64 bytes. With pointer compression disabled the object is 24 bytes, so 40 bytes are missing, and you only need to add five long fields to the object:

class InputInteger {
    public int value;
    private long a1, a2, a3, a4, a5;
}

Haha, this method is crude, but it works very well. Sometimes, though, when the JIT compiler finds that these fields are never used, it simply eliminates them for you, and the padding optimization is undone. The down-to-earth countermeasure is to perform some operation on these five fields in a method body, even though that method is never called.
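A sketch of that countermeasure (and, as an aside, since JDK 8 the JVM offers the @sun.misc.Contended annotation together with -XX:-RestrictContended for exactly this padding purpose, though that is beyond what this article covers):

class InputInteger {
    public int value;
    private long a1, a2, a3, a4, a5; // padding up to 64 bytes

    // Never called; merely referencing the fields discourages the JIT
    // from eliminating them as dead padding.
    long keepPadding() {
        return a1 + a2 + a3 + a4 + a5;
    }
}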


8. At the CPU level, grabbing first may not be best. For example, suppose you acquire a lock by calling getAndSet(true) on an AtomicBoolean (or the equivalent via AtomicIntegerFieldUpdater). In a single thread you will find it runs quite fast; under a multi-core CPU it starts to slow down. Why is clear from the above: getAndSet both modifies and publishes the data every time, so the cross-CPU QPI traffic becomes very high. It is therefore better to do a plain get first and only attempt the modification when the value looks free. Also, try to grab the lock once, and if you fail, back off and let the thread do something else.
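A minimal test-then-set spin lock along these lines (illustrative names; a sketch rather than a production lock): spin on a plain read, which stays within the core's cache, and only attempt the expensive getAndSet when the lock looks free, yielding when it does not.

import java.util.concurrent.atomic.AtomicBoolean;

public class TtasLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (locked.get()) {             // read-only spin: no cache-line invalidation
                Thread.yield();                // back off instead of hammering the bus
            }
            if (!locked.getAndSet(true)) {     // looked free: now try to grab it
                return;
            }
        }
    }

    public void unlock() {
        locked.set(false);
    }
}

The read-only spin generates QPI traffic only at the moment the lock actually changes hands.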


9. There are many algorithms for dealing with busy and idle CPUs. For example, NUMA is one kind of solution, but whatever the architecture, each is useful in certain scenarios and not necessarily effective in all of them. There are also queue-lock mechanisms that manage the waiting threads' status, but a naive one runs into the cache-line problem again, because the status flags change frequently; so kernels and applications have produced algorithms that cooperate with the CPU more effectively, such as the CLH queue lock.
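A minimal CLH queue-lock sketch (the textbook form, with illustrative names; Thread.onSpinWait() needs Java 9+): each thread spins on its predecessor's node rather than on one shared flag, so different threads spin on different cache lines and a release invalidates only the successor's line.

import java.util.concurrent.atomic.AtomicReference;

public class ClhLock {
    static final class Node {
        volatile boolean locked; // true while the owner holds or wants the lock
    }

    private final AtomicReference<Node> tail = new AtomicReference<>(new Node());
    private final ThreadLocal<Node> myNode = ThreadLocal.withInitial(Node::new);
    private final ThreadLocal<Node> myPred = new ThreadLocal<>();

    public void lock() {
        Node node = myNode.get();
        node.locked = true;
        Node pred = tail.getAndSet(node); // atomically enqueue behind the old tail
        myPred.set(pred);
        while (pred.locked) {             // spin on the predecessor's node only
            Thread.onSpinWait();
        }
    }

    public void unlock() {
        Node node = myNode.get();
        node.locked = false;              // successor's spin sees this and proceeds
        myNode.set(myPred.get());         // recycle the predecessor's node next time
    }
}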


There are many more details like this. For example, plain-variable loop accumulation, volatile variables, and the Atomic* series behave completely differently; for multi-dimensional array loops, traversing by different dimensions and in different orders also differs. There are a great many such details, and you can see where the inspiration comes from in an actual optimization process. The details of locks are so fine they make you dizzy. At the bottom of the system there are always lightweight atomic operations: no matter who says their code needs no locks, at the smallest scale it comes down to the fact that a CPU core executes one instruction at a time, and multiple CPU cores still share bus-level arbitration controlling reads, writes, and memory. Minimize the lock granularity for each scenario, and the system's performance speaks for itself.


This article is just for reference!

