Talk about Java and CPU relationships

Source: Internet
Author: User
Tags: arrays, visibility, volatile

On the face of it, people who write Java have little to do with the CPU. At most it relates to what we discussed previously: how to keep the CPU busy and how to pick a thread count, and even there the formulas are only a reference point that many real scenarios have to adjust in practice. And once the CPU is saturated, we turn around and worry about how to make it less saturated; such is human nature. This article is about something else. When writing Java you may hardly ever think about the CPU, because meeting business requirements comes first. But if you work at the framework level, providing lots of shared, cached data for the framework, you inevitably run into contended data access. Of course, Java provides many classes in the concurrency package that you can simply use; still, understanding how they work internally lets you use them better. This article is not primarily an exposition of those classes, though. As the title suggests: we want to talk about the CPU.

To repeat that sentence: Java seemingly has little to do with the CPU, so let's now talk about what relationship there actually is.

1. When sharing a variable across threads, our first instinct is usually to mark it volatile to guarantee consistent reads, i.e. absolute visibility. Visibility here means that every time the variable is used, the CPU does not rely on any cached copy but fetches the data from memory, and this still holds with multiple CPUs: CPU and memory stay synchronized for that variable. To do this the CPU issues a locked instruction on the bus, such as the assembly instruction `lock addl $0, (%rsp)`; the `+0` does nothing by itself, but once the instruction completes, subsequent operations by other threads on this variable are guaranteed to see the latest value. So volatile achieves absolute visibility, but not consistency of compound operations: volatile cannot make an operation like i++ correct under multithreaded concurrency, because i++ decomposes into three steps:

int tmp = i;
tmp = tmp + 1;
i = tmp;

Another thread can interleave between these three steps. This also explains why i++ can do other things first and only then add 1: it hands the current value to another variable before incrementing.
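A minimal sketch of the lost-update problem described above (class and field names are my own for illustration): four threads increment both a volatile int and an AtomicInteger; the atomic counter always lands on the exact total, while the volatile counter usually falls short because the read-modify-write interleaves.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class VolatileIncrement {
    static volatile int volatileCounter = 0;
    static final AtomicInteger atomicCounter = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    volatileCounter++;               // three steps: read, add, write -- NOT atomic
                    atomicCounter.incrementAndGet(); // CAS retry loop -- atomic
                }
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();
        System.out.println("volatile: " + volatileCounter); // usually < 400000
        System.out.println("atomic:   " + atomicCounter.get()); // always 400000
    }
}
```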

2. If we want consistent updates under multithreaded concurrency, we need a locking mechanism. The current atomic* classes meet this requirement: internally they rely on many Unsafe methods that repeatedly compare an absolutely visible value against an expected one (compare-and-swap), so a write only succeeds if the data is still the latest. With that said, let's move on to some other CPU topics.
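The retry loop inside the atomic* classes can be sketched against the public compareAndSet API (the real implementation goes through Unsafe, so this is an approximation of the idea, not the actual JDK code):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasLoop {
    // Roughly what AtomicInteger.incrementAndGet does internally:
    // read the latest value, compute the next one, and only commit
    // if nobody changed it in between; otherwise re-read and retry.
    static int incrementAndGet(AtomicInteger a) {
        while (true) {
            int current = a.get();        // volatile read: absolutely visible value
            int next = current + 1;
            if (a.compareAndSet(current, next)) { // atomic CPU-level instruction
                return next;
            }
            // another thread won the race; loop and try again
        }
    }

    public static void main(String[] args) {
        AtomicInteger a = new AtomicInteger(41);
        System.out.println(incrementAndGet(a)); // prints 42
    }
}
```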

3. Earlier, we tried to run the CPU full and could never quite manage it, because at the start we said we would ignore the latency between memory and the CPU. Now that we are here, let's describe that latency briefly. Generally speaking, a current CPU has three levels of cache, and latencies differ between generations, so the specific numbers are only ballpark figures: L1 cache latency is around 1-2 ns, L2 is usually a few ns up to a dozen or so, L3 is generally 30-50 ns, and a memory access generally costs 70 ns or more (computers develop fast; these values come from data for certain CPUs and are a rough reference only). Although these delays are tiny, all at the nanosecond level, once your program is broken down into instruction-level operations there are a great many CPU-memory interactions, and if each interaction carries that kind of deviation, overall system performance changes noticeably.

4. Back to volatile: because it fetches data from memory every time, it gives up the cache, so in single-threaded code it naturally becomes slower. Sometimes we have no choice: reads and writes must stay consistent, sometimes a whole block of data must be synchronized. We can only reduce the lock granularity to a certain extent; it can never be removed entirely, since even the CPU itself is constrained at the instruction level.

5. At the CPU level, atomic operations are generally enforced with so-called barriers: read barriers, write barriers, and so on, triggered at specific points. When the program issues multiple instructions to the CPU, some may not execute in program order while others must, as long as consistency is ultimately guaranteed. The JIT reorders at run time, and the CPU reorders at the instruction level, mainly to optimize the run-time instruction stream so the program runs faster.
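A small sketch of how a volatile write acts as a barrier against reordering (the class and field names are illustrative): the plain write to `data` is ordered before the volatile write to `ready`, so a reader that observes `ready == true` is guaranteed by the Java memory model to also see `data == 42`.

```java
public class Publication {
    static int data = 0;
    static volatile boolean ready = false; // volatile write = release barrier
    static int observed = -1;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!ready) { }   // volatile read = acquire barrier
            observed = data;     // guaranteed to observe 42
        });
        reader.start();
        data = 42;    // plain write, may not be visible on its own...
        ready = true; // ...but this volatile write publishes it
        reader.join();
        System.out.println(observed);
    }
}
```

Without volatile on `ready`, both the JIT and the CPU would be free to reorder the two writes, and the reader could legally observe `data == 0`.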

6. The CPU caches memory in cache lines: a so-called cache-line read pulls in a contiguous chunk of memory whose size depends on the CPU model and architecture. Many current CPUs read 64 bytes of contiguous memory at a time (earlier ones 32 bytes), which is why traversing some arrays in the right order is fast (whereas traversing by columns is slow). But this is not entirely the whole story, and below we will see the flip side.
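The traversal-order effect can be sketched as follows (a minimal benchmark, not a rigorous one: timings vary by machine, so only the sums are checked):

```java
public class TraversalOrder {
    // Walks the matrix in memory order: consecutive elements of each
    // row share a cache line, so most reads are cache hits.
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++)
                sum += m[i][j];
        return sum;
    }

    // Walks column-first: each read lands in a different row array,
    // so the cache line loaded by the previous read is rarely reused.
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        for (int j = 0; j < m[0].length; j++)
            for (int i = 0; i < m.length; i++)
                sum += m[i][j];
        return sum;
    }

    public static void main(String[] args) {
        int n = 2048;
        int[][] m = new int[n][n];
        for (int[] row : m) java.util.Arrays.fill(row, 1);

        long t0 = System.nanoTime();
        long rowSum = sumRowMajor(m);
        long rowNs = System.nanoTime() - t0;

        t0 = System.nanoTime();
        long colSum = sumColumnMajor(m);
        long colNs = System.nanoTime() - t0;

        // Same sums, very different memory-access patterns.
        System.out.println("row-major:    " + rowSum + " in " + rowNs + " ns");
        System.out.println("column-major: " + colSum + " in " + colNs + " ns");
    }
}
```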

7. What happens when a CPU modifies data? Here we have to mention the modification states the CPU keeps for data. As long as a data block is only read, multiple threads on multiple CPUs can read it in parallel. Once a write to the block occurs, things change: the block moves through states such as exclusive, modified, and invalid, and copies held elsewhere are naturally invalidated after a modification. With multiple CPUs and multiple threads modifying the same data block, copies of the block travel between CPUs over the bus interconnect (QPI). And even if our threads never modify the same variable, point 6's cache line comes back to bite us, and the problem gets more troublesome: if the variables live in the same array, neighboring elements are pulled into a CPU within the same cache line, and multithreaded writes make QPI traffic very frequent. Sometimes even wrapping the array elements in objects does not help, for example:

class InputInteger {
    private int value;
    public InputInteger(int i) {
        this.value = i;
    }
}
InputInteger[] integers = new InputInteger[SIZE];
for (int i = 0; i < SIZE; i++) {
    integers[i] = new InputInteger(i);
}

At this point you can see that the array integers holds only references to the objects, and in theory the objects themselves are independent and need not be contiguous. But when Java allocates object memory, it usually allocates contiguously in the Eden region, so in this for loop, if no other thread allocates in between, the objects end up stored side by side; even after GC promotes them to the old generation they are very likely still adjacent. So simply boxing the elements in objects is not a good way to fix the cache-line modification problem. An int is 4 bytes; in 64-bit mode such an object is 24 bytes (the 4-byte field padded out with the header), or 16 bytes with pointer compression, meaning one cache line holds 3-4 of these objects, and writes from different threads still collide on the same line, driving up the system's QPI traffic. We also cannot rely on spacing the objects apart, because the memory-copying phase of GC is likely to pack them together again. The most reliable way, although it wastes a bit of memory, is to pad: grow each object to 64 bytes. With pointer compression off, the object above is 24 bytes, leaving 40 bytes to fill, which is exactly 5 extra long fields inside the object:

class InputInteger {
    public int value;
    private long a1, a2, a3, a4, a5; // pad the object out to a full 64-byte cache line
}

This method is crude, but very useful. Sometimes, though, the JIT compiler notices that these fields are never used and simply eliminates them for you, defeating the padding. The equally crude countermeasure is to add a method to the class that performs some simple operation touching all 5 fields, and then never call it.
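A hardened variant of this padding trick spreads the filler fields across a class hierarchy, the layout famously used by the LMAX Disruptor's Sequence class: HotSpot lays out superclass fields before subclass fields, so the hot field sits isolated between two walls of padding, and the JIT is less able to eliminate fields it only sees through a supertype. This is a sketch of that pattern, not a guarantee for every JVM; newer JDKs also offer an @Contended annotation for the same purpose (internal API, requires -XX:-RestrictContended).

```java
// Padding before the hot field (superclass fields are laid out first).
class LhsPadding { protected long p1, p2, p3, p4, p5, p6, p7; }

// The single hot field, alone between the two padding walls.
class Value extends LhsPadding { protected volatile long value; }

// Padding after the hot field.
class RhsPadding extends Value { protected long p9, p10, p11, p12, p13, p14, p15; }

public class PaddedCounter extends RhsPadding {
    long get() { return value; }
    void set(long v) { value = v; }

    public static void main(String[] args) {
        PaddedCounter c = new PaddedCounter();
        c.set(7);
        System.out.println(c.get());
    }
}
```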

8. At the CPU level, doing the work first is not always the winning strategy. Take acquiring a lock with an AtomicIntegerFieldUpdater-style operation: if you call getAndSet(true) on a single thread you will find it runs quite fast, but it starts to slow down on multi-core CPUs. The reason follows from what was said above: getAndSet modifies first and compares after, so every attempt dirties the cache line and QPI traffic gets very high. The better practice here is to do a get first and only then attempt the modification; and if the lock still cannot be acquired, yield and let the other threads get on with other work.
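The get-before-modify idea is the classic test-and-test-and-set spin lock. A minimal sketch, using AtomicBoolean for brevity rather than the AtomicIntegerFieldUpdater mentioned in the text (class and field names are my own):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TtasLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    static int counter = 0; // shared state protected by the lock, for the demo below

    public void lock() {
        while (true) {
            // Spin on a plain volatile READ first: it stays in the local
            // cache and generates no bus traffic while the lock is held.
            while (locked.get()) {
                Thread.yield(); // concede and let other threads run
            }
            // Only attempt the cache-line-invalidating CAS when it looks free.
            if (locked.compareAndSet(false, true)) {
                return;
            }
        }
    }

    public void unlock() {
        locked.set(false);
    }

    public static void main(String[] args) throws InterruptedException {
        TtasLock lock = new TtasLock();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // 200000: increments never lost
    }
}
```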

9. There are also many mechanisms for dealing with some CPUs being busy while others are idle; NUMA is one such architecture, though no architecture that helps in one scenario is effective in all of them. There are likewise queued lock mechanisms for managing waiting threads, such as the CLH queue lock, but these run into the cache-line problem of their own, because their state words change frequently. The kernel also applies scheduling algorithms to match the various kinds of workloads to CPUs, so that the CPU can be used more effectively.
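The CLH queue lock mentioned above can be sketched like this (the textbook formulation: each waiter spins on its own predecessor's flag rather than on one shared word, which keeps the spinning local to each CPU's cache):

```java
import java.util.concurrent.atomic.AtomicReference;

public class ClhLock {
    static final class Node { volatile boolean locked; }

    private final AtomicReference<Node> tail = new AtomicReference<>(new Node());
    private final ThreadLocal<Node> myNode = ThreadLocal.withInitial(Node::new);
    private final ThreadLocal<Node> myPred = new ThreadLocal<>();
    static int counter = 0; // shared state for the demo below

    public void lock() {
        Node node = myNode.get();
        node.locked = true;
        Node pred = tail.getAndSet(node); // atomically join the queue
        myPred.set(pred);
        while (pred.locked) { }           // spin on the PREDECESSOR's flag only
    }

    public void unlock() {
        myNode.get().locked = false;      // hand the lock to the successor
        myNode.set(myPred.get());         // recycle the predecessor's node
    }

    public static void main(String[] args) throws InterruptedException {
        ClhLock lock = new ClhLock();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock.lock();
                try { counter++; } finally { lock.unlock(); }
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter); // 200000
    }
}
```

Because each thread watches a different node, releasing the lock invalidates only one waiter's cache line instead of all of them, which is exactly the QPI-traffic reduction point 7 was about.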

There are many more details of this kind. Accumulating in a loop with an ordinary variable, a volatile one, or an atomic* class behave completely differently; looping over a multidimensional array in different dimension orders behaves differently too. There are many such fine points, and understanding why gives you inspiration in real optimization work. The details of locking go very deep: at the bottom of the system there are always some lightweight atomic operations, and no matter who claims their code needs no locks, at the finest grain it comes down to the simple fact that a CPU core can only execute one instruction at any given moment, and multi-core CPUs coordinate a shared region at the bus level for reads, writes, and memory. Make the lock granularity as small as each scenario allows, and good system performance follows as a self-evident, perfectly normal result.
