Java Optimization on multi-core platforms

Source: Internet
Author: User

Multi-core CPUs are now mainstream. Multi-core technology makes better use of the hardware and improves throughput; for Java programs, it also enables concurrent garbage collection. However, multi-core brings problems of its own, mainly around multi-threaded access to shared memory. The bandwidth between memory and the CPU is a major bottleneck, and each core has its own share of cache, which improves performance only when accesses hit it. The JVM implements threads on top of the operating system's lightweight processes, so whenever a thread writes to shared memory, the affected cache lines must be invalidated and the data re-fetched from main memory, which carries a high overhead. Multi-core platforms therefore call for special optimizations beyond the general ones.

Code optimization

The number of threads must be greater than or equal to the number of cores.

With multithreading, the CPU can be fully utilized only when the number of runnable threads is at least the number of cores; otherwise some cores sit idle. Note, however, that too many threads consume too much memory, causing performance to degrade rather than improve. JVM garbage collection also requires threads of its own, so the thread count here includes the JVM's internal threads.
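As a minimal sketch of the advice above, a fixed-size pool can be sized from the core count the JVM reports (the class and method names here are illustrative, not from the original article):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizing {
    // Size a worker pool from the core count reported by the JVM; for
    // CPU-bound work a pool of roughly this size keeps every core busy
    // without piling up excess threads and their stack memory.
    public static int workerCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount());
        for (int i = 0; i < 100; i++) {
            pool.submit(() -> Math.sqrt(42.0)); // placeholder CPU-bound task
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

For I/O-bound workloads a larger pool is common, since threads spend time blocked rather than computing.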

Minimize shared data write operations

Each thread has its own working memory, a region the system can optimize freely. Reading shared memory does not hurt performance much, but once a thread writes to shared memory (for example through a volatile field), the JVM must insert memory barrier (memory fence) instructions to keep the processor from reordering the operations. Compared with thread-local variables, this is much slower. The solution is to minimize the amount of shared mutable data, which also matches the design principle of low data coupling.
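A small sketch of the contrast (the class and field names are illustrative; the example demonstrates reducing volatile traffic, not atomicity of `+=`):

```java
public class SharedWriteDemo {
    // Every store to this volatile field crosses a memory barrier.
    static volatile long sharedTotal = 0;

    // Anti-pattern: update the shared volatile inside the loop,
    // paying a volatile read + write (and its barriers) per iteration.
    static void accumulateShared(int n) {
        for (int i = 0; i < n; i++) {
            sharedTotal += i;
        }
    }

    // Preferred: accumulate in a local variable, which can live in a
    // register, and publish to the shared field once at the end.
    static void accumulateLocal(int n) {
        long local = 0;
        for (int i = 0; i < n; i++) {
            local += i;
        }
        sharedTotal += local;  // single volatile read-modify-write
    }
}
```

Under real contention the final `+=` would itself need an atomic or a lock; the point here is only the reduction in shared-memory writes.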

Use the synchronized keyword

In Java 1.5, synchronized performed poorly: it is a heavyweight operation that has to go through the operating system, and acquiring the lock could cost more than the work it protects, so the Lock objects Java provides performed better by comparison. Java 1.6 changed this. synchronized has clear semantics and can be optimized with adaptive spinning, lock elimination, lock coarsening, lightweight locks, and biased locks; on Java 1.6 its performance is no worse than Lock. Officially, synchronized is also the supported construct, with room for further optimization in future versions.
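The two constructs side by side, as a minimal sketch (class and method names are illustrative):

```java
import java.util.concurrent.locks.ReentrantLock;

public class Counters {
    private long a = 0;
    private long b = 0;
    private final ReentrantLock lock = new ReentrantLock();

    // Intrinsic lock: since Java 6, HotSpot can apply biased locking,
    // lightweight locking, and adaptive spinning to these monitors.
    public synchronized void incA() { a++; }
    public synchronized long getA() { return a; }

    // Explicit lock: same mutual exclusion, plus extras such as
    // tryLock(), fairness, and interruptible acquisition.
    public void incB() {
        lock.lock();
        try { b++; } finally { lock.unlock(); }
    }
    public long getB() {
        lock.lock();
        try { return b; } finally { lock.unlock(); }
    }
}
```

A reasonable rule of thumb is to prefer synchronized unless a feature specific to ReentrantLock is actually needed.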

Use optimistic strategies

Traditional synchronized concurrency is pessimistic: it assumes that whenever multiple threads operate on an object, two of them will eventually operate at the same time, so a lock must always be taken. An optimistic strategy instead assumes access will normally not conflict, and simply retries when a conflict does occur. This is more efficient under low contention. Java's AtomicInteger uses this strategy.
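A sketch of the optimistic retry loop that AtomicInteger's own methods use internally, written out explicitly with compareAndSet (the wrapper class name is illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class OptimisticCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Optimistic strategy: read, compute the new value, then
    // compare-and-set. If another thread changed the value in the
    // meantime, CAS fails and we simply retry — no lock is taken.
    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public int get() { return value.get(); }
}
```

In practice one would just call `incrementAndGet()`, which performs exactly this loop; the expanded form shows where the retry-on-conflict happens.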

Use ThreadLocal

ThreadLocal gives each thread its own copy of an object, which is never shared with other threads. When the thread terminates, all of its local copies can be reclaimed.
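A minimal sketch of a per-thread reusable buffer (the class and method names are illustrative):

```java
public class PerThreadBuffer {
    // Each thread gets its own StringBuilder: no sharing between
    // threads, hence no synchronization and no memory barriers needed.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(StringBuilder::new);

    public static String tag(String msg) {
        StringBuilder sb = BUFFER.get();   // this thread's private copy
        sb.setLength(0);                   // reuse without reallocating
        sb.append(Thread.currentThread().getName()).append(": ").append(msg);
        return sb.toString();
    }
}
```

In thread pools, where threads are long-lived, calling `remove()` when a value is no longer needed avoids holding on to the per-thread copies.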

Sort fields in the class

Fields that a class accesses frequently can be placed together, so that they are more likely to be loaded into the cache together. It is also best to put them at the head of the class, and not to scatter them among primitive and reference fields.

Batch Processing Array

Modern processors can process multiple array elements with a single instruction, for example reading or writing several entries of a byte array at once. Therefore, prefer batch interfaces such as System.arraycopy() over looping over the array yourself.
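A small sketch of a grow-and-append helper built on System.arraycopy, which HotSpot treats as an intrinsic and can implement with bulk memory moves (the class and method names are illustrative):

```java
import java.util.Arrays;

public class CopyDemo {
    // Append src to dst starting at offset 'used', growing dst if
    // needed. Both Arrays.copyOf and System.arraycopy are bulk
    // operations — no hand-written element-by-element loop.
    public static byte[] append(byte[] dst, int used, byte[] src) {
        if (used + src.length > dst.length) {
            dst = Arrays.copyOf(dst, Math.max(dst.length * 2, used + src.length));
        }
        System.arraycopy(src, 0, dst, used, src.length);
        return dst;
    }
}
```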

JVM Optimization

Enable Large Memory Page

The default operating system page size is 4 KB. If your heap is 4 GB, that means 1024 * 1024 pages to manage, so it is best to increase the page size. The page size is an operating system setting and cannot be changed by the JVM alone. Configuring it on Linux is somewhat involved and is not covered in detail here.

-XX:+UseLargePages is enabled by default in Java 1.6, with -XX:LargePageSizeInBytes set to 4 MB. In some deployments it is configured to 128 MB; in official performance tests the configuration was 256 MB.

Enable compressed pointers

64-bit Java is slower than 32-bit because its pointers are extended from 32 to 64 bits. Although this extends the addressing space from 4 GB to 256 TB, it degrades performance and consumes more memory, so the JVM can compress the pointers. Compressed pointers support heaps up to 32 GB while approaching 32-bit JVM performance.

Compressed pointers are enabled by default since JDK 6 update 23; in earlier versions they can be turned on with -XX:+UseCompressedOops.

Published benchmarks of compressed pointers show a considerable performance improvement.



Enable NUMA

NUMA is a CPU architecture feature. In an SMP architecture, CPU cores are symmetric but share a single system bus, so with more CPUs the bus becomes a bottleneck. In a NUMA architecture, CPUs form groups with point-to-point communication between them, each group largely independent of the others. Enabling NUMA support can improve performance.

NUMA works only when the hardware, the operating system, and the JVM all enable it. On Linux, numactl can be used to configure NUMA, and the JVM enables it with -XX:+UseNUMA.

Aggressive optimization

Java 1.6 provides an aggressive-optimization switch, -XX:+AggressiveOpts, which turns on optimizations that would normally ship only in the next release. It may cause instability: some time ago, a JDK 7 bug surfaced only after this option was enabled.

Escape Analysis

If an object created inside a method is passed out of that method, it is said to escape the method; if it is passed to another thread, it escapes the thread. If the JVM can prove an object does not escape the method, it can allocate it on the stack instead of the heap, saving GC time; it can even break the object apart and use its member variables directly, which makes better use of the cache. If an object does not escape its thread, all synchronization on it can be eliminated, greatly improving performance.

Escape analysis is difficult, however: it costs CPU time to analyze an object, and if the object turns out to escape, nothing can be optimized and the analysis effort is wasted. Complex algorithms therefore cannot be used, and the current JVM does not implement true stack allocation. As a result, performance may even decrease after enabling it.

Escape analysis can be enabled with -XX:+DoEscapeAnalysis.
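A sketch of a method where the allocated object provably does not escape (the class and method names are illustrative):

```java
public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // 'p' never leaves this method — no method escape, no thread
    // escape. With -XX:+DoEscapeAnalysis the JIT can scalar-replace
    // it, keeping x and y in registers and skipping the heap
    // allocation (and any GC work for it) entirely.
    public static int distanceSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x * p.x + p.y * p.y;
    }
}
```

Whether the optimization actually fires depends on the JIT compiler and inlining decisions; the code's observable behavior is unchanged either way.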

High-throughput GC Configuration

For high throughput, use the Parallel Scavenge collector in the young generation and the Parallel Old collector in the old generation.

Enable them with -XX:+UseParallelOldGC.

You can adjust -XX:ParallelGCThreads based on the number of CPUs, typically 1/2 to 5/8 of the CPU count.

Low-latency GC Configuration

For low-latency applications, use the ParNew collector in the young generation and the CMS collector in the old generation.

Enable them with -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC.

You can adjust -XX:ParallelGCThreads based on the number of CPUs, typically 1/2 to 5/8 of the CPU count.

You can adjust -XX:MaxTenuringThreshold (the age at which objects are promoted to the old generation); the default is 15. Tuning it can reduce GC pressure on the old generation.

You can use -XX:TargetSurvivorRatio to adjust the target occupancy of the survivor spaces; the default is 50%. Raising it increases survivor-space utilization.

You can adjust -XX:SurvivorRatio, the ratio of Eden to a survivor space; the default is 8. The smaller the ratio, the larger the survivor spaces, and the longer objects can stay in the young generation.

And so on.

 

See: the Java optimization white paper and the Java Virtual Machine specification.
