How the Java GC works, how to optimize GC performance, and how to interact effectively with the GC
A good Java programmer must understand how the GC works, how to optimize its performance, and how to interact with it effectively, because some applications, such as embedded and real-time systems, have strict performance requirements, and only by raising the overall efficiency of memory management can the performance of the whole application improve. This article first briefly introduces how the GC works, then discusses several key GC issues, and finally offers some Java programming suggestions for improving program performance from the GC's point of view.
Fundamentals of GC
Java memory management is really object management, covering both allocation and release. Programmers allocate objects with the new keyword; to release an object, they simply assign null to (or otherwise drop) all references to it so the program can no longer reach it. We then call the object "unreachable," and the GC is responsible for reclaiming the memory of all unreachable objects.
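A minimal sketch of this idea (the class name and the size of the allocation are made up for illustration):

```java
public class UnreachableDemo {
    public static void main(String[] args) {
        // Allocate an object with new; 'buffer' is the only reference to it.
        byte[] buffer = new byte[1024 * 1024];

        // ... use buffer ...

        // Drop the last reference. The array is now unreachable and becomes
        // a candidate for garbage collection; the GC reclaims it at some
        // unspecified later time.
        buffer = null;
    }
}
```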
From the GC's perspective, when a programmer creates an object, the GC begins to monitor its address, size, and usage. Typically the GC records and manages all objects in the heap as a graph, and uses that graph to determine which objects are reachable and which are not; when it determines that some objects are unreachable, it is responsible for reclaiming their memory. However, to allow GC implementations to differ across platforms, the Java specification deliberately leaves much of the GC's behavior unspecified: for example, it says nothing about which collection algorithm to use or when to collect. Different JVM implementations therefore use different algorithms, which brings a good deal of uncertainty to Java developers. This article looks at several issues related to how the GC works and tries to reduce the negative impact of that uncertainty on Java programs.
Incremental GC
A GC is typically implemented as one or more threads inside the JVM; like the user program, it consumes heap space and CPU time, and while a GC cycle is running the user program stops. If GC cycles run too long, users notice the Java program pausing; if they run too short, the collection rate may be too low, meaning many objects that should be reclaimed are not and continue to occupy a lot of memory. A GC design therefore has to trade pause time against collection rate.
A good GC implementation lets users tune it to their own needs. For example, memory-constrained devices are very sensitive to memory usage and want the GC to reclaim memory precisely, even at the cost of a slower program, while a real-time network game cannot tolerate long interruptions. Incremental GC uses a collection algorithm that splits one long pause into many small ones, reducing the GC's impact on the user program. Although an incremental GC may be less efficient overall than an ordinary GC, it reduces the program's maximum pause time.
The HotSpot JVM shipped with the Sun JDK supports incremental GC. It is off by default; to enable it, add the -Xincgc option when launching the Java program. HotSpot's incremental GC is implemented with the "train" algorithm. The basic idea is to group (layer) all objects in the heap by creation time and usage, putting frequently used and related objects in the same group and continuously adjusting the groups as the program runs. When the GC runs, it always collects the oldest (least recently accessed) objects first, and if an entire group is reclaimable it reclaims the whole group. In this way each GC run only collects a certain proportion of the unreachable objects, keeping the program running smoothly.
The finalize() function
finalize() is a method defined on the Object class with the protected access modifier; since every class is a subclass of Object, user classes can easily override it. Because finalize() calls are not chained automatically, we must chain them by hand, so the last statement of a finalize() implementation is usually super.finalize(). This gives us bottom-up finalize calls: each class releases its own resources before the parent class releases its.
According to the Java Language Specification, the JVM guarantees that an object is unreachable before its finalize() method is called, but it does not guarantee that finalize() will be called at all. The specification also guarantees that finalize() runs at most once per object.
Many Java beginners treat this method like a C++ destructor and put all of their object and resource cleanup in it. That is actually a poor practice, for several reasons: first, to support finalize() the GC has to do a lot of extra work for objects that override it; second, after finalize() runs the object may have become reachable again, so the GC must check reachability once more; both of these reduce GC performance. Third, because the time at which the GC calls finalize() is indeterminate, releasing resources this way is also indeterminate.
In general, finalize() is suited to releasing important resources that are not easy to keep track of, such as I/O handles and database connections, whose release is critical to the application as a whole. Even then, the programmer should release these resources explicitly in the program itself and use finalize() only as a backup, forming a "double insurance" mechanism; never rely solely on finalize() to release resources.
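A minimal sketch of that double-insurance pattern, with a hypothetical resource wrapper (the class name and field are illustrative):

```java
import java.io.FileInputStream;
import java.io.IOException;

public class ResourceHolder {
    private FileInputStream in;

    public ResourceHolder(String path) throws IOException {
        in = new FileInputStream(path);
    }

    // Primary mechanism: the program releases the resource explicitly.
    public void close() {
        if (in != null) {
            try {
                in.close();
            } catch (IOException ignored) {
            }
            in = null;
        }
    }

    // Backup mechanism: if the caller forgot to call close(), finalize()
    // releases the resource, then chains to the superclass as its last step.
    @Override
    protected void finalize() throws Throwable {
        try {
            close();
        } finally {
            super.finalize();
        }
    }
}
```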
How the program interacts with the GC
Java 2 strengthened memory management by adding the java.lang.ref package, which defines three reference classes: SoftReference, WeakReference, and PhantomReference. By using these classes, programmers can interact with the GC to a certain extent and improve its effectiveness. Objects held through these references have a reference strength somewhere between strongly reachable and unreachable.
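A small sketch of how a soft reference can serve as a memory-sensitive cache entry (the cache itself is only hinted at; the names and sizes are illustrative):

```java
import java.lang.ref.SoftReference;

public class SoftCacheDemo {
    public static void main(String[] args) {
        // Wrap a large object in a SoftReference instead of holding it strongly.
        SoftReference<byte[]> cached = new SoftReference<>(new byte[4 * 1024 * 1024]);

        // Later: the GC may have cleared the referent under memory pressure,
        // so always check for null and rebuild the data if necessary.
        byte[] data = cached.get();
        if (data == null) {
            data = new byte[4 * 1024 * 1024]; // recompute or reload
            cached = new SoftReference<>(data);
        }
        System.out.println("cached size = " + data.length);
    }
}
```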
Some suggestions for Java coding
Based on how the GC works, we can use a few techniques to make it run more efficiently and better match the application's requirements. Here are some programming suggestions:
1. The most basic suggestion is to release references to useless objects as soon as possible. Most programmers use temporary variables, so their references automatically disappear when execution leaves their scope. Pay special attention, though, to complex object graphs such as arrays, queues, trees, and graphs, whose elements reference one another in complicated ways; the GC generally collects these less efficiently. If the program allows it, assign null to references you no longer need as soon as possible; this can speed up the GC's work (a short sketch follows this list).
2. Use finalize() as little as possible. finalize() is a chance Java gives the programmer to release objects or resources, but it increases the GC's workload, so minimize its use for reclaiming resources.
3. Be careful with collection data types, including arrays, trees, graphs, and linked lists; these structures are more complex for the GC to collect. Also watch global and static variables, which tend to keep objects lingering (dangling references) and waste memory.
4. When the program has some idle time, the programmer can call System.gc() to suggest that the GC run, although the Java language specification does not guarantee that it actually will; using an incremental GC can also shorten the program's pauses.
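A minimal sketch of suggestions 1 and 4, using an illustrative batch-processing loop:

```java
import java.util.ArrayList;
import java.util.List;

public class GcHintDemo {
    public static void main(String[] args) {
        List<int[]> batches = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            batches.add(new int[100_000]);
        }

        // Suggestion 1: drop references to data we no longer need as early
        // as possible so the objects become unreachable.
        batches.clear();
        batches = null;

        // Suggestion 4: during an idle moment we may *suggest* a collection.
        // The JVM is free to ignore this hint.
        System.gc();
    }
}
```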
Does the application stop running while the Java GC is running?
For large applications that create many objects, the JVM spends a lot of time on garbage collection (GC). By default, while a GC is in progress the entire application must wait for it to finish, which may take several seconds or more (the -verbose:gc option of the Java application launcher reports each GC event on the console). To minimize these GC-induced pauses, which can disturb time-sensitive tasks, minimize the number of objects your application creates. Running the time-critical code in a separate JVM can also help. Finally, several tuning options can be tried to reduce GC pauses; for example, an incremental GC tries to spread the cost of a major collection over several smaller collections. This naturally reduces GC efficiency, but that may be an acceptable price for predictable timing.
Data reference: http://www.knowsky.com/362375.html
Java virtual machine optimization options and GC notes
Reference http://blog.sina.com.cn/s/blog_6d003f3f0100lmkn.html
There are many JVM options that affect benchmark testing. The more important options are:
* JVM type: server (-server) or client (-client).
* Make sure enough memory is available (-Xmx).
* The type of garbage collector used (advanced JVMs offer many tuning options, but use them carefully).
* Whether to allow class garbage collection (-Xnoclassgc). Class GC is allowed by default; using -Xnoclassgc may hurt performance.
* Whether to perform escape analysis (-XX:+DoEscapeAnalysis).
* Whether to support a large-page heap (-XX:+UseLargePages).
* Whether to change the thread stack size (for example, -Xss128k).
* JIT compilation mode: always compile (-Xcomp), never compile (-Xint), or compile only hot spots (-Xmixed; this is the default and gives the best performance).
* The amount of profiling data collected before JIT compilation (-XX:CompileThreshold), background JIT compilation (-Xbatch), and tiered JIT compilation (-XX:+TieredCompilation).
* Whether to use biased locking (-XX:+UseBiasedLocking); note that JDK 1.6 and later enable it automatically.
* Whether to activate the latest experimental performance tweaks (-XX:+AggressiveOpts).
* Enabling or disabling assertions (-enableassertions and -enablesystemassertions).
* Enabling or disabling strict native call checking (-Xcheck:jni).
* Enabling memory placement optimization for NUMA multi-CPU systems (-XX:+UseNUMA).
Class Data Sharing
Java 5 introduced a class data sharing mechanism: when a Java program first starts, some of the most commonly used base classes are optimized and dumped into a shared archive. For now only the Client VM with the serial GC is supported; the archive is stored in client/classes.jsa, which is why the program runs more slowly the first time. The feature is controlled with the -Xshare option.
One of the main design goals of J2SE 6 (code name Mustang) was to improve the performance and scalability of J2SE, mainly through better runtime efficiency, better garbage collection, and various client-side performance improvements.
1. Biased locking
Before Java 6, every lock operation resulted in an atomic CAS (compare-and-set) operation, and CAS is expensive even when the lock is never actually contended and is only ever owned by one thread, so locking incurred significant overhead. To solve this, Java 6 introduced biased locking: a lock becomes biased toward the first thread that acquires it, and subsequent lock operations by that thread need no synchronization. Roughly, the implementation works like this: a lock starts in a neutral state; when the first thread acquires it, the state is changed to biased and the thread ID is recorded. On a later lock operation, if the state is biased and the recorded thread ID equals the current thread ID, only the lock flag is set and no CAS is needed. If another thread wants the lock, it must use a CAS to change the state to revoked and wait for the lock flag to clear, after which the lock returns to the default state and is handled by the ordinary algorithm. This feature can be disabled with -XX:-UseBiasedLocking.
2. Lock coarsening
If a piece of code locks and unlocks the same lock repeatedly with nothing happening between an unlock and the next lock, the JVM can merge those multiple lock/unlock pairs into a single one. This feature can be disabled with -XX:-EliminateLocks.
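A small sketch of code the JIT may coarsen, using the synchronized StringBuffer class (the loop is purely illustrative; whether coarsening actually happens is up to the JVM):

```java
public class LockCoarseningDemo {
    public static void main(String[] args) {
        StringBuffer sb = new StringBuffer();

        // Each append() acquires and releases sb's monitor, with nothing
        // happening between the unlock and the next lock. The JIT may
        // coarsen these into one lock held across the whole loop body,
        // removing most of the lock/unlock pairs.
        for (int i = 0; i < 1000; i++) {
            sb.append(i);
            sb.append(',');
        }
        System.out.println(sb.length());
    }
}
```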
3. Adaptive spinning
On multi-CPU machines the lock implementation usually includes a short spinning phase. The right number of spins is hard to choose: too few and threads get suspended, increasing context switches; too many and CPU time is wasted. Java 6 introduces adaptive spinning, which dynamically adjusts the spin count based on how often spinning has recently succeeded in acquiring that lock.
4. Large page heap
Large pages are used on the x86/AMD64 architecture to reduce the miss rate of the TLB (the cache of virtual-to-physical address translations). In Java 6 the memory heap can use this technique.
5. Improved array copy performance
Custom assembly array-copy routines are written for each element type and size.
6. Background code optimization
Background compilation in the HotSpot Client compiler: code is optimized in the background.
7. Linear scan register allocation
A new register allocation strategy based on SSA (static single assignment) form improves performance by roughly 10%. The classical approach treats register allocation as a graph-coloring problem with time complexity around O(n^4), which is unsuitable for Java's JIT compilation. The original JVM allocated registers using local heuristic rules, with mediocre results; the linear scan register allocator used in Java 6 achieves results comparable to graph coloring, but in linear time.
8. Parallel compaction collector
Full GC now uses parallel garbage collection (in JDK 5 the non-full GCs were parallel, but the full GC was serial). Use -XX:+UseParallelOldGC to turn this feature on.
9. Parallel low-pause (concurrent) collector
An explicit GC request (such as System.gc()) can also be handled by the concurrent mark-sweep collector; use -XX:+ExplicitGCInvokesConcurrent to enable this.
10. Ergonomics in the Java 6 virtual machine
The VM automatically adjusts the garbage collection strategy, heap size, and other settings. This was added in JDK 5 and significantly enhanced in JDK 6; SPECjbb2005 performance improved by 70%.
11. Boot class loader optimization
A meta-index file describing the packages in the JRE's jar files was added to speed up class loading, improving desktop Java application startup by about 15%; memory footprint was also reduced by about 10%.
12. Graphics program optimization
A splash screen can be displayed before the JVM finishes starting.
OutOfMemoryError indicates a memory overflow, and it can occur in a number of situations.
1. Java heap overflow: java.lang.OutOfMemoryError: Java heap space (a small demo follows this list).
2. Permanent generation overflow, usually because heavy use of reflection or proxies generates too many classes: java.lang.OutOfMemoryError: PermGen space.
3. Native heap overflow, which may mean the operating system cannot allocate enough memory: either the system is out of memory or the Java process has exhausted its address space. This one is a bit subtle: on a 32-bit system a process has only a 4 GB address space, and since the Java implementation uses the native heap or a memory-mapped region as the backing store for the Java heap, and the kernel mapping also takes part of the space, the Java heap can generally use only about 2 GB. If -Xmx is set too large, the native heap left for JNI becomes too small and can itself overflow. The native heap may be used by JNI code via new or malloc, or by DirectBuffer instances.
java.lang.OutOfMemoryError: request <size> bytes for <reason>. Out of swap space?
In this case, if the Java heap is large enough, reducing the -Xmx value will, counterintuitively, solve the problem.
4. Overflow in a JNI method. The previous case is a native overflow detected by the JVM itself; this one is a failure to allocate memory while a JNI method is being called.
java.lang.OutOfMemoryError: <reason> <stacktrace> (Native method)
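A tiny sketch that provokes the first case, the Java heap overflow (run it with a small heap, for example -Xmx16m; the sizes are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class HeapOomDemo {
    public static void main(String[] args) {
        // Keep every allocation reachable so the GC cannot reclaim anything;
        // eventually this throws java.lang.OutOfMemoryError: Java heap space.
        List<byte[]> hoard = new ArrayList<>();
        while (true) {
            hoard.add(new byte[1024 * 1024]);
        }
    }
}
```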
JDK 7 performance optimizations
1. (Zero Based) Compressed OOPS
On a 64-bit CPU the JVM's oops (ordinary object pointers) are 64 bits wide. Simply put, an oop can be thought of as a reference to an object: although Java's primitive types have fixed sizes, a reference (roughly, a pointer in C terms) holds an address into the heap and therefore naturally grows to the machine word length. A 32-bit system can address at most 4 GB of memory, and to break that limit 64-bit systems have become very common; but simply widening every reference from 32 to 64 bits can grow the heap footprint by up to half, and while memory itself is cheap, memory bandwidth and CPU cache are very expensive.
Compressed oops compress managed references to 32 bits to reduce heap footprint, with the JVM inserting encode/decode instructions at runtime, somewhat like 8086 segment addressing. The memory address is computed with the formula:
<narrow-oop-base (64 bits)> + (<narrow-oop (32 bits)> << 3) + <field-offset>
The JVM encodes object references when storing them into the heap and decodes them when reading them back.
Zero-based compressed oops go further and place the base at address 0 (not necessarily physical address 0, but logical address 0 relative to the JVM, so the available CPU addressing modes can be used directly), and the formula becomes:
(<narrow-oop> << 3) + <field-offset>
This further improves performance. However, this requires OS support.
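A purely illustrative sketch of the decode arithmetic described above (the JVM does this in generated machine code, not in Java; all values and names here are made up):

```java
public class CompressedOopDemo {
    public static void main(String[] args) {
        long heapBase = 0x0000_7f00_0000_0000L; // hypothetical narrow-oop base
        int narrowOop = 0x0012_3456;            // hypothetical 32-bit compressed reference
        long fieldOffset = 16;                  // hypothetical offset of a field

        // General form: base + (narrow-oop << 3) + field-offset
        long address = heapBase + ((long) narrowOop << 3) + fieldOffset;

        // Zero-based form (base is logically 0): (narrow-oop << 3) + field-offset
        long zeroBasedAddress = ((long) narrowOop << 3) + fieldOffset;

        System.out.printf("decoded: 0x%x, zero-based: 0x%x%n", address, zeroBasedAddress);
    }
}
```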
If the Java heap is smaller than 4 GB, oops can use the low virtual address space directly and need no encoding or decoding at all.
Zero-based compressed oops therefore use different strategies for different heap sizes:
1. Heap smaller than 4 GB: no encode/decode operations needed.
2. Heap between 4 GB and 32 GB: use zero-based compressed oops.
3. Heap larger than 32 GB: compressed oops are not used.
2. Escape analysis improvements
When a variable (or object) is allocated inside a method, its pointer may be returned or stored in a global so that other methods or threads can reference it; this is called pointer (or reference) escape, meaning the variable is not used only within that method. Java objects are normally assumed to always live on the heap, which makes every object subject to garbage collection. In most cases, however, an object created inside a method is used only within that method and could perfectly well live on the stack, the most natural and cheapest place for it (in C, struct values are allocated on the stack). With reference escape analysis, objects whose references do not escape can be allocated on the stack, and no new language construct is needed: the analysis is automatic. JDK 7 enables escape analysis by default. It can also eliminate synchronization: if the analysis shows that an object does not escape, all synchronization on that object can be removed (normally the programmer's job, as with choosing StringBuilder over StringBuffer), and some objects, or parts of them, can even be kept entirely in CPU registers.
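A small sketch of the difference the analysis looks for (the method names are illustrative; whether the JIT actually stack-allocates or elides locks depends on the JVM):

```java
public class EscapeDemo {
    static StringBuilder escaped;

    // The object is returned, so its reference escapes the method:
    // it must be heap-allocated and garbage collected.
    static StringBuilder escapes() {
        StringBuilder sb = new StringBuilder();
        sb.append("escapes");
        return sb;
    }

    // The object never leaves this method: escape analysis may allow the JIT
    // to allocate it on the stack (or in registers) and drop any locking on it.
    static int doesNotEscape() {
        StringBuilder sb = new StringBuilder();
        sb.append("local only");
        return sb.length();
    }

    public static void main(String[] args) {
        escaped = escapes();          // stored in a global: escapes
        System.out.println(doesNotEscape());
    }
}
```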
3. NUMA collector enhancements
NUMA (Non-Uniform Memory Access) is implemented in many computer systems; in short, memory access is partitioned by locality, roughly analogous to how RAID groups hard disks or how an Oracle cluster groups nodes, and the JVM simply takes advantage of it.
The three features above can also be enabled on some JDK 6 releases, depending on the version (check the change notes). Java 6 added performance optimizations such as the following:
Lightweight locks use the CAS mechanism to reduce the cost of locking.
Biased locking.
Lock coarsening.
Lock elision driven by escape analysis; objects shown not to escape can also be allocated on the stack, reducing garbage collection pressure.
Adaptive spinning; spin locks only help on machines with multiple physical CPUs.
Lock elimination.
On multicore CPUs, acquiring a lock is much more expensive than on a single-core system, because it requires locking the data bus and writing back CPU caches.
So a benchmark on a single-core system often shows StringBuffer and StringBuilder performing the same, and thanks to lock elimination and related techniques, some cases on multi-core CPUs also show only a small performance difference between them.
It is expected that Java 7 will also enable OpenGL acceleration by default.
Class data sharing was added in JDK 1.5: some commonly used Java base classes are cached in a file or in shared memory for all Java processes to use.
Since JRE 1.5, when a Java program starts without -client or -server being explicitly specified, the virtual machine chooses the appropriate VM automatically. On 64-bit systems only the Server VM is implemented, so it is always used; on 32-bit systems, Windows defaults to the Client VM, while Linux and Solaris decide whether to use the Server VM based on the number of CPUs and the amount of memory (for JRE 6 the threshold is 2 CPUs and 2 GB of physical memory).
GC
There are three metrics for judging a GC: throughput (i.e. efficiency), pause time, and footprint, the amount of heap occupied.
GC algorithms
1. Copying: all surviving objects are moved to another block of memory, after which the whole original block can be reclaimed at once. This method is efficient, but it requires some free memory in reserve, and the copying itself has a cost.
2. Tracing: the collector traces the object reference graph starting from the roots. The basic tracing algorithm is "mark and sweep", which splits garbage collection into two phases. In the mark phase, the collector walks the reference graph and marks every object it encounters; the mark can be kept on the object itself or in a separate bitmap. In the sweep phase, unmarked objects are freed. Optionally the collector also defragments the heap; mark-and-sweep collectors typically use compaction or copying for this, both of which move live objects together to reduce fragmentation, giving the combined mark-sweep-compact algorithm (a toy sketch follows this list).
3. Reference counting: each object in the heap carries a reference count, incremented when a reference to it is assigned and decremented when a reference is set to null or goes out of scope (for example, when a method exits and its stack frame is reclaimed). Objects that reference each other in a cycle never reach a count of zero, and maintaining the counts on every assignment adds overhead, so this approach is no longer used.
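A toy mark-and-sweep sketch of the tracing idea above (a deliberately simplified object graph, nothing like a real collector):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class MarkSweepSketch {
    static class Node {
        boolean marked;
        List<Node> refs = new ArrayList<Node>();
    }

    // Mark phase: walk the reference graph from the root, marking every
    // object reached.
    static void mark(Node node) {
        if (node == null || node.marked) return;
        node.marked = true;
        for (Node ref : node.refs) {
            mark(ref);
        }
    }

    // Sweep phase: anything left unmarked is unreachable and is "freed"
    // (here: removed from the simulated heap); marks are reset for the next cycle.
    static void sweep(List<Node> heap) {
        Iterator<Node> it = heap.iterator();
        while (it.hasNext()) {
            Node n = it.next();
            if (!n.marked) {
                it.remove();
            } else {
                n.marked = false;
            }
        }
    }

    public static void main(String[] args) {
        List<Node> heap = new ArrayList<Node>();
        Node root = new Node();
        Node a = new Node();
        Node garbage = new Node();
        root.refs.add(a);
        heap.add(root);
        heap.add(a);
        heap.add(garbage);

        mark(root);   // 'garbage' is not reachable from the root
        sweep(heap);
        System.out.println("live objects: " + heap.size()); // prints 2
    }
}
```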
Generational collectors
Statistically, most objects have a short life cycle and are released quickly, but some objects live a long time or even for the whole program. For a copying collector, every collection moves all live objects. That is fine for short-lived objects, which usually die where they were born, but for long-lived objects it is pure busywork: moving them back and forth costs a lot of time and gains nothing. Generational collection lets long-lived objects stay in one place for a long time, so the GC can spend most of its effort collecting the short-lived ones.
In this scheme the heap is divided into two or more sub-heaps, each serving as one "generation" of objects. The youngest generation is collected most frequently. Because most objects are short-lived, only a small fraction of young objects survive their first collection. If a young object is still alive after several collections, it is promoted to an older generation and moved to a different sub-heap. Older generations are collected far less often than the young generation. Whenever objects mature in their current generation (survive several collections), they can be promoted to the next older generation.
Generational collection usually applies the copying algorithm to the young heap and the mark-sweep algorithm to the old generation. Either way, grouping the heap by object age improves the performance of basic garbage collection.
In Java, the generational collector generally divides the heap into a young generation, an old generation, and a permanent generation. Collecting the young generation is called a minor GC; because object lifetimes there are short, it is relatively efficient. Collecting the old generation is called a full GC; it is less efficient and takes longer, so the number of full GCs should be kept as low as possible.
VM
The Client VM is suited to desktop applications: startup is fast and running time is short, so it does not preload many classes or optimize them heavily.
The Server VM is suited to server programs: startup time matters less and running time is long, so it preloads the most fundamental classes and optimizes them.
GC types
Serial collector (default)
While this GC runs, all application logic is paused. A single thread uses the copying algorithm to collect the young generation, and a single thread uses mark-sweep-compact to collect the old (tenured) generation. Throughput is high. Suitable for single-CPU hardware.
Parallel collector
The young generation is collected with multiple GC threads using the copying algorithm; the application pauses while the young-generation GC runs. The old generation is still collected by a single thread using mark-sweep-compact, again while the application is paused. On machines with large memory and multiple processors, consider this parallel GC (specified with -XX:+UseParallelGC): when collecting the young generation it can use several processors, reducing pause time, although its real focus is improving throughput. Collections of the old generation, however, still use the same algorithm as the serial GC. JDK 5 update 6 therefore introduced the parallel compacting collector (specified with -XX:+UseParallelOldGC), which lets old-generation collections benefit from multiple processors as well. Since old-generation GCs take much longer than young-generation ones, this collector should in theory give better performance; notably, the parallel compacting GC is expected eventually to replace the plain parallel GC.
Concurrent mark-sweep (CMS) collector
The young generation is collected by multiple GC threads with the copying algorithm; this still pauses the application, but minor GCs are efficient so the pauses are short. The old generation is collected with a concurrent mark-sweep mechanism that runs alongside the application; its phases are further subdivided, and some phases (initial mark, remark) still pause the application completely, but only briefly, while most of the time a single GC thread runs concurrently with the application, reducing the total time the application is paused. This GC uses the same young-generation algorithm as the parallel GC, but a much more complex algorithm for the old generation, providing very short pause times. The complexity also brings more overhead, and this collector is non-compacting, so it manages the old-generation heap with free lists, making allocation more expensive. In scenarios where a short pause time matters more than raw throughput, consider this collector, known as the CMS GC.
The incremental collector (train algorithm) has been gradually deprecated; since 1.5, -Xincgc selects the concurrent GC instead.
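For orientation, the usual HotSpot flags for choosing among these collectors look roughly like this (an illustrative sketch; MyApp and the heap size are placeholders, and only -XX:+UseParallelGC and -XX:+UseParallelOldGC are named above, the serial and CMS flags are added here for completeness):

```
java -Xmx512m -XX:+UseSerialGC          MyApp   # serial collector
java -Xmx512m -XX:+UseParallelGC        MyApp   # parallel young-generation collector
java -Xmx512m -XX:+UseParallelOldGC     MyApp   # parallel compacting (old generation too)
java -Xmx512m -XX:+UseConcMarkSweepGC   MyApp   # concurrent mark-sweep (CMS)
```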
Sun J2SE 5.0 introduced so-called behavior-based parallel collector tuning, which is driven by three goals:
Maximum pause time goal: specified with -XX:MaxGCPauseMillis=n; empty by default. When specified, GC pause times in the three memory areas are kept to at most n milliseconds; if a pause exceeds this, the corresponding memory area is shrunk to shorten GC pauses.
Throughput goal: specified with -XX:GCTimeRatio=n; the default is 99, meaning GC time should be at most 1% of total application running time. If the goal is not met, the corresponding memory area is enlarged to increase the running time between two GCs.
Footprint goal: since memory is plentiful nowadays, this goal rarely needs attention.
These three goals are prioritized from top to bottom: the maximum pause time goal is satisfied first, then the throughput goal, and finally the footprint goal.
Use -Xloggc:<file> and -XX:+PrintGCDetails to write a GC log, then view it with GCViewer, which has the advantage of producing statistics: throughput, maximum and minimum pause times, the share of full GC time in total GC time, and so on. At present, however, it only supports logs up to 1.5.
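Putting the tuning goals and the logging flags together, an invocation might look roughly like this (an illustrative sketch; MyApp, the log file name, and the numbers are placeholders):

```
java -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=19 \
     -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails \
     MyApp
```

With -XX:GCTimeRatio=19 the target is 1/(1+19) = 5% of total time spent in GC; the default of 99 corresponds to the 1% mentioned above.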
JConsole is a Java monitoring and management console that lets you watch how various VM resources are used at run time. In Java 5 you had to start the application with an extra parameter; in Java 6, thanks to the Attach API, JConsole automatically loads the JMX agent into the target JVM.
The jstat command prints a wide variety of VM statistics, including memory usage, garbage collection times, class loading, and JIT compiler statistics. The jmap command obtains a heap histogram or a heap dump at run time. The jhat command analyzes heap dumps. The jstack command obtains thread stack traces. These diagnostic tools can attach to any running application and do not require it to be started in any special way.
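Typical invocations of these tools might look like this (a sketch; the PID and file names are placeholders):

```
jstat -gcutil <pid> 1000 10               # GC utilization, sampled every second, 10 samples
jmap -histo <pid>                         # heap histogram: instance counts per class
jmap -dump:format=b,file=heap.bin <pid>   # binary heap dump
jhat heap.bin                             # analyze the dump via a local web server
jstack <pid>                              # thread stack traces
```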
Reference http://blog.sina.com.cn/s/blog_6d003f3f0100lmkn.html