While browsing a memory management glossary I stumbled upon the definition of "Pig in the Python". On the surface, the term describes a GC that keeps promoting large objects from one generation to the next, much like a python devouring prey so big that it cannot move while digesting it.
For the next 24 hours I could not get the image of that suffocating python out of my head. As psychiatrists say, the best way to get rid of a fear is to talk about it, and so this article was born. But the story that follows is not about boa constrictors; it is about GC tuning. I swear.
It is well known that GC pauses can easily become a performance bottleneck. Modern JVMs ship with advanced garbage collectors, but in my experience it is still very hard to find the optimal configuration for a given application. Manual tuning may offer a glimmer of hope, but it requires understanding the exact mechanics of the GC algorithms. This article will help you with that: I will use an example to show how a small change in JVM configuration can affect the throughput of your application.
Example
The application we use to demonstrate the impact of GC on throughput is a simple program. It contains two threads:
PigEater – it imitates the python incessantly swallowing big fat pigs. The code does this by appending a 32 MB byte array to a java.util.List, then sleeping 100 ms after each swallow.
PigDigester – it simulates an asynchronous digestion process. The code that implements digestion simply clears the reference to the list of pigs. Since this is very tiring work, the thread sleeps for 2000 ms after each time it clears the reference.
Both threads run in a while loop, eating and digesting until the snake is full, which takes about 5000 pigs.
The code is as follows:
package eu.plumbr.demo;

import java.util.ArrayList;
import java.util.List;

public class PigInThePython {
  static volatile List pigs = new ArrayList();
  static volatile int pigsEaten = 0;
  static final int ENOUGH_PIGS = 5000;

  public static void main(String[] args) throws InterruptedException {
    new PigEater().start();
    new PigDigester().start();
  }

  static class PigEater extends Thread {
    @Override
    public void run() {
      while (true) {
        pigs.add(new byte[32 * 1024 * 1024]); // 32MB per pig
        if (pigsEaten > ENOUGH_PIGS) return;
        takeANap(100);
      }
    }
  }

  static class PigDigester extends Thread {
    @Override
    public void run() {
      long start = System.currentTimeMillis();
      while (true) {
        takeANap(2000);
        pigsEaten += pigs.size();
        pigs = new ArrayList(); // "digest" by dropping the reference to the old list
        if (pigsEaten > ENOUGH_PIGS) {
          System.out.format("Digested %d pigs in %d ms.%n", pigsEaten, System.currentTimeMillis() - start);
          return;
        }
      }
    }
  }

  static void takeANap(int ms) {
    try {
      Thread.sleep(ms);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
Now we define the throughput of this system as the number of pigs digested per second. Since a pig is stuffed into the python every 100 ms, the theoretical maximum throughput of the system is 10 pigs per second.
GC Configuration Example
Let's look at how the system performs under two different configurations. Regardless of the configuration, the application runs on a dual-core Mac (OS X 10.9.3) with 8 GB of memory.
The first configuration (a sample launch command follows the list):
1. 4 GB of heap (-Xms4g -Xmx4g)
2. CMS (-XX:+UseConcMarkSweepGC) to clean the old generation, and the parallel collector (-XX:+UseParNewGC) to clean the young generation
3. 12.5% of the heap (-Xmn512m) allocated to the young generation, further limiting the sizes of the Eden and survivor spaces
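For reference, a launch command combining these flags might look roughly like this. This is only a sketch: the main class eu.plumbr.demo.PigInThePython is taken from the listing above, and the flag values are simply the ones listed.

java -Xms4g -Xmx4g -Xmn512m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC eu.plumbr.demo.PigInThePython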
The second configuration is slightly different (again, a sample command follows the list):
1. 2 GB of heap (-Xms2g -Xmx2g)
2. The parallel collector (-XX:+UseParallelGC) is used for both the young and the old generation
3. 75% of the heap is allocated to the young generation (-Xmn1536m)
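Again as a sketch, with the same assumed main class, the corresponding command line would be roughly:

java -Xms2g -Xmx2g -Xmn1536m -XX:+UseParallelGC eu.plumbr.demo.PigInThePython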
Now it is time to place your bets: which configuration performs better (that is, how many pigs can it eat per second)? Those of you who put your chips on the first configuration are going to be disappointed. The results are exactly the opposite:
1. The first configuration (large heap, large old generation, CMS GC) devours 8.2 pigs per second
2. The second configuration (small heap, large young generation, Parallel GC) devours 9.2 pigs per second
Now let's look at the results objectively. With only half the resources allocated, throughput increased by 12% (9.2 / 8.2 ≈ 1.12). This runs counter to common sense, so it is worth analyzing further what is going on.
Analyzing the GC results
The reason is not complicated: you just have to look at what the GC was doing while the test ran to find the answer. You can pick whatever tool you like; with the help of jstat I uncovered the secret behind it.
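The exact invocation is not shown here, but the column layout in the output below matches jstat's -gc output with the -t timestamp flag, so a command along these lines would presumably produce it (the <pid> placeholder stands for the demo's process id):

jstat -gc -t <pid> 1s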
By analyzing the data, I noticed that configuration 1 went through 1129 GC cycles (YGC + FGC) and spent a total of 63.723 seconds on GC (jstat reports sizes in KB and times in seconds):
Timestamp  S0C       S1C       S0U       S1U       EC        EU        OC         OU         PC       PU      YGC   YGCT    FGC  FGCT   GCT
594.0      174720.0  174720.0  163844.1  0.0       174848.0  131074.1  3670016.0  2621693.5  21248.0  2580.9  1006  63.182  116  0.236  63.419
595.0      174720.0  174720.0  163842.1  0.0       174848.0  65538.0   3670016.0  3047677.9  21248.0  2580.9  1008  63.310  117  0.236  63.546
596.1      174720.0  174720.0  98308.0   163842.1  174848.0  163844.2  3670016.0  491772.9   21248.0  2580.9  1010  63.354  118  0.240  63.595
597.0      174720.0  174720.0  0.0       163840.1  174848.0  131074.1  3670016.0  688380.1   21248.0  2580.9  1011  63.482  118  0.240  63.723
The second configuration paused only 168 times (YGC + FGC) and spent only 11.409 seconds on GC:
Timestamp  S0C       S1C       S0U  S1U  EC         EU         OC        OU        PC       PU      YGC  YGCT   FGC  FGCT   GCT
539.3      164352.0  164352.0  0.0  0.0  1211904.0  98306.0    524288.0  164352.2  21504.0  2579.2  27   2.969  141  8.441  11.409
540.3      164352.0  164352.0  0.0  0.0  1211904.0  425986.2   524288.0  164352.2  21504.0  2579.2  27   2.969  141  8.441  11.409
541.4      164352.0  164352.0  0.0  0.0  1211904.0  720900.4   524288.0  164352.2  21504.0  2579.2  27   2.969  141  8.441  11.409
542.3      164352.0  164352.0  0.0  0.0  1211904.0  1015812.6  524288.0  164352.2  21504.0  2579.2  27   2.969  141  8.441  11.409
Considering that the amount of work is equivalent in both cases — with no long-lived objects in sight, the GC's only duty in this pig-eating exercise is to clear out the garbage as quickly as possible — the first configuration simply forces the GC to run roughly 6 to 7 times as often, with a total pause time roughly 5 to 6 times as long.
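Working through the numbers from the jstat output above makes the ratios concrete:

Configuration 1: 1011 YGC + 118 FGC = 1129 collections, GCT = 63.723 s
Configuration 2:   27 YGC + 141 FGC =  168 collections, GCT = 11.409 s

1129 / 168 ≈ 6.7 times as many collections; 63.723 / 11.409 ≈ 5.6 times as much total pause time.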
This story has two purposes. The first, and most important: I hope it finally drives that exhausted snake out of my head. The other, more obvious takeaway is that GC tuning is a tricky exercise that requires a solid understanding of the underlying concepts. Even with this utterly trivial application, the configuration you choose has a significant impact on throughput and capacity planning, and in real-world applications the difference is even greater. So the choice is yours: you can master these concepts yourself, or you can focus on your day-to-day work and let Plumbr find the GC configuration that best suits your needs.