I. Tuning focuses on several areas
- Memory tuning
- CPU usage tuning
- Lock contention tuning
- I/O tuning
II. Twitter's biggest enemy: latency
What causes latency?
- The biggest factor is GC
- Others include lock contention and thread scheduling, I/O, and inefficient choices of algorithm or data structure
III. Memory performance tuning
(1) Memory consumption tuning
Causes of OutOfMemoryError: the data volume may genuinely be too large, the application may be trying to hold too much data at once, or there may be a memory leak.
To determine whether the data volume is genuinely too large:
- Check the GC logs and compare memory use before and after a full GC; if little memory is reclaimed, the live data set really is that large
- Try increasing the JVM heap size
- Ask whether the data really needs to be in memory; alternatives include an LRU cache that evicts cold entries, or soft references (SoftReference)
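The LRU eviction mentioned above can be sketched with a plain `LinkedHashMap` in access order (a minimal illustration; the class name `LruCache` is ours, not from the talk):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: a LinkedHashMap in access order evicts the
// least-recently-used entry once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = iterate in access order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // called after each put
    }
}
```

In a real service the evicted value would be swapped out to slower storage rather than simply dropped.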
Data bloat (fat data)
- It happens when you try to do something unusual with your data, e.g. loading the entire social graph into a single JVM instance, or loading all user metadata into a single JVM instance
- At Twitter's scale, it pays to do extra work to shrink the internal data representation
Causes of data bloat:
(1) Object headers (a JVM object header is normally two machine words: 64 bits on a 32-bit JVM, 128 bits on a 64-bit JVM; so even new java.lang.Object() is not free, and new byte[0] pays the header plus an array-length field). More on object headers: http://blog.csdn.net/wenniuwuren/article/details/50939410
(2) Alignment padding
Consider this example:

```java
public static class D { byte d1; }
public static class E extends D { byte e1; }
```

Because each class's fields are padded to the JVM's alignment boundary separately, new D() occupies far more space than its single byte of payload suggests, and new E() occupies even more. For the detailed space calculation see: http://blog.csdn.net/wenniuwuren/article/details/50958892
Most JVMs today are 64-bit, and 64-bit pointers make far worse use of the CPU cache than 32-bit pointers. It is therefore recommended to add the JVM flag -XX:+UseCompressedOops, which compresses 64-bit pointers to 32 bits while still addressing a large heap, giving the best of both worlds. In addition, keep the maximum heap below about 30 GB, since compressed oops stop working once the heap approaches 32 GB.
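A launch line using compressed oops might look like this (the heap sizes and jar name are illustrative, not recommendations from the talk):

```shell
# Illustrative flags: explicit heap kept below the ~32 GB
# compressed-oops limit, with pointer compression enabled.
java -Xms4g -Xmx4g -XX:+UseCompressedOops -jar app.jar
```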
Avoid wrapper classes for primitive types where possible
In Scala 2.7.7, a Seq[Int] stored boxed values: the space cost was roughly (overhead + 32*length) bytes, versus (overhead + 4*length) bytes for unboxed storage.
This was fixed in Scala 2.8. The lesson:
- You may not know the performance characteristics of the libraries you use (e.g. whether an Int is actually stored as a primitive int)
- You may never discover such problems unless you run the code under a profiler
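The boxing cost is easy to reproduce in plain Java (a minimal illustration of ours, not from the talk): an int[] stores 4 bytes per element inline, while a List&lt;Integer&gt; stores a reference per element plus a boxed heap object, and every read unboxes.

```java
import java.util.List;

public class BoxingDemo {
    // Sums a primitive array: no per-element allocation, values inline.
    static long sumPrimitive(int[] xs) {
        long s = 0;
        for (int x : xs) s += x;
        return s;
    }

    // Sums a boxed list: each element is a separate heap object,
    // and each access auto-unboxes it.
    static long sumBoxed(List<Integer> xs) {
        long s = 0;
        for (Integer x : xs) s += x; // unboxing on every element
        return s;
    }
}
```

Both return the same result; the difference is memory layout and allocation, which only shows up under a profiler or in GC logs.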
Map footprints
- Guava's MapMaker.makeMap() occupies 2272 bytes
- MapMaker.concurrencyLevel(1).makeMap() occupies 352 bytes
Use ThreadLocal with care
- A typical problem is the M*N blow-up of pooled resources: e.g. a 200-thread pool where each thread caches 50 connections ends up holding 10,000 connections
- Consider using a synchronized shared object instead, or simply creating a new object each time
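The M*N blow-up can be sketched as follows (the Connection class and the counts are illustrative): because ThreadLocal.withInitial runs its supplier once per thread, a per-thread cache of 50 connections across 200 threads materializes 10,000 connections.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadLocalBlowup {
    static final AtomicInteger LIVE_CONNECTIONS = new AtomicInteger();

    // Stand-in for a pooled resource; counts how many instances exist.
    static class Connection {
        Connection() { LIVE_CONNECTIONS.incrementAndGet(); }
    }

    // Anti-pattern: every thread lazily builds its own cache of 50 connections.
    static final ThreadLocal<Connection[]> PER_THREAD_CACHE =
            ThreadLocal.withInitial(() -> {
                Connection[] cache = new Connection[50];
                for (int i = 0; i < cache.length; i++) cache[i] = new Connection();
                return cache;
            });

    // Starts 200 threads, each touching its cache once; returns the
    // total number of connections created (200 threads * 50 each).
    public static int run() {
        Thread[] threads = new Thread[200];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> PER_THREAD_CACHE.get());
            threads[i].start();
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return LIVE_CONNECTIONS.get();
    }
}
```

A single shared pool would cap the count at the pool size instead.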
IV. Fighting latency
The performance triangle
- Figure 1: memory footprint, latency, and throughput trade off against one another
- Figure 2: compactness (which reduces memory footprint), higher throughput, and faster response
How does the young generation work?
- All new objects are allocated in Eden; because young-generation GC compacts, allocation is a simple pointer bump
- When Eden fills up, a stop-the-world minor GC runs and the survivors are copied to a Survivor space
- After surviving several minor GCs, objects are promoted (tenured) to the old generation
The idealized young generation
- Eden is large enough to hold more than one full set of concurrent request/response objects (so requests complete without a stop-the-world pause, and throughput stays high)
- Each Survivor space is large enough to hold the live objects plus those still aging (reducing premature promotion to the old generation)
- The tenuring threshold is set so that long-lived objects are promoted at just the right time (freeing Survivor space)
Start tuning with the young generation
- Print verbose GC logs with flags such as -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintHeapAtGC, -XX:+PrintTenuringDistribution, etc.
- Watch Survivor occupancy and set an appropriate Survivor size
- Watch the tenuring threshold so that long-lived objects are promoted to the old generation quickly
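Putting the pieces together, a young-generation tuning run might start from a line like this (all sizes and thresholds are examples to be adjusted against the logs, not recommendations from the talk):

```shell
# Illustrative young-generation sizing plus the GC logging flags
# mentioned above; tune the numbers against what the logs show.
java -Xms4g -Xmx4g \
     -XX:NewSize=1g -XX:SurvivorRatio=8 \
     -XX:MaxTenuringThreshold=4 \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution \
     -jar app.jar
```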
(1) -XX:+PrintHeapAtGC

```
Heap after GC invocations=1 (full 0):
 par new generation   total 943744K, used 54474K [0x0000000757000000, 0x0000000797000000, 0x0000000797000000)
  eden space 838912K,   0% used [0x0000000757000000, 0x0000000757000000, 0x000000078a340000)
  from space 104832K,  51% used [0x00000007909a0000, 0x0000000793ed2ae0, 0x0000000797000000)
  to   space 104832K,   0% used [0x000000078a340000, 0x000000078a340000, 0x00000007909a0000)
 concurrent mark-sweep generation total 1560576K, used 0K [0x0000000797000000, 0x00000007f6400000, 0x00000007f6400000)
 concurrent-mark-sweep perm gen total 159744K, used 38069K [0x00000007f6400000, 0x0000000800000000, 0x0000000800000000)
}
```
(2) -XX:+PrintTenuringDistribution

```
Desired survivor size 53673984 bytes, new threshold 4 (max 6)
- age 1:  9165552 bytes,  9165552 total
- age 2:  2493880 bytes, 11659432 total
- age 3:  6817176 bytes, 18476608 total
- age 4: 36258736 bytes, 54735344 total
: 899459K->74786K(943744K), 0.0654030 secs] 1225769K->401096K(2504320K), 0.0657530 secs] [Times: user=0.55 sys=0.00, real=0.07 secs]
```
CMS tuning
- The CMS collector needs headroom: give it as much memory as possible
- Reduce fragmentation to avoid full GCs
- -XX:CMSInitiatingOccupancyFraction=N, where N is typically 75-80 (starting too early reduces throughput; starting too late causes concurrent mode failure)
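A typical CMS configuration along these lines might look like this (the occupancy value is within the 75-80 range above; the rest of the line is illustrative):

```shell
# Illustrative CMS setup: start the concurrent cycle at 75% old-gen
# occupancy, and honor that fraction rather than the adaptive default.
java -Xms4g -Xmx4g \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=75 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -jar app.jar
```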
Is response still too slow?
- If too many objects survive each minor GC, try shrinking the young generation, shrinking the Survivor spaces, and lowering the tenuring threshold
- If there are too many threads, look for the smallest workable concurrency level, or spread the load across more JVM instances
- To reduce lock contention, try volatile instead of synchronized where it suffices, and try the Atomic* atomic classes
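A minimal sketch of the last point (the class names are ours): an AtomicLong counter keeps the atomicity of a synchronized counter but replaces the monitor with a lock-free compare-and-swap.

```java
import java.util.concurrent.atomic.AtomicLong;

public class Counters {
    // Lock-based counter: every increment acquires the object's monitor,
    // so contended threads block.
    static class SyncCounter {
        private long value;
        synchronized void inc() { value++; }
        synchronized long get() { return value; }
    }

    // Lock-free counter: incrementAndGet uses a CAS loop, so contended
    // threads retry briefly instead of blocking on a lock.
    static class AtomicCounter {
        private final AtomicLong value = new AtomicLong();
        void inc() { value.incrementAndGet(); }
        long get() { return value.get(); }
    }
}
```

Note that volatile alone only gives visibility, not atomic read-modify-write; for counters the Atomic* classes are the right substitute.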
Fighting CMS fragmentation with slab allocation
Apache Cassandra uses slab allocation internally: each slab is a 2 MB chunk, and byte[] values are copied into it using CAS. Before slab allocation, Cassandra spent 30-60 seconds per hour on GC for a given workload; afterwards, the same workload spent about 5 seconds on GC over 3 days and 10 hours.
Slab allocation has limitations: when the cache fills up its contents must be written to disk, and objects must be serialized to binary form.
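The slab scheme described above can be sketched as follows (a simplified illustration of the idea, not Cassandra's actual code): one large byte[] per slab, with a CAS-advanced offset so many threads can copy values in without locking, and without creating small fragmented objects for CMS to manage.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a Cassandra-style slab: one 2 MB byte[] with a
// CAS-advanced bump pointer for lock-free, fragmentation-free
// placement of serialized values.
public class Slab {
    public static final int SLAB_SIZE = 2 * 1024 * 1024;

    private final byte[] data = new byte[SLAB_SIZE];
    private final AtomicInteger nextOffset = new AtomicInteger();

    /**
     * Copies value into the slab and returns its offset,
     * or -1 if the slab has no room (caller rotates to a new slab).
     */
    public int allocate(byte[] value) {
        while (true) {
            int current = nextOffset.get();
            int next = current + value.length;
            if (next > SLAB_SIZE) return -1;
            // CAS claims the range [current, next); if another thread
            // won the race, loop and try the new offset.
            if (nextOffset.compareAndSet(current, next)) {
                System.arraycopy(value, 0, data, current, value.length);
                return current;
            }
        }
    }

    public byte[] read(int offset, int length) {
        byte[] out = new byte[length];
        System.arraycopy(data, offset, out, 0, length);
        return out;
    }
}
```

This is where the limitations noted above come from: only byte[] payloads fit, so objects must be serialized before being cached.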
Twitter engineers talk about JVM tuning