Troubleshooting a full GC caused by HashMap in a concurrent Java environment

Source: Internet
Author: User
Tags: memory usage, CPU usage


Symptom

A new requirement went live recently, and the code was restructured as part of it. About half an hour after the release, one machine raised a high-load alarm. Logging on showed that the load itself was actually normal, but one CPU core was pegged at 100% while the others were almost idle. About an hour later other machines triggered the same alarm with the same symptom, so the application was rolled back immediately.

The application runs on a VM with 16 GB of memory. The JVM heap is 11 GB, of which 3 GB is the young generation, using the CMS collector on JDK 7.

The first suspicion was that the JVM was performing a full GC, since only one core was busy while the application threads appeared to be suspended by the JVM. The application log was checked first, but it had no output at all. jstat -gcutil was then used to check JVM memory usage, and the old generation turned out to be 100% full.


Remove service

Since the full GC made the RT far too long, the next step was to go to ateye and take the application's registered HSF service offline, but the operation failed: the whole JVM was unresponsive.

Preserving the scene

jmap -F -dump:format=b,file=atw.bin <pid>
An attempt to dump the whole heap to local disk failed; the JVM was already too far gone.

jmap -histo <pid> > histo.log
The histo memory snapshot was saved successfully ;)

jstack <pid> > stack.log
The JVM thread dump was saved successfully :)

With the scene preserved, the application was restarted.

Preliminary analysis
First, look at the JVM thread dump to see whether application threads are blocked contending for resources. If a large number of threads are blocked, each one holds on to some memory for the request it has not yet finished processing; that memory cannot be reclaimed by minor GC, the heap gets tight, and a full GC follows.

cat stack.log | grep atw | sort | uniq -c | sort -nr | head -10
Result (top 10):

177 at ...service.impl...searchInterProduct(AtwSearchServiceImpl.java:505)
104 at ...service.impl...searchOneWay(AtwSearchServiceImpl.java:260)
 80 at ...service.impl.executor...execute(OneWayCommonAgentSearchExecutor.java:419)
 70 at ...service.impl.executor...handleFlights(AbstractSearchExecutor.java:689)
 47 at ...service.impl...searchOneWay(AtwSearchServiceImpl.java:257)
 31 at ...service.impl.executor...getFlightInfoAndStock(AbstractSearchExecutor.java:1073)
 30 at ...service.impl.executor...getFlightInfoAndStock(AbstractSearchExecutor.java:1087)
 22 at ...util.stlog.FarePolicyCounter.addFail(FarePolicyCounter.java:249)
 20 at ...service.impl.executor...execute(OneWayCommonAgentSearchExecutor.java:424)
 20 at ...service.impl.executor...getAllFares(AbstractSearchExecutor.java:892)
The application has 200 HSF threads. The busiest application method is com.taobao.trip.atw.service.impl.AtwSearchServiceImpl.searchInterProduct (AtwSearchServiceImpl.java:505) with 177 threads, which is below the HSF thread count and therefore normal; the counts for the other methods are also within their normal ranges. Nothing unusual was found in the lock information either.

Next, take a look at histo.log:

 num     #instances         #bytes  class name
----------------------------------------------
   1:        204258     4368429800  [B
   2:       6812683      926524888  com.taobao.trip.atw.domain.AtwInterFareDO
   3:      22639343      724458976  java.util.HashMap$Entry
   4:      22304135      538457776  [S
   5:      21614962      518759088  java.lang.Long
   6:      13867918      443773376  com.taobao.trip.atw.util.LongReferenceConcurrentHashMap$HashEntry
   7:       6812439      326997072  com.taobao.trip.atw.domain.AtwInterFareSegmentDO
   8:        421442      211696296  [J
   9:        557827      199825568  [Ljava.util.HashMap$Entry;
  10:       6812439      163498536  [Lcom.taobao.trip.atw.domain.AtwInterFareSegmentDO;
The largest memory consumer is the byte array ([B): 204258 instances occupy roughly 4 GB of the 11 GB heap, an average of a little over 20 KB each.

The first guess was that the newly released code was allocating byte[] somewhere, so the code was reviewed; the new feature contains no such allocation, and there are in fact few places in the entire application code that create byte arrays directly.

The next suspicion was a change in a dependent second-party or third-party jar. A new branch and release ticket were created to compare the changes in the release package:

The comparison shows that the jar changes in this release are small: no third-party packages changed, only internal ones, and analysis of the changed packages found no new byte[] allocations either.

Continuing with histo.log, a histo snapshot was taken on a healthy online machine and subtracted from the faulty machine's numbers. The difference (top entries) is as follows:

Class                                                            Instances       Bytes
[B                                                                   47404  4275481936
java.util.HashMap$Entry                                           19442009   622144288
java.lang.Long                                                    19711584   473078016
[Ljava.util.HashMap$Entry;                                          239216   164734560
com.taobao.at.dal.common.dataobject.AtwMessage                      359668    20141408
java.util.HashMap                                                   215770    10356960
java.util.concurrent.LinkedBlockingQueue$Node                       421036    10104864
com.taobao.trip.atw.metaq.service.common.LocalMessageReactor$1      359769     8634456
com.alibaba.rocketmq.common.message.MessageExt                       65151     6775704


Besides byte[], there are about 20 million more java.util.HashMap$Entry instances than on a normal machine. Reviewing the code gave no clear explanation for a scenario in which HashMap entries and byte[] grow at the same time.

At this point the analysis had hit a wall; new clues were needed.

Identify new clues through GC logs
The analysis so far shows that the application is in full GC and that large numbers of byte[] and java.util.HashMap$Entry instances cannot be reclaimed.

However, gc.log, the most direct product of a full GC, had not yet been mined. Around the time of the full GC, a new clue turned up (retyped):

=WARNNING= allocating large array -- thread_id[0x00007f71211b0800] -- thread_name[owc-425027705] -- array_size[2132509912 bytes] -- array_length[2132509891 elememts]
prio=10 tid=0x00007f71211b0800 nid=0x3f43e runnable
at com.alibaba.dubbo.common.io.Bytes.copyOf(Bytes.java:59)
at com.alibaba.dubbo.common.io...write(UnsafeByteArrayOutputStream.java:64)
at com.taobao.hsf.com.caucho.hessian.io...flushBuffer(Hessian2Output.java:1553)
at com.taobao.hsf.com.caucho.hessian.io...printString(Hessian2Output.java:1466)
at com.taobao.hsf.com.caucho.hessian.io...writeString(Hessian2Output.java:987)
at com.taobao.hsf.com.caucho.hessian.io...writeObject(BasicSerializer.java:149)
at com.taobao.hsf.com.caucho.hessian.io...writeObject(Hessian2Output.java:421)
at com.taobao.hsf.com.caucho.hessian.io...writeObject(MapSerializer.java:99)
at com.taobao.hsf.com.caucho.hessian.io...writeObject(Hessian2Output.java:421)
at com.taobao.hsf.com.caucho.hessian.io...serialize(UnsafeSerializer.java:293)
at com.taobao.hsf.com.caucho.hessian.io...writeInstance(UnsafeSerializer.java:212)
at com.taobao.hsf.com.caucho.hessian.io...writeObject(UnsafeSerializer.java:171)
at com.taobao.hsf.com.caucho.hessian.io...writeObject(Hessian2Output.java:421)
at com.taobao.hsf.remoting.serialize...encode(Hessian2Encoder.java:23)
at com.taobao.hsf.remoting.server.output...writeHSFResponse(RpcOutput.java:47)
at com.taobao.hsf.remoting.provider...handleRequest(ProviderProcessor.java:202)
at com.taobao.hsf.remoting.server...handleRequest(RPCServerHandler.java:47)
at com.taobao.hsf.remoting.server...handleRequest(RPCServerHandler.java:25)
...
Alibaba's customized JVM adds many features; one of them is that when a full GC fails to reclaim memory, it prints the thread making the largest allocation along with details of that allocation:

=WARNNING= allocating large array -- thread_id[0x00007f71211b0800] -- thread_name[owc-425027705] -- array_size[2132509912 bytes] -- array_length[2132509891 elememts]

Thread owc-425027705 is one of the application's threads processing HSF requests, and it is allocating a huge array of roughly 2 GB! The stack in the GC log shows that it is busy copying a byte[]:

at com.alibaba.dubbo.common.io.Bytes.copyOf(Bytes.java:59)

This copy happens when, after handling an HSF request, the application serializes the result into a byte[] that is then sent back to the caller over the network.
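For reference, a minimal sketch of that serialization step using the open-source Hessian API (the HSF classes in the stack above are a shaded copy of it; the response map here is just a placeholder for the real result object):

import com.caucho.hessian.io.Hessian2Output;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class HessianEncodeSketch {
    public static void main(String[] args) throws IOException {
        Map<Long, String> response = new HashMap<Long, String>(); // placeholder result object
        response.put(1L, "fare");

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        Hessian2Output out = new Hessian2Output(bos);
        out.writeObject(response); // walks the whole object graph, including Map fields
        out.close();

        byte[] payload = bos.toByteArray(); // the byte[] that would be sent back to the caller
        System.out.println(payload.length + " bytes");
    }
}

If the map inside the response has been corrupted into a ring, writeObject never returns and the buffer behind it keeps growing and being copied, which is how a byte[] of around two gigabytes can appear.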

That explains the byte[], but not yet the growth of java.util.HashMap$Entry.

Searching the thread dump for the name owc-425027705 shows that it is the main thread processing the request, and that its four sub-threads are all stuck in stacks like this:

"Owc -- 425027705-344" daemon prio = 10 tid = 0x00007f710278f800 nid = 0x3f414 runnable [0x0000000051906000]
Java. lang. Thread. State: RUNNABLE
At java. util. HashMap. getEntry (HashMap. java: 469)
At java. util. HashMap. get (HashMap. java: 421)
At com. taobao. trip. atw. result. impl. DefaultPriceMergerOW. processHeightQuality (DefaultPriceMergerOW. java: 327)
At com. taobao. trip. atw. result. impl. DefaultPriceMergerOW. extendedProductProcess (DefaultPriceMergerOW. java: 179)
At com. taobao. trip. atw. result. impl. DefaultPriceMergerOW. mergeOneWay (DefaultPriceMergerOW. java: 137)
At com. taobao. trip. atw. result. PriceMergerProxy. mergeOneWay (PriceMergerProxy. java: 184)
...
The sub-threads are stuck reading from a HashMap! We had run into this before: concurrent modification of a HashMap can turn its internal Entry linked list into a ring, after which get() spins forever. This had to be a HashMap problem!

Why HashMap ends up in an endless loop under multiple threads

In short, calling put on a HashMap from multiple threads can turn the internal Entry linked list into a ring.
First, put checks whether the capacity is sufficient; if not, the internal array is resized:

void addEntry(int hash, K key, V value, int bucketIndex) {
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }

    createEntry(hash, key, value, bucketIndex);
}
The problem lies in resize: it calls transfer, which re-links every bucket's Entry linked list into the new array (JDK 7 implementation):

void transfer(Entry[] newTable, boolean rehash) {
    int newCapacity = newTable.length;
    for (Entry<K,V> e : table) {
        while (null != e) {
            Entry<K,V> next = e.next;
            if (rehash) {
                e.hash = null == e.key ? 0 : hash(e.key);
            }
            int i = indexFor(e.hash, newCapacity);
            e.next = newTable[i];
            newTable[i] = e;
            e = next;
        }
    }
}
When multiple threads execute this code concurrently, the linked list can end up containing a loop.

Once the map contains such a ring, any get that walks that Entry linked list loops forever:

final Entry<K,V> getEntry(Object key) {
    if (size == 0) {
        return null;
    }

    int hash = (key == null) ? 0 : hash(key);
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}
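To see how this plays out, here is a small stand-alone demo (not from the original article) that puts into one plain HashMap from two threads at once. The race is timing-dependent, so it may take several runs to trigger, and it demonstrates JDK 7 behaviour; run it with an external timeout in case it hangs:

import java.util.HashMap;
import java.util.Map;

public class HashMapRaceDemo {
    public static void main(String[] args) throws InterruptedException {
        final Map<Long, Long> map = new HashMap<Long, Long>();

        Runnable writer = new Runnable() {
            public void run() {
                for (long i = 0; i < 100000; i++) {
                    map.put(i, i); // unsynchronized puts race during resize
                }
            }
        };

        Thread t1 = new Thread(writer, "writer-1");
        Thread t2 = new Thread(writer, "writer-2");
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // If a bucket's Entry list has been linked into a ring, this get may
        // never return, pegging one core at 100% -- the symptom seen above.
        System.out.println(map.get(50000L));
    }
}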
Why were multiple threads operating on the same HashMap?

Processing logic

After receiving a request, the main thread hands the work to four sub-threads that compute partial results, then merges them. If a sub-thread fails or times out, its result is discarded and not merged.

The logs show that all four sub-threads had timed out, yet because of the concurrent HashMap operations they were still running. Since the main thread had discarded their results, the serialized response should have been small. Where, then, did the huge byte[] come from? (A sketch of the fan-out pattern follows.)
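A rough sketch of that fan-out/merge pattern (the class names, the thread pool, and the 500 ms timeout are all assumptions, not taken from the application). The point it illustrates: giving up on a Future after a timeout does not stop the worker, so a worker stuck inside a corrupted HashMap simply keeps spinning on a core.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class FanOutMergeSketch {
    static class SearchResult { } // placeholder for the real result type

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        List<Future<SearchResult>> futures = new ArrayList<Future<SearchResult>>();
        for (int i = 0; i < 4; i++) {
            futures.add(pool.submit(new Callable<SearchResult>() {
                public SearchResult call() {
                    // compute one partial result; if this thread falls into an
                    // endless HashMap loop, it never returns from here
                    return new SearchResult();
                }
            }));
        }

        List<SearchResult> merged = new ArrayList<SearchResult>();
        for (Future<SearchResult> f : futures) {
            try {
                merged.add(f.get(500, TimeUnit.MILLISECONDS)); // wait, then give up
            } catch (TimeoutException | ExecutionException e) {
                // result discarded -- but the worker thread is NOT cancelled here
            }
        }
        pool.shutdown();
    }
}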

Tracing back to where the main thread sets up the tasks provides the clue: the result objects the four sub-threads work on are all copied from the main thread's object:

BeanUtils.copyProperties(main, rsp);
This release added a HashMap field to the result object:

private Map<Long, Map<ItemGroupType, ItemDO>> agentItemGroup;

public Map<Long, Map<ItemGroupType, ItemDO>> getAgentItemGroup() {
    if (agentItemGroup == null) {
        agentItemGroup = new HashMap<Long, Map<ItemGroupType, ItemDO>>();
    }
    return agentItemGroup;
}

public void setAgentItemGroup(Map<Long, Map<ItemGroupType, ItemDO>> agentItemGroup) {
    this.agentItemGroup = agentItemGroup;
}
The getter for agentItemGroup checks for null and lazily creates a new map if needed.

org.springframework.beans.BeanUtils#copyProperties(java.lang.Object, java.lang.Object) assigns properties by calling the getter on the source and the setter on the target (see org.springframework.beans.BeanUtils#copyProperties(java.lang.Object, java.lang.Object, java.lang.Class, java.lang.String[])). It copies only the reference, so the four sub-threads end up operating on the very same map as the main thread; even though their results were later discarded, the main thread's map had already been corrupted.
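A small illustration of that shallow copy (the DTO below is invented and much simpler than the real one): copyProperties calls getAgentItemGroup() on the source, which lazily creates the map, and then hands that same reference to setAgentItemGroup() on every target.

import java.util.HashMap;
import java.util.Map;
import org.springframework.beans.BeanUtils;

public class CopySharesMapDemo {
    public static class Rsp {
        private Map<Long, String> agentItemGroup;

        public Map<Long, String> getAgentItemGroup() {
            if (agentItemGroup == null) {
                agentItemGroup = new HashMap<Long, String>(); // lazily created on first get
            }
            return agentItemGroup;
        }

        public void setAgentItemGroup(Map<Long, String> agentItemGroup) {
            this.agentItemGroup = agentItemGroup;
        }
    }

    public static void main(String[] args) {
        Rsp main = new Rsp();
        Rsp rsp1 = new Rsp();
        Rsp rsp2 = new Rsp();

        BeanUtils.copyProperties(main, rsp1); // copies the map reference, not the map
        BeanUtils.copyProperties(main, rsp2);

        // All three objects now share one HashMap instance, so sub-threads
        // writing into rsp1/rsp2 are really writing into main's map.
        System.out.println(main.getAgentItemGroup() == rsp1.getAgentItemGroup()); // true
        System.out.println(rsp1.getAgentItemGroup() == rsp2.getAgentItemGroup()); // true
    }
}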

com.taobao.hsf.com.caucho.hessian.io.MapSerializer.writeObject(MapSerializer.java:99)

When HSF serializes the response, MapSerializer iterates over the map:

public void writeObject(Object obj, AbstractHessianOutput out) throws IOException {
    if (!out.addRef(obj)) {
        Map map = (Map) obj;

        Class cl = obj.getClass();
        if (!cl.equals(HashMap.class) && this._isSendJavaType && obj instanceof Serializable) {
            out.writeMapBegin(obj.getClass().getName());
        } else {
            out.writeMapBegin((String) null);
        }

        Iterator iter = map.entrySet().iterator();

        while (iter.hasNext()) {
            Entry entry = (Entry) iter.next();

            out.writeObject(entry.getKey());
            out.writeObject(entry.getValue());
        }

        out.writeMapEnd();
    }
}
Because the main thread's map has become a ring, this iteration never terminates, and the serialization buffer keeps growing until it becomes the multi-gigabyte byte[] seen in the heap.

At this point the full GC problem is fully explained. The fix:

BeanUtils.copyProperties(main, rsp);
rsp.setAgentItemGroup(new HashMap<Long, Map<ItemGroupType, ItemDO>>());

After the copy, each response object gets its own empty map, so the sub-threads no longer share (and can no longer corrupt) the main thread's map.
Summary
HashMap has more than one pitfall in a concurrent environment. Code is often written without the concurrent case in mind and then unknowingly ends up running concurrently, which is how big mistakes happen. In hindsight, HashMap itself could be a bit more defensive: if get counted its traversal steps and threw an exception once they exceeded size, a corrupted map would fail fast instead of looping forever and dragging the JVM into full GC. (A rough sketch of that idea follows.)
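A rough, stand-alone sketch of that idea (the helper name and exception are invented; the real change would live inside HashMap.getEntry and count steps along the bucket chain, which application code cannot reach, so this version bounds the lookup from the outside via entrySet()):

import java.util.Map;

public class BoundedLookup {
    // Throws instead of looping forever when the map yields more entries than
    // it claims to contain -- the signature of an Entry list corrupted into a ring.
    public static <K, V> V boundedGet(Map<K, V> map, K key) {
        int steps = 0;
        int limit = map.size() + 1;
        for (Map.Entry<K, V> e : map.entrySet()) {
            if (++steps > limit) {
                throw new IllegalStateException("visited more entries than size(): map is probably corrupted");
            }
            K k = e.getKey();
            if (key == null ? k == null : key.equals(k)) {
                return e.getValue();
            }
        }
        return null;
    }
}

This trades the O(1) lookup for a bounded scan, so it is only an illustration of failing fast; the real cure remains not sharing a plain HashMap across threads without synchronization.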
