Analysis on High CPU usage of HBase Thrift2
Contents

1. Symptom description
2. Problem locating
3. Solution
4. Related code
1. Symptom description
External connections to port 9090 time out, although telnet to the port always succeeds. Using the top command, we found individual threads running at up to 99.9% CPU, fluctuating rather than staying pegged. After traffic was migrated away from the machine, access mostly succeeded, but timeouts persisted, with read timeouts more frequent than write timeouts:
# ./hbase_stress --hbase=110.13.136.207:9090 --test=2 --timeout=10
[2016-11-27 10:15:21/771] [139756154767104/31562] [ERROR] [hbase_stress.cpp:302] TransportException (thrift://110.13.136.207:9090): EAGAIN (timed out)
[2016-11-27 10:15:31/775] [139756154767104/31562] [ERROR] [hbase_stress.cpp:302] TransportException (thrift://110.13.136.207:9090): EAGAIN (timed out)

  PID  USER      PR  NI     VIRT     RES      SHR  S  %CPU  %MEM     TIME+  COMMAND
20727  zhangsan  20   0  10.843g  9.262g  1326344  R  99.9  26.4   1448:00  java
20729  zhangsan  20   0  10.843g  9.262g  1326344  R  99.9  26.4   1448:00  java
20730  zhangsan  20   0  10.843g  9.262g  1326344  R  99.9  26.4   1449:10  java
20728  zhangsan  20   0  10.843g  9.262g  1326344  R  99.8  26.4   1448:00  java
20693  zhangsan  20   0  10.843g  9.262g  1326344  S   0.0  26.4    0:00.00 java
20727  zhangsan  20   0  10.843g  9.262g  1326344  R  75.5  26.4   1448:06  java
20728  zhangsan  20   0  10.843g  9.262g  1326344  R  75.2  26.4   1448:06  java
20729  zhangsan  20   0  10.843g  9.262g  1326344  R  75.2  26.4   1448:06  java
20730  zhangsan  20   0  10.843g  9.262g  1326344  R  75.2  26.4   1449:15  java
20716  zhangsan  20   0  10.843g  9.262g  1326344  S  24.9  26.4    93:48.75 java
2. Problem locating
Use the ps command to find the threads consuming the most CPU; the result is consistent with the top output:
$ ps -mp 20693 -o THREAD,tid,time | sort -rn
zhangsan  18.8  19  -  -       -  -  20730  1-00:11:23
zhangsan  18.7  19  -  -       -  -  20729  1-00:10:13
zhangsan  18.7  19  -  -       -  -  20728  1-00:10:13
zhangsan  18.7  19  -  -       -  -  20727  1-00:10:13
zhangsan  16.1  19  -  futex_  -  -  20731    20:44:51
zhangsan   5.2  19  -  futex_  -  -  20732    06:46:39
Then, with the help of jstack, we see what these threads are doing — they are GC worker threads:
"Gang worker#0 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d4000 nid=0x50f7 runnable
"Gang worker#1 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d5800 nid=0x50f8 runnable
"Gang worker#2 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d7800 nid=0x50f9 runnable
"Gang worker#3 (Parallel CMS Threads)" os_prio=0 tid=0x00007fb7200d9000 nid=0x50fa runnable
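jstack reports thread ids as hexadecimal nid values, while top and ps print decimal tids. Converting confirms the match: the busy tids 20727-20730 are exactly nid 0x50f7-0x50fa, the CMS gang workers above. A minimal check in Java (the same arithmetic works with printf '%x' in a shell):

```java
// Convert the decimal thread ids from top/ps into the hexadecimal
// "nid" form printed by jstack, to confirm they are the same threads.
public class TidToNid {
    public static void main(String[] args) {
        int[] hotTids = {20727, 20728, 20729, 20730}; // busy tids from top/ps
        for (int tid : hotTids) {
            System.out.println(tid + " -> nid=0x" + Integer.toHexString(tid));
        }
        // prints 20727 -> nid=0x50f7 ... 20730 -> nid=0x50fa
    }
}
```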
Checking GC with jstat confirms the problem is GC: the old generation is 100% full and full GCs run back to back.
$ jstat -gcutil 20693 1000 100
  S0     S1      E       O      M      CCS    YGC   YGCT    FGC       FGCT        GCT
  0.00  99.67  100.00  100.00  98.08  94.41  42199  0.132  27084  34869.601  35238.733
  0.00  99.67  100.00  100.00  98.08  94.41  42199  0.132  27084  34870.448  35239.580
  0.00  99.67  100.00  100.00  98.08  94.41  42199  0.132  27084  34870.448  35239.580
  0.00  99.67  100.00  100.00  98.08  94.41  42199  0.132  27084  34870.448  35239.580

$ jstat -gccapacity 20693
 NGCMN  NGCMX  NGC  S0C  S1C  EC  OGCMN  OGCMX  OGC  OC  MCMN  MCMX  MC  CCSMN  CCSMX  CCSC  YGC  FGC

$ jstat -gcold 20693
    MC       MU     CCSC    CCSU       OC         OU      YGC    FGC      FGCT        GCT
 31584.0  30978.7  3424.0  3232.7  8094144.0  8094144.0  42199  27174  34964.109  35333.479

$ jstat -gcoldcapacity 20693
  OGCMN      OGCMX       OGC         OC      YGC    FGC      FGCT        GCT
 383680.0  8094144.0  8094144.0  8094144.0  42199  27192  34982.623  35351.755

$ jstat -gcnewcapacity 20693
 NGCMN  NGCMX  NGC  S0CMX  S0C  S1CMX  S1C  ECMX  EC  YGC  FGC

$ jstat -gc 20693
 S0C  S1C  S0U  S1U  EC  EU  OC  OU  MC  MU  CCSC  CCSU  YGC  YGCT  FGC  FGCT  GCT

$ jstat -gcnew 20693
   S0C       S1C      S0U     S1U    TT MTT    DSS       EC        EU      YGC   YGCT
 110720.0  110720.0   0.0  110396.9   6   6  55360.0  886080.0  886080.0  42199  0.132
lsof shows that the process does not hold many connections; the count is well within safe limits. So the problem must be objects that cannot be reclaimed. Use jmap to examine memory in detail, starting with heap usage:
$ jmap -heap 20693
Attaching to process ID 20693, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.77-b03

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 9422503936 (8986.0MB)
   NewSize                  = 196411392 (187.3125MB)
   MaxNewSize               = 1134100480 (1081.5625MB)
   OldSize                  = 392888320 (374.6875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1020723200 (973.4375MB)
   used     = 1020398064 (973.1274261474609MB)
   free     = 325136 (0.3100738525390625MB)
   99.96814650632022% used
Eden Space:
   capacity = 907345920 (865.3125MB)
   used     = 907345920 (865.3125MB)
   free     = 0 (0.0MB)
   100.0% used
From Space:
   capacity = 113377280 (108.125MB)
   used     = 113052144 (107.81492614746094MB)
   free     = 325136 (0.3100738525390625MB)
   99.71322649476156% used
To Space:
   capacity = 113377280 (108.125MB)
   used     = 0 (0.0MB)
   free     = 113377280 (108.125MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 8288403456 (7904.4375MB)
   used     = 8288403424 (7904.437469482422MB)
   free     = 32 (3.0517578125E-5MB)
   99.9999996139184% used

10216 interned Strings occupying 942540 bytes.
The old generation is essentially 100% full. Next, look at the object histogram:
$ jmap -histo 20693
 num     #instances         #bytes  class name
----------------------------------------------
   1:      72835212     2518411456  [B
   2:      49827147     1993085880  java.util.TreeMap$Entry
   3:      12855993      617087664  java.util.TreeMap
   4:       4285217      445662568  org.apache.hadoop.hbase.client.ClientScanner
   5:       4285222      377099536  org.apache.hadoop.hbase.client.Scan
   6:       4284875      377069000  org.apache.hadoop.hbase.client.ScannerCallable
   7:       4285528      342921344  [Ljava.util.HashMap$Node;
   8:       4284880      308511360  org.apache.hadoop.hbase.client.ScannerCallableWithReplicas
   9:       8570671      274261472  java.util.LinkedList
  10:       4285579      205707792  java.util.HashMap
  11:       4285283      204103584  org.apache.hadoop.hbase.client.RpcRetryingCaller
  12:       3820914      152836560  org.apache.hadoop.hbase.filter.SingleColumnValueFilter
  13:       4291904      137340928  java.util.concurrent.ConcurrentHashMap$Node
  14:       8570636      137130176  java.util.TreeMap$EntrySet
  15:       4285278      137128896  org.apache.hadoop.hbase.io.TimeRange
  16:       8570479      137127664  java.util.concurrent.atomic.AtomicBoolean
  17:       2891409       92525088  org.apache.hadoop.hbase.NoTagsKeyValue
  18:       4286540       68584640  java.lang.Integer
  19:       4285298       68564768  java.util.TreeMap$KeySet
  20:       4285275       68564400  java.util.TreeSet
  21:       4285006       68560096  java.util.HashSet
  22:       4284851       68557616  java.util.HashMap$KeySet
  23:       3176118       50817888  org.apache.hadoop.hbase.filter.BinaryComparator
  24:           109       33607600  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  25:        418775       18479112  [Lorg.apache.hadoop.hbase.Cell;
  26:        671443       17693224  [C
  27:        418784       26751240  org.apache.hadoop.hbase.client.Result
  28:        669739       16073736  java.lang.String
  29:        644796       15475104  org.apache.hadoop.hbase.filter.SubstringComparator
  30:        419134       10059216  java.util.LinkedList$Node
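The histogram is dominated by roughly 4.28 million retained scanner object graphs (ClientScanner, Scan, ScannerCallable, RpcRetryingCaller all share nearly the same instance count). A back-of-the-envelope tally of the six largest entries shows why the 7.9 GB old generation is full; the byte counts below are copied from the histogram, the grouping is ours:

```java
// Sum the #bytes of the six largest jmap -histo entries to see how much
// of the old generation the leaked scanner graphs account for.
public class LeakTally {
    public static void main(String[] args) {
        long[] topBytes = {
            2_518_411_456L, // [B (byte arrays: row keys, values)
            1_993_085_880L, // java.util.TreeMap$Entry
              617_087_664L, // java.util.TreeMap
              445_662_568L, // scanner objects (histogram entry #4)
              377_099_536L, // org.apache.hadoop.hbase.client.Scan
              377_069_000L, // org.apache.hadoop.hbase.client.ScannerCallable
        };
        long sum = 0;
        for (long b : topBytes) sum += b;
        // prints: top 6 classes: 5.9 GB
        System.out.printf("top 6 classes: %.1f GB%n", sum / (1024.0 * 1024 * 1024));
    }
}
```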
To keep the system running, first apply a temporary mitigation and only then pursue the permanent fix: monitor GC and restart the HBase Thrift2 process periodically, while tracking down the root cause.
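For the monitoring half of the stopgap, the GC counters can also be read programmatically through the standard java.lang.management API; these are the same numbers jstat reports as YGC/FGC and YGCT/FGCT. A minimal sketch (any alerting threshold would be an assumption of your environment):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Print each collector's name, collection count, and accumulated time.
// A rapidly climbing old-generation count is the signal seen above.
public class GcWatch {
    public static void main(String[] args) {
        List<GarbageCollectorMXBean> gcs =
            ManagementFactory.getGarbageCollectorMXBeans();
        for (GarbageCollectorMXBean gc : gcs) {
            System.out.printf("%s: count=%d timeMs=%d%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```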
From the jmap output above, we can see that the problem is caused by scanners that are never closed. There are two reasons a scanner may end up unclosed: either the client program simply never closes it, which is a memory leak in the client logic, or the client intends to close it but fails to do so because of an exception.
Checking the client source code confirms the second case: if an exception is thrown after openScanner, the scanner is never closed. In addition, if the client process is killed or the machine loses power, the client can never release the scanner at all. That case has to be solved on the HBase Thrift2 side.
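The exception case on the client side is closed off by guaranteeing the close call on every path, e.g. with try/finally or try-with-resources. A self-contained sketch, where FakeScanner is a hypothetical stand-in for the Thrift client's scanner handle (the real client would call closeScanner(scannerId)):

```java
// Exception-safe scanner handling: the scanner is closed even when the
// read in the body throws, so the server-side entry can be released.
public class SafeClose {
    static class FakeScanner implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    // Runs the body, then closes the scanner even if the body throws.
    static void scanSafely(FakeScanner scanner, Runnable body) {
        try {
            body.run();
        } finally {
            scanner.close(); // mirrors closeScanner(scannerId) in the real client
        }
    }

    public static void main(String[] args) {
        FakeScanner s = new FakeScanner();
        try {
            scanSafely(s, () -> { throw new RuntimeException("read failed"); });
        } catch (RuntimeException expected) { /* propagated after close */ }
        System.out.println("closed=" + s.closed); // prints closed=true
    }
}
```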
3. Solution
To solve the preceding problems:
1) The client ensures that every scanner it opens is released;
2) HBase Thrift2 adds automatic release of scanners that have been idle for a long time;
3) Alternatively, use getScannerResults instead of getScannerRows: it opens the scanner, fetches the results, and closes the scanner within a single call, so the leak cannot occur.
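Fix 2) can be sketched as a periodic sweep over the scanner map that closes entries idle beyond a threshold. This is an illustrative sketch only, not the actual HBASE-17182 implementation; all names and the timeout value are assumptions:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: wrap each scanner with a last-access timestamp, and periodically
// close and remove entries that have been idle longer than a threshold.
public class IdleScannerSweeper {
    static class ScannerEntry {
        final AutoCloseable scanner;
        volatile long lastAccessMs; // refreshed on every getScannerRows call
        ScannerEntry(AutoCloseable scanner, long nowMs) {
            this.scanner = scanner;
            this.lastAccessMs = nowMs;
        }
    }

    final Map<Integer, ScannerEntry> scannerMap = new ConcurrentHashMap<>();

    // Close and remove every scanner idle for more than idleTimeoutMs;
    // returns how many entries were evicted.
    int sweep(long nowMs, long idleTimeoutMs) {
        int removed = 0;
        for (Iterator<Map.Entry<Integer, ScannerEntry>> it =
                 scannerMap.entrySet().iterator(); it.hasNext();) {
            ScannerEntry e = it.next().getValue();
            if (nowMs - e.lastAccessMs > idleTimeoutMs) {
                try { e.scanner.close(); } catch (Exception ignored) { }
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

In a real server the sweep would run on a ScheduledExecutorService, and every scanner access would refresh lastAccessMs so active scanners are never evicted.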
Patch:
https://issues.apache.org/jira/browse/HBASE-17182
4. Related code
private final Map<Integer, ResultScanner> scannerMap =
    new ConcurrentHashMap<Integer, ResultScanner>();

@Override
public int openScanner(ByteBuffer table, TScan scan) throws TIOError, TException {
  Table htable = getTable(table);
  ResultScanner resultScanner = null;
  try {
    resultScanner = htable.getScanner(scanFromThrift(scan));
  } catch (IOException e) {
    throw getTIOError(e);
  } finally {
    closeTable(htable);
  }
  // The scanner is added to scannerMap. If the client never calls
  // closeScanner, the entry stays forever and GC cannot reclaim the
  // scanner or the objects it references: this is the memory leak.
  return addScanner(resultScanner);
}

/**
 * Assigns a unique ID to the scanner and adds the mapping to an internal HashMap.
 * @param scanner to add
 * @return Id for this Scanner
 */
private int addScanner(ResultScanner scanner) {
  int id = nextScannerId.getAndIncrement();
  scannerMap.put(id, scanner); // put the scanner into scannerMap
  return id;
}

/**
 * Returns the Scanner associated with the specified Id.
 * @param id of the Scanner to get
 * @return a Scanner, or null if the Id is invalid
 */
private ResultScanner getScanner(int id) {
  return scannerMap.get(id);
}

@Override
public void closeScanner(int scannerId) throws TIOError, TIllegalArgument, TException {
  LOG.debug("scannerClose: id=" + scannerId);
  ResultScanner scanner = getScanner(scannerId);
  if (scanner == null) {
    String message = "scanner ID is invalid";
    LOG.warn(message);
    TIllegalArgument ex = new TIllegalArgument();
    ex.setMessage("Invalid scanner Id");
    throw ex;
  }
  scanner.close();          // close the scanner
  removeScanner(scannerId); // remove the scanner from scannerMap
}

/**
 * Removes the scanner associated with the specified ID from the internal HashMap.
 * @param id of the Scanner to remove
 * @return the removed Scanner, or <code>null</code> if the Id is invalid
 */
protected ResultScanner removeScanner(int id) {
  return scannerMap.remove(id); // remove the scanner from scannerMap
}