Analysis of High CPU Usage in HBase Thrift2

Source: Internet
Author: User
Tags: high CPU usage

Directory

1. Symptom Description

2. Problem Locating

3. Solution

4. Related Code

1. Symptom Description

Connections to port 9090 from external clients time out, yet telnet to the port always succeeds. top shows individual threads at up to 99.9% CPU, fluctuating rather than pinned there. After traffic is migrated off the machine, access succeeds again, but timeouts still occur, with read timeouts more frequent than write timeouts:

# ./hbase_stress --hbase=110.13.136.207:9090 --test=2 --timeout=10

[2016-11-27 10:15:21/771] [139756154767104/31562] [ERROR] [hbase_stress.cpp:302] TransportException (thrift://110.13.136.207:9090): EAGAIN (timed out)

[2016-11-27 10:15:31/775] [139756154767104/31562] [ERROR] [hbase_stress.cpp:302] TransportException (thrift://110.13.136.207:9090): EAGAIN (timed out)

  PID USER     PR NI    VIRT    RES     SHR S  %CPU %MEM    TIME+ COMMAND
20727 zhangsan 20  0 10.843g 9.262g 1326344 R  99.9 26.4  1448:00 java
20729 zhangsan 20  0 10.843g 9.262g 1326344 R  99.9 26.4  1448:00 java
20730 zhangsan 20  0 10.843g 9.262g 1326344 R  99.9 26.4  1449:10 java
20728 zhangsan 20  0 10.843g 9.262g 1326344 R  99.8 26.4  1448:00 java
20693 zhangsan 20  0 10.843g 9.262g 1326344 S   0.0 26.4  0:00.00 java

20727 zhangsan 20  0 10.843g 9.262g 1326344 R  75.5 26.4  1448:06 java
20728 zhangsan 20  0 10.843g 9.262g 1326344 R  75.2 26.4  1448:06 java
20729 zhangsan 20  0 10.843g 9.262g 1326344 R  75.2 26.4  1448:06 java
20730 zhangsan 20  0 10.843g 9.262g 1326344 R  75.2 26.4  1449:15 java
20716 zhangsan 20  0 10.843g 9.262g 1326344 S  24.9 26.4 93:48.75 java

2. Problem Locating

Use the ps command to find the threads consuming the most CPU; the result is consistent with what top displays:

$ ps -mp 20693 -o THREAD,tid,time | sort -rn
zhangsan 18.8  19 - -      - - 20730 1-00:11:23
zhangsan 18.7  19 - -      - - 20729 1-00:10:13
zhangsan 18.7  19 - -      - - 20728 1-00:10:13
zhangsan 18.7  19 - -      - - 20727 1-00:10:13
zhangsan 16.1  19 - futex_ - - 20731   20:44:51
zhangsan  5.2  19 - futex_ - - 20732   06:46:39

Then use jstack to see what these threads are. top and ps report thread IDs in decimal, while jstack prints them as hexadecimal nid values, so convert before matching (see the snippet after the jstack output). The busy threads turn out to be the CMS GC worker threads:

"Gangworker #0 (ParallelCMSThreads)" OS _prio = 0tid = 0x00007fb7200d4000nid = 0x50f7runnable

"Gangworker #1 (ParallelCMSThreads)" OS _prio = 0tid = 0x00007fb7200d5800nid = 0x50f8runnable

"Gangworker #2 (ParallelCMSThreads)" OS _prio = 0tid = 0x00007fb7200d7800nid = 0x50f9runnable

"Gangworker #3 (ParallelCMSThreads)" OS _prio = 0tid = 0x00007fb7200d9000nid = 0x50farunnable

Check GC with jstat. The old generation is pegged at 100% and the Full GC count keeps climbing across successive commands, so the CPU is being burned by back-to-back GC:

$ jstat -gcutil 20693 1000 100

  S0     S1      E       O      M     CCS    YGC     YGCT    FGC       FGCT        GCT
 0.00  99.67  100.00  100.00  98.08  94.41  42199  369.132  27084  34869.601  35238.733
 0.00  99.67  100.00  100.00  98.08  94.41  42199  369.132  27084  34870.448  35239.580
 0.00  99.67  100.00  100.00  98.08  94.41  42199  369.132  27084  34870.448  35239.580
 0.00  99.67  100.00  100.00  98.08  94.41  42199  369.132  27084  34870.448  35239.580

$ jstat -gccapacity 20693

 NGCMN  NGCMX  NGC  S0C  S1C  EC  OGCMN  OGCMX  OGC  OC  MCMN  MCMX  MC  CCSMN  CCSMX  CCSC  YGC  FGC

$ jstat -gcold 20693

   MC       MU     CCSC    CCSU      OC         OU      YGC    FGC       FGCT        GCT
 31584.0  30978.7  3424.0  3232.7  8094144.0  8094144.0  42199  27174  34964.109  35333.479

$ jstat -gcoldcapacity 20693

  OGCMN      OGCMX       OGC        OC      YGC    FGC       FGCT        GCT
 383680.0  8094144.0  8094144.0  8094144.0  42199  27192  34982.623  35351.755

$ jstat -gcnewcapacity 20693

 NGCMN  NGCMX  NGC  S0CMX  S0C  S1CMX  S1C  ECMX  EC  YGC  FGC

$ jstat -gc 20693

 S0C  S1C  S0U  S1U  EC  EU  OC  OU  MC  MU  CCSC  CCSU  YGC  YGCT  FGC  FGCT  GCT

$ jstat -gcnew 20693

   S0C       S1C      S0U     S1U     TT MTT   DSS       EC        EU      YGC     YGCT
 110720.0  110720.0   0.0  110396.9   6   6  55360.0  886080.0  886080.0  42199  369.132

lsof shows that the process does not hold many connections, well within safe limits, so the problem must be objects that GC cannot reclaim. Use jmap to view the memory details, starting with heap usage:

$ jmap -heap 20693

Attaching to process ID 20693, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.77-b03

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 9422503936 (8986.0MB)
   NewSize                  = 196411392 (187.3125MB)
   MaxNewSize               = 1134100480 (1081.5625MB)
   OldSize                  = 392888320 (374.6875MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 1020723200 (973.4375MB)
   used     = 1020398064 (973.1274261474609MB)
   free     = 325136 (0.3100738525390625MB)
   99.96814650632022% used
Eden Space:
   capacity = 907345920 (865.3125MB)
   used     = 907345920 (865.3125MB)
   free     = 0 (0.0MB)
   100.0% used
From Space:
   capacity = 113377280 (108.125MB)
   used     = 113052144 (107.81492614746094MB)
   free     = 325136 (0.3100738525390625MB)
   99.71322649476156% used
To Space:
   capacity = 113377280 (108.125MB)
   used     = 0 (0.0MB)
   free     = 113377280 (108.125MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 8288403456 (7904.4375MB)
   used     = 8288403424 (7904.437469482422MB)
   free     = 32 (3.0517578125E-5MB)
   99.9999996139184% used

10216 interned Strings occupying 942540 bytes.

The old generation is effectively 100% full. Drill down into the object histogram:

$ jmap -histo 20693

 num     #instances         #bytes  class name
----------------------------------------------
   1:      72835212     2518411456  [B
   2:      49827147     1993085880  java.util.TreeMap$Entry
   3:      12855993      617087664  java.util.TreeMap
   4:       4285217      445662568  org.apache.hadoop.hbase.client.ClientScanner
   5:       4285222      377099536  org.apache.hadoop.hbase.client.Scan
   6:       4284875      377069000  org.apache.hadoop.hbase.client.ScannerCallable
   7:       4285528      342921344  [Ljava.util.HashMap$Node;
   8:       4284880      308511360  org.apache.hadoop.hbase.client.ScannerCallableWithReplicas
   9:       8570671      274261472  java.util.LinkedList
  10:       4285579      205707792  java.util.HashMap
  11:       4285283      204103584  org.apache.hadoop.hbase.client.RpcRetryingCaller
  12:       3820914      152836560  org.apache.hadoop.hbase.filter.SingleColumnValueFilter
  13:       4291904      137340928  java.util.concurrent.ConcurrentHashMap$Node
  14:       8570636      137130176  java.util.TreeMap$EntrySet
  15:       4285278      137128896  org.apache.hadoop.hbase.io.TimeRange
  16:       8570479      137127664  java.util.concurrent.atomic.AtomicBoolean
  17:       2891409       92525088  org.apache.hadoop.hbase.NoTagsKeyValue
  18:       4286540       68584640  java.lang.Integer
  19:       4285298       68564768  java.util.TreeMap$KeySet
  20:       4285275       68564400  java.util.TreeSet
  21:       4285006       68560096  java.util.HashSet
  22:       4284851       68557616  java.util.HashMap$KeySet
  23:       3176118       50817888  org.apache.hadoop.hbase.filter.BinaryComparator
  24:           109       33607600  [Ljava.util.concurrent.ConcurrentHashMap$Node;
  25:        418775       18479112  [Lorg.apache.hadoop.hbase.Cell;
  26:        671443       17693224  [C
  27:        418784       26751240  org.apache.hadoop.hbase.client.Result
  28:        669739       16073736  java.lang.String
  29:        644796       15475104  org.apache.hadoop.hbase.filter.SubstringComparator
  30:        419134       10059216  java.util.LinkedList$Node

To keep the system working before the root cause is fixed, we first applied a stopgap: monitor GC (for example, alert when jstat -gcutil shows the old generation at 100% with a climbing FGC count) and restart the HBase Thrift2 process periodically. Then track down the root cause for the permanent fix.

From the jmap output above we can see what is going on: the heap is dominated by scanner-related objects (ClientScanner, Scan, ScannerCallable, and their LinkedLists and TreeMaps), which points to scanners that are opened but never closed. There are two ways a scanner ends up unclosed: either the client code simply never closes it (a client bug, i.e. a memory leak), or the client intends to close it but an exception prevents the close from ever running.

Checking the client source code confirms the second case: when an exception is thrown after the scanner is opened, it is never closed. In addition, if the client process is killed or the machine loses power, the client cannot release the scanner at all, so this is a problem HBase Thrift2 itself has to solve.
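A minimal sketch of that failure mode, using the generated Thrift2 client API (THBaseService.Client); the table name and batch size here are illustrative:

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.hadoop.hbase.thrift2.generated.THBaseService;
import org.apache.hadoop.hbase.thrift2.generated.TResult;
import org.apache.hadoop.hbase.thrift2.generated.TScan;

public class LeakyScanExample {
    // `client` must already be connected over a Thrift transport.
    static void leakyScan(THBaseService.Client client) throws Exception {
        ByteBuffer table = ByteBuffer.wrap("my_table".getBytes("UTF-8"));
        int scannerId = client.openScanner(table, new TScan()); // server adds scanner to scannerMap
        List<TResult> rows = client.getScannerRows(scannerId, 100); // may throw TIOError/TException
        // If the call above throws, or the client process dies here, the close
        // below never runs and the scannerMap entry on the Thrift2 server leaks.
        client.closeScanner(scannerId);
    }
}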

3. Solution

Solve the preceding problems as follows:

1) The client guarantees that every scanner it opens is released, for example by closing it in a finally block (sketched below);

2) HBase Thrift2 adds automatic release of scanners that have not been operated on for a long time;

3) Alternatively, use getScannerResults instead of getScannerRows; it performs the whole scan inside a single call, so no server-side scanner state is left for the client to clean up (also sketched below).
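Minimal sketches of fixes 1) and 3), under the same assumptions as the example above (an already-connected THBaseService.Client; table name and row counts are illustrative):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.hadoop.hbase.thrift2.generated.THBaseService;
import org.apache.hadoop.hbase.thrift2.generated.TResult;
import org.apache.hadoop.hbase.thrift2.generated.TScan;

public class SafeScanExample {
    // Fix 1): always close the scanner, even when reading throws.
    static void safeScan(THBaseService.Client client) throws Exception {
        ByteBuffer table = ByteBuffer.wrap("my_table".getBytes("UTF-8"));
        int scannerId = client.openScanner(table, new TScan());
        try {
            List<TResult> rows;
            do {
                rows = client.getScannerRows(scannerId, 100);
                // ... process rows ...
            } while (!rows.isEmpty());
        } finally {
            client.closeScanner(scannerId); // guaranteed cleanup, exception or not
        }
    }

    // Fix 3): getScannerResults opens, reads, and closes the scanner inside
    // one call, so no scanner id is ever handed to the client.
    static List<TResult> oneShotScan(THBaseService.Client client) throws Exception {
        ByteBuffer table = ByteBuffer.wrap("my_table".getBytes("UTF-8"));
        return client.getScannerResults(table, new TScan(), 1000);
    }
}

Note that fix 1) still cannot help when the client process is killed or the machine loses power, which is exactly why the server-side expiry of fix 2) is needed as well.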

Patch:

https://issues.apache.org/jira/browse/HBASE-17182
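For fix 2), one plausible shape for the server-side change is to replace the plain ConcurrentHashMap with a time-bounded cache that closes scanners on eviction. The sketch below uses Guava's CacheBuilder purely to illustrate the idea; the actual change is the patch in HBASE-17182:

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;
import org.apache.hadoop.hbase.client.ResultScanner;

public class ExpiringScannerMap {
    // Scanners untouched for 60 seconds (an illustrative timeout) are evicted,
    // and the removal listener closes them, so the server reclaims the memory
    // even when a client dies without ever calling closeScanner.
    private final Cache<Integer, ResultScanner> scannerMap = CacheBuilder.newBuilder()
        .expireAfterAccess(60, TimeUnit.SECONDS)
        .removalListener(new RemovalListener<Integer, ResultScanner>() {
            @Override
            public void onRemoval(RemovalNotification<Integer, ResultScanner> n) {
                if (n.wasEvicted()) {
                    n.getValue().close(); // release the abandoned scanner
                }
            }
        })
        .build();

    ResultScanner get(int id) {
        // Guava evicts lazily during cache operations; regular lookups keep
        // live scanners fresh and sweep out expired ones.
        return scannerMap.getIfPresent(id);
    }
}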

4. Related Code

// Excerpt from HBase Thrift2's ThriftHBaseServiceHandler.

private final AtomicInteger nextScannerId = new AtomicInteger(0);
private final Map<Integer, ResultScanner> scannerMap =
    new ConcurrentHashMap<Integer, ResultScanner>();

@Override
public int openScanner(ByteBuffer table, TScan scan) throws TIOError, TException {
  Table htable = getTable(table);
  ResultScanner resultScanner = null;
  try {
    resultScanner = htable.getScanner(scanFromThrift(scan));
  } catch (IOException e) {
    throw getTIOError(e);
  } finally {
    closeTable(htable);
  }
  // The scanner is added to scannerMap here. If the client never calls
  // closeScanner, the entry is never removed, so GC cannot reclaim the
  // scanner or anything it references, and that part of memory leaks.
  return addScanner(resultScanner);
}

/**
 * Assigns a unique ID to the scanner and adds the mapping to an internal HashMap.
 * @param scanner to add
 * @return Id for this Scanner
 */
private int addScanner(ResultScanner scanner) {
  int id = nextScannerId.getAndIncrement();
  scannerMap.put(id, scanner); // put the scanner into scannerMap
  return id;
}

/**
 * Returns the Scanner associated with the specified Id.
 * @param id of the Scanner to get
 * @return a Scanner, or null if the Id is invalid
 */
private ResultScanner getScanner(int id) {
  return scannerMap.get(id);
}

@Override
public void closeScanner(int scannerId) throws TIOError, TIllegalArgument, TException {
  LOG.debug("scannerClose: id=" + scannerId);
  ResultScanner scanner = getScanner(scannerId);
  if (scanner == null) {
    String message = "scanner ID is invalid";
    LOG.warn(message);
    TIllegalArgument ex = new TIllegalArgument();
    ex.setMessage("Invalid scanner Id");
    throw ex;
  }
  scanner.close(); // close the scanner itself
  removeScanner(scannerId); // remove the scanner from scannerMap
}

/**
 * Removes the scanner associated with the specified ID from the internal HashMap.
 * @param id of the Scanner to remove
 * @return the removed Scanner, or <code>null</code> if the Id is invalid
 */
protected ResultScanner removeScanner(int id) {
  return scannerMap.remove(id);
}
