Hadoop cluster NameNode (standby) exception hang problem

Source: Internet
Author: User
Tags: memory usage, zookeeper

2018-02-24

On February 22, we discovered that the NameNode (standby) on server NAMENODE02 had hung. Checking the Hadoop log /app/hadoop/logs/hadoop-appadm-namenode-prd-bldb-hdp-name02.log, we found that the first java.lang.OutOfMemoryError was reported at 2018-02-17 03:29:34. The specific error message is as follows:

2018-02-17 03:29:34,485 ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception initializing http://datanode01:8480/getJournal?jid=cluster1&segmentTxId=2187844&storageInfo=-63%3A1002064722%3A1516782893469%3ACID-02428012-28ec-4c03-b5ba-bfec77c3a32b
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
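"Unable to create new native thread" is an OS-level failure, not a full Java heap: the kernel refused to create another thread, either because a per-user or system-wide thread limit was hit or because native memory was exhausted. A quick sketch of the checks one might run on the NameNode host (Linux assumed; the exact limits vary per system):

```shell
# Per-user process/thread limit for the current shell's user
ulimit -u
# System-wide thread ceiling
cat /proc/sys/kernel/threads-max
# Native memory headroom left for new thread stacks
awk '/^MemAvailable:/ {print $2 " kB available"}' /proc/meminfo
```

If `ulimit -u` is far below the NameNode's thread count, raising `nproc` in /etc/security/limits.conf is the usual fix; if MemAvailable is near zero, the problem is host memory, as it turned out to be here.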

Then, at 2018-02-17 03:34:34, the standby NN was shut down:

2018-02-17 03:34:34,495 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
        at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java)
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getEditLogManifest(IPCLoggerChannel.java:553)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getEditLogManifest(AsyncLoggerSet.java)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:474)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1590)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1614)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:216)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:342)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$...(EditLogTailer.java:295)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:312)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:455)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:308)
2018-02-17 03:34:34 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

The same day, we checked the system memory usage and found that memory was indeed insufficient, so we manually freed memory and restarted the NAMENODE02 node:

free
sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
# Executed on the NAMENODE02 node:
su - appadm
hadoop-daemon.sh start namenode
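The writes to /proc/sys/vm/drop_caches free reclaimable kernel memory: 1 drops the page cache, 3 drops the page cache plus dentries and inodes. A minimal sketch of verifying how much was actually freed, by comparing /proc/meminfo before and after (root is required for the write; without root the write is silently skipped here):

```shell
# Snapshot page-cache usage, drop the caches, then compare (run as root).
before=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
sync                                             # flush dirty pages first
echo 3 2>/dev/null > /proc/sys/vm/drop_caches    # needs root; error suppressed otherwise
after=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo "cached: ${before} kB -> ${after} kB"
```

Note that dropping caches only releases memory the kernel would reclaim under pressure anyway; it does not fix a process (such as the NameNode JVM) that holds too much anonymous memory.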

To verify that the namenode (standby) hang was indeed caused by the NameNode node running out of memory, the developer increased the frequency of the MapReduce job runs. To simulate long-running conditions as quickly as possible, a job that previously ran once a day was changed to run once every 5 minutes.

After the job had run for 2 days, the NameNode host's historical memory usage trend graph from the cloud platform CAS monitor looked as follows:

The job run frequency was increased between 17:00 and 18:00 on the 22nd. Before 18:20 on the 22nd, memory utilization held at around 40%; from 18:20 to 19:10 it grew linearly to 70% and stayed at that level until 14:42 on the 23rd, after which it grew slowly, breaking the 80% threshold. Between 0:00 and 1:00 on the 24th, it reached a peak of 90%.

[Figure: NameNode host memory usage trend, 22nd-24th]

According to a StackOverflow description of this problem, the asker had 12 GB of physical memory and was advised to set the -Xmx value to 3/4 of physical memory. In our production environment, the NameNode has 8 GB of physical memory and the DataNodes have 125 GB.
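By that 3/4 rule, the 8 GB NameNode host would allow an -Xmx of up to 6 GB; the change recorded below settles on a more conservative 4 GB. A small sketch of computing the upper bound from the host's actual memory:

```shell
# Read total physical memory (kB) and print 3/4 of it in MB as the -Xmx bound.
phys_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
xmx_mb=$((phys_kb / 1024 * 3 / 4))
echo "suggested -Xmx upper bound: ${xmx_mb}m"
```

Leaving headroom below the 3/4 bound is deliberate: the OS page cache, DataNode-side processes, and the JVM's own native overhead (thread stacks, metaspace) all live outside -Xmx.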

https://stackoverflow.com/questions/9703436/hadoop-heap-space-and-gc-problems

2018-2-24 18:43, Hadoop production cluster change.
Executed the following on each of the 5 servers:
vim /app/hadoop/etc/hadoop/hadoop-env.sh
Add the following parameter:
export HADOOP_OPTS="-XX:+UseParallelGC -Xmx4g"
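Rather than editing each of the 5 servers by hand in vim, the line can be appended idempotently. A sketch, demonstrated on a temp file so it can be run safely; point ENV_FILE at the real /app/hadoop/etc/hadoop/hadoop-env.sh when applying the change:

```shell
# Idempotently add the JVM options to hadoop-env.sh (demo on a temp file;
# set ENV_FILE to /app/hadoop/etc/hadoop/hadoop-env.sh for the real change).
ENV_FILE=$(mktemp)
LINE='export HADOOP_OPTS="-XX:+UseParallelGC -Xmx4g"'
grep -qxF "$LINE" "$ENV_FILE" || printf '%s\n' "$LINE" >> "$ENV_FILE"
grep -qxF "$LINE" "$ENV_FILE" || printf '%s\n' "$LINE" >> "$ENV_FILE"  # re-run adds nothing
grep -cxF "$LINE" "$ENV_FILE"  # prints 1: the line was added exactly once
```

Note that HADOOP_OPTS is inherited by every Hadoop daemon started from this environment, so the -Xmx4g applies to the NameNode and DataNode processes alike after they are restarted.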

To facilitate future operations, the cluster restart procedure is recorded here.

Hadoop/Hive/HBase/ZooKeeper cluster restart procedure

# 1. Stop Hive: on namenode01, stop hiveserver2
lsof -i :9999 | grep -v "ID" | awk '{print "kill -9",$2}' | sh
# 2. Stop HBase: on namenode01
stop-hbase.sh
# 3. Stop Hadoop: on namenode01
stop-all.sh
# 4. Stop ZooKeeper: on the 3 datanode nodes
zkServer.sh stop
zkServer.sh status
# Manually free Linux system memory
sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
# 5. Start ZooKeeper: on the 3 datanode nodes
zkServer.sh start
zkServer.sh status
# 6. Start Hadoop: on namenode01
start-all.sh
# On namenode02, restart the NameNode
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode
# HDFS namenode01:9000 (Active) web UI:  http://172.31.132.71:50070/
# HDFS namenode02:9000 (Standby) web UI: http://172.31.132.72:50070/
# YARN web UI: http://172.31.132.71:8088/
# 7. Start HBase: on namenode01 and namenode02, execute start-hbase.sh respectively
start-hbase.sh
# Master web UI:        http://172.31.132.71:60010/
# Backup Master web UI: http://172.31.132.72:60010/
# RegionServer web UIs: http://172.31.132.73:60030/ http://172.31.132.74:60030/ http://172.31.132.75:60030/
# 8. Start Hive: on namenode01, start hiveserver2
hive --service hiveserver2 &
# On datanode01, start the metastore
hive --service metastore &
# On namenode01, start HWI (web interface)
hive --service hwi &
# HWI web UI: http://172.31.132.71:9999/hwi
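The kill pipeline in step 1 parses the PID column out of lsof's tabular output. An equivalent form using `lsof -t`, which prints bare PIDs and so avoids the awk parsing, might look like this (a sketch; assumes lsof is installed on namenode01):

```shell
# Kill whatever is listening on HiveServer2's port 9999 (run on namenode01).
pids=$(lsof -t -i :9999 2>/dev/null)
if [ -n "$pids" ]; then
  kill -9 $pids
else
  echo "nothing listening on :9999"
fi
```

The `-t` flag also drops lsof's header line, so the `grep -v "ID"` filter from the original pipeline is no longer needed.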

