Hadoop cluster NameNode (standby) exception hang problem

Source: Internet
Author: User
Tags: memory usage, zookeeper

2018-02-24

On February 22, we discovered that the NameNode (standby) on server NAMENODE02 had hung. Checking the Hadoop log /app/hadoop/logs/hadoop-appadm-namenode-prd-bldb-hdp-name02.log, we found that the first java.lang.OutOfMemoryError was reported at 2018-02-17 03:29:34. The specific error message is as follows:

2018-02-17 03:29:34,485 ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception initializing http://datanode01:8480/getJournal?jid=cluster1&segmentTxId=2187844&storageInfo=-63%3A1002064722%3A1516782893469%3ACID-02428012-28ec-4c03-b5ba-bfec77c3a32b
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
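"Unable to create new native thread" is an OS-level failure, not a full Java heap: the kernel refused to create another thread, either because a per-user or system-wide thread limit was hit or because native memory was exhausted. A quick sketch of the checks one might run on the NameNode host (Linux assumed; the exact limits vary per system):

```shell
# Per-user process/thread limit for the current shell's user
ulimit -u
# System-wide thread ceiling
cat /proc/sys/kernel/threads-max
# Native memory headroom left for new thread stacks
awk '/^MemAvailable:/ {print $2 " kB available"}' /proc/meminfo
```

If `ulimit -u` is far below the NameNode's thread count, raising `nproc` in /etc/security/limits.conf is the usual fix; if MemAvailable is near zero, the problem is host memory, as it turned out to be here.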

Then, at 2018-02-17 03:34:34, the standby NN was shut down:

2018-02-17 03:34:34,495 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown error encountered while tailing edits. Shutting down standby NN.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
        at com.google.common.util.concurrent.MoreExecutors$ListeningDecorator.execute(MoreExecutors.java)
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java)
        at org.apache.hadoop.hdfs.qjournal.client.IPCLoggerChannel.getEditLogManifest(IPCLoggerChannel.java:553)
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getEditLogManifest(AsyncLoggerSet.java)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:474)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:278)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1590)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1614)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:216)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:342)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$...(EditLogTailer.java:295)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:312)
        at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:455)
        at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:308)
2018-02-17 03:34:34 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

The same day, we checked the system memory usage and found that memory was indeed insufficient, so we manually freed memory and restarted the NAMENODE02 node:

free
sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
# Executed on the NAMENODE02 node:
su - appadm
hadoop-daemon.sh start namenode
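The writes to /proc/sys/vm/drop_caches free reclaimable kernel memory: 1 drops the page cache, 3 drops the page cache plus dentries and inodes. A minimal sketch of verifying how much was actually freed, by comparing /proc/meminfo before and after (root is required for the write; without root the write is silently skipped here):

```shell
# Snapshot page-cache usage, drop the caches, then compare (run as root).
before=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
sync                                             # flush dirty pages first
echo 3 2>/dev/null > /proc/sys/vm/drop_caches    # needs root; error suppressed otherwise
after=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo "cached: ${before} kB -> ${after} kB"
```

Note that dropping caches only releases memory the kernel would reclaim under pressure anyway; it does not fix a process (such as the NameNode JVM) that holds too much anonymous memory.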

To verify that the namenode (standby) hang was indeed caused by the NameNode node running out of memory, the developer increased the frequency of the MapReduce job runs. To simulate long-running conditions as quickly as possible, a job that previously ran once a day was changed to run once every 5 minutes.

After the job had run for 2 days, the NameNode host's historical memory usage trend graph from the cloud platform CAS monitor looked as follows:

The job run frequency was increased between 17:00 and 18:00 on the 22nd. Before 18:20 on the 22nd, memory utilization held at around 40%; from 18:20 to 19:10 it grew linearly to 70% and stayed at that level until 14:42 on the 23rd, after which it grew slowly, breaking the 80% threshold. Between 0:00 and 1:00 on the 24th, it reached a peak of 90%.

[Figure: NameNode host memory usage trend, 22nd-24th]

According to a StackOverflow description of this problem, the asker had 12 GB of physical memory and was advised to set the -Xmx value to 3/4 of physical memory. In our production environment, the NameNode has 8 GB of physical memory and the DataNodes have 125 GB.
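By that 3/4 rule, the 8 GB NameNode host would allow an -Xmx of up to 6 GB; the change recorded below settles on a more conservative 4 GB. A small sketch of computing the upper bound from the host's actual memory:

```shell
# Read total physical memory (kB) and print 3/4 of it in MB as the -Xmx bound.
phys_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
xmx_mb=$((phys_kb / 1024 * 3 / 4))
echo "suggested -Xmx upper bound: ${xmx_mb}m"
```

Leaving headroom below the 3/4 bound is deliberate: the OS page cache, DataNode-side processes, and the JVM's own native overhead (thread stacks, metaspace) all live outside -Xmx.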

https://stackoverflow.com/questions/9703436/hadoop-heap-space-and-gc-problems

2018-2-24 18:43, Hadoop production cluster change.
Executed the following on each of the 5 servers:
vim /app/hadoop/etc/hadoop/hadoop-env.sh
Add the following parameter:
export HADOOP_OPTS="-XX:+UseParallelGC -Xmx4g"
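Rather than editing each of the 5 servers by hand in vim, the line can be appended idempotently. A sketch, demonstrated on a temp file so it can be run safely; point ENV_FILE at the real /app/hadoop/etc/hadoop/hadoop-env.sh when applying the change:

```shell
# Idempotently add the JVM options to hadoop-env.sh (demo on a temp file;
# set ENV_FILE to /app/hadoop/etc/hadoop/hadoop-env.sh for the real change).
ENV_FILE=$(mktemp)
LINE='export HADOOP_OPTS="-XX:+UseParallelGC -Xmx4g"'
grep -qxF "$LINE" "$ENV_FILE" || printf '%s\n' "$LINE" >> "$ENV_FILE"
grep -qxF "$LINE" "$ENV_FILE" || printf '%s\n' "$LINE" >> "$ENV_FILE"  # re-run adds nothing
grep -cxF "$LINE" "$ENV_FILE"  # prints 1: the line was added exactly once
```

Note that HADOOP_OPTS is inherited by every Hadoop daemon started from this environment, so the -Xmx4g applies to the NameNode and DataNode processes alike after they are restarted.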

To facilitate future operations, the cluster restart procedure is recorded here.

Hadoop/Hive/HBase/ZooKeeper cluster restart procedure

# 1. Stop Hive: on namenode01, stop hiveserver2
lsof -i :9999 | grep -v "ID" | awk '{print "kill -9",$2}' | sh
# 2. Stop HBase: on namenode01
stop-hbase.sh
# 3. Stop Hadoop: on namenode01
stop-all.sh
# 4. Stop ZooKeeper: on the 3 datanode nodes
zkServer.sh stop
zkServer.sh status
# Manually free Linux system memory
sync
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
# 5. Start ZooKeeper: on the 3 datanode nodes
zkServer.sh start
zkServer.sh status
# 6. Start Hadoop: on namenode01
start-all.sh
# On namenode02, restart the NameNode
hadoop-daemon.sh stop namenode
hadoop-daemon.sh start namenode
# HDFS namenode01:9000 (Active) web UI:  http://172.31.132.71:50070/
# HDFS namenode02:9000 (Standby) web UI: http://172.31.132.72:50070/
# YARN web UI: http://172.31.132.71:8088/
# 7. Start HBase: on namenode01 and namenode02, execute start-hbase.sh respectively
start-hbase.sh
# Master web UI:        http://172.31.132.71:60010/
# Backup Master web UI: http://172.31.132.72:60010/
# RegionServer web UIs: http://172.31.132.73:60030/ http://172.31.132.74:60030/ http://172.31.132.75:60030/
# 8. Start Hive: on namenode01, start hiveserver2
hive --service hiveserver2 &
# On datanode01, start the metastore
hive --service metastore &
# On namenode01, start HWI (web interface)
hive --service hwi &
# HWI web UI: http://172.31.132.71:9999/hwi
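The kill pipeline in step 1 parses the PID column out of lsof's tabular output. An equivalent form using `lsof -t`, which prints bare PIDs and so avoids the awk parsing, might look like this (a sketch; assumes lsof is installed on namenode01):

```shell
# Kill whatever is listening on HiveServer2's port 9999 (run on namenode01).
pids=$(lsof -t -i :9999 2>/dev/null)
if [ -n "$pids" ]; then
  kill -9 $pids
else
  echo "nothing listening on :9999"
fi
```

The `-t` flag also drops lsof's header line, so the `grep -v "ID"` filter from the original pipeline is no longer needed.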

