A script-based workaround for DataNode failure caused by a single disk failure

Source: Internet
Author: User


In Hadoop 1.2.0, a single disk failure causes the whole DataNode to fail. Most DataNodes in a production environment have more than one disk, so we need a way to keep one failed disk from taking down the entire node.

Solutions and applicable scenarios:

1. Modify the Hadoop source code (beyond the author's ability).

2. Edit the data directory list in hdfs-site.xml, remove the mount point of the failed disk, and restart. (Recommended for manually deployed Hadoop environments; this method can also be scripted.)

3. Unmount the failed disk and temporarily write the data to the mount point directory on the root filesystem, without modifying the configuration file. (The author's environment uses a Hortonworks deployment of Hadoop whose configuration files are managed by Ambari; any manual change is synchronized back to the original configuration the next time Ambari is used ...)
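For the second option, the manual edit amounts to dropping one entry from a comma-separated directory list. The sketch below only manipulates such a list as a string. The function name `remove_data_dir` is hypothetical, the paths are the ones used in the script later in this article, and writing the result back into hdfs-site.xml and restarting the DataNode is left out.

```shell
#!/bin/bash
# Hypothetical helper (not part of the author's script): remove one directory
# from a comma-separated list of DataNode data directories.
remove_data_dir() {
    local list="$1"
    local failed="${2%/}"      # ignore a single trailing slash on the failed dir
    local -a kept=()
    local dir
    local IFS=','              # split and join on commas inside this function
    for dir in $list; do
        [ "${dir%/}" = "$failed" ] && continue
        kept+=("$dir")
    done
    echo "${kept[*]}"          # joined with the first character of IFS
}

# Example: drop the second disk from a two-disk list.
remove_data_dir "/storage/disk1/,/storage/disk2/" "/storage/disk2"
```

The comparison strips a trailing slash on both sides, so `/storage/disk2` and `/storage/disk2/` are treated as the same entry.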

Requirements and code implementation for the third solution


1. Only after a disk mount point has been detected as failed three consecutive times, unmount it and stop the DataNode. This avoids acting on a false positive.

2. After the failed disk has been unmounted and the mount point is detected as normal, start the DataNode.

3. While the DataNode is running again, Nagios must still raise an alarm for the failed disk to remind the administrator.

4. After repairing the disk, the administrator only needs to stop the DataNode manually and mount the repaired disk; the script automatically detects the disk mount point and starts the DataNode.

5. The system disk is assumed to have enough free capacity and to stay healthy, because if the system disk fails ... well, then ...
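The "three consecutive failures" rule in requirement 1 can be sketched as a small counter kept in a file. This is a simplified illustration, not the author's exact logic: the function names `record_failure` and `reset_failures` are hypothetical, and the explicit reset on success is an addition that makes the failures strictly consecutive.

```shell
#!/bin/bash
# record_failure FILE [THRESHOLD]
# Adds one failure to the counter stored in FILE. Returns 0 and resets the
# counter once THRESHOLD (default 3) failures have accumulated; returns 1
# while the threshold has not been reached yet.
record_failure() {
    local file="$1" threshold="${2:-3}" count
    [ -s "$file" ] || echo 0 > "$file"      # create or initialise the counter
    count=$(( $(cat "$file") + 1 ))
    if [ "$count" -ge "$threshold" ]; then
        echo 0 > "$file"
        return 0
    fi
    echo "$count" > "$file"
    return 1
}

# reset_failures FILE: call after a successful check so that only
# back-to-back failures count toward the threshold.
reset_failures() {
    echo 0 > "$1"
}
```

A caller would run `record_failure /tmp/count.log` on each failed check and unmount the disk only when it returns 0.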

Code implementation:

    #!/bin/bash
    # DataNode mount point list
    MPT[0]=/storage/disk1/
    MPT[1]=/storage/disk2/
    # check_readonly detects whether a DataNode mount point is OK.
    # On error it must output the keyword: ERROR DIRECTORY
    CRD=/opt/nagios-bin/check_readonly
    # Nagios passive monitoring script used to send local disk information to Nagios
    # (the script file name is cut off in the source)
    RWS=/opt/nagios-bin/
    # PID of the DataNode (the PID file name is cut off in the source)
    DPID=$(cat /var/run/hadoop/hdfs/)
    KD=$(ps aux | grep -v grep | grep "$DPID")
    HOSTNAME=$(uname -n)
    LIP=$(grep "$HOSTNAME" /etc/hosts | awk '{print $1}')
    # Counter file for consecutive failures
    CTF=/tmp/count.log
    [ ! -f "$CTF" ] && echo 0 > "$CTF"

    for i in "${MPT[@]}"; do
        CTE=$($CRD "$i" | grep 'ERROR DIRECTORY')
        NMP=$(df "$i" | /usr/bin/tail -1 | awk '{print $NF}')

        # Determine whether the DataNode needs to be started, and start it
        # (the launcher script name under /usr/lib/hadoop/bin/ is cut off in the source)
        if [[ -z "$CTE" && -z "$KD" ]]; then
            chmod 777 "$i" &>/dev/null
            rm -rf "$i"/* &>/dev/null
            su - hdfs -c "/usr/lib/hadoop/bin/ start datanode" &>/dev/null
        fi

        # If the mount point is normal, reset the state that the Nagios script reports
        if [[ -z "$CTE" && "$NMP" != "/" ]]; then
            if grep "chkrw_state=2" "$RWS" &>/dev/null; then
                sed -i '/chkrw_state=2/d' "$RWS"
            fi
            continue
        fi

        # The mount point has failed, or the path now resolves to the root filesystem
        if [[ -n "$CTE" || "$NMP" == "/" ]]; then
            DNP=$(ps aux | grep -v grep | grep datanode | /usr/bin/wc -l)
            # Directory is writable, the DataNode is running, and the path is on
            # the root filesystem: nothing to do for this mount point
            [[ -z "$CTE" && "$NMP" == "/" && "$DNP" -gt 0 ]] && continue

            # Counter: +1 each time this step is reached; act on the third time
            /usr/bin/expr $(cat "$CTF") + 1 > "$CTF"
            if [ "$(cat "$CTF")" -ge 3 ]; then
                # Reset the counter and make the Nagios script report the failure
                echo 0 > "$CTF"
                if ! grep "chkrw_state=2" "$RWS" &>/dev/null; then
                    sed -i '/chkrw_state=$?/achkrw_state=2' "$RWS"
                fi

                # Kill processes that may hold the failed mount point, then unmount it
                PIDL=($(/usr/sbin/lsof "$i" | awk '{print $2}' | grep -v PID))
                for p in "${PIDL[@]}"; do
                    [ -z "$p" ] && continue
                    kill -9 "$p"
                done
                /bin/umount "$i" &>/dev/null
                chmod -R 777 "$i"

                # Assumption: re-read the mount point after the unmount so that the
                # check below reflects its result (the original reuses the earlier value)
                NMP=$(df "$i" | /usr/bin/tail -1 | awk '{print $NF}')
                if [[ "$NMP" == "/" ]]; then
                    # (the launcher script name is cut off in the source)
                    su - hdfs -c "/usr/lib/hadoop/bin/hado stop datanode" &>/dev/null
                    [ -n "$KD" ] && kill -9 "$DPID" &>/dev/null
                else
                    echo "$(/bin/date +%k:%M:%S/%Y-%m-%d) $i umount fail." >> /tmp/check_mp.log
                fi
            fi
        fi
    done

    # Send local disk information to the Nagios server
    $RWS "$LIP" "${MPT[@]}"
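The script decides whether a data directory has fallen back to the system disk by reading the last field of the `df` output for that path. The sketch below isolates that check. The function names `mount_point_of` and `on_root_fs` are hypothetical, and `df -P` is used here (an addition) so that each filesystem is printed on one line.

```shell
#!/bin/bash
# mount_point_of PATH
# Prints the mount point of the filesystem that holds PATH (the last field
# of the last df line). Prints nothing if df cannot stat the path.
# Limitation: a mount point containing spaces is not handled.
mount_point_of() {
    df -P "$1" 2>/dev/null | tail -1 | awk '{print $NF}'
}

# on_root_fs PATH
# Succeeds when PATH lives on the root filesystem, which is the case for a
# data directory such as /storage/disk1/ once its disk has been unmounted.
on_root_fs() {
    [ "$(mount_point_of "$1")" = "/" ]
}
```

For example, `on_root_fs /storage/disk1/ && echo "disk1 is not mounted"` would report a data directory that is currently being served from the system disk.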


The Nagios side is not the focus of this article, so the corresponding scripts are not reproduced here. If you need them, please contact the author.
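Since check_readonly itself is not shown, here is a minimal stand-in that honors the one contract the main script relies on: print the keyword `ERROR DIRECTORY` when the directory is unusable. This is an assumption about how such a check could work, not the author's script; the function name `check_dir_writable` and the probe file name are hypothetical.

```shell
#!/bin/bash
# check_dir_writable DIR
# Tries to create and delete a small probe file inside DIR. Creating the
# file fails when DIR is missing, mounted read-only, or otherwise unwritable.
# On failure it prints a line containing "ERROR DIRECTORY" and returns 1.
check_dir_writable() {
    local dir="$1" probe
    if probe=$(mktemp "${dir%/}/.rw_probe.XXXXXX" 2>/dev/null); then
        rm -f "$probe"
        return 0
    fi
    echo "ERROR DIRECTORY: $dir"
    return 1
}
```

A directory that has fallen back to the root filesystem is still writable, so this probe reports it as healthy; the main script catches that case separately through its mount point comparison.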

Another issue to consider is data migration. After the DataNode has been writing to the root filesystem for a while and the failed disk is repaired or replaced and mounted again, what should happen to the data written in the meantime? The author tried synchronizing the data under the mount point on the system disk to the new disk (an earlier version of the script did this), which is feasible as long as it completes in a short time. With a large amount of data (for example 500 GB or even 1 TB to sync), it cannot finish quickly. In addition, when the DataNode was started after the synchronization, the log showed an error about an inconsistent data node registration location, although the DataNode process did not exit. (In fact, the same error appears if you do nothing at all except stop the DataNode for a while and start it again.) Because the consequences of this approach cannot be determined for now, the script in this article, which is the one used in the author's production environment, empties the mount point directory first. As a result, the log shows HDFS performing a large number of data synchronization operations after startup.

Add the script to a scheduled task (cron) and run it as often as you need.
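As a rough illustration of the scheduling step, the helper below prints a crontab line that runs a script every N minutes. The function name `cron_entry` and the script path `/opt/nagios-bin/check_mp.sh` are hypothetical; the article does not say where the script is saved or how often the author runs it.

```shell
#!/bin/bash
# cron_entry MINUTES SCRIPT
# Prints a crontab line that runs SCRIPT every MINUTES minutes and discards
# its output.
cron_entry() {
    local minutes="$1" script="$2"
    printf '*/%s * * * * %s >/dev/null 2>&1\n' "$minutes" "$script"
}

# Example (not executed here): append the entry to the current user's crontab.
# ( crontab -l 2>/dev/null; cron_entry 5 /opt/nagios-bin/check_mp.sh ) | crontab -
```

Because the script acts only on the third detection, the interval also sets the reaction time: at a 5-minute interval, a failed disk is unmounted roughly 15 minutes after the first failed check.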

Writing this up was not easy. If anything here is wrong, corrections are welcome; if you found it useful, please give it a like.

This article is from the "Self-Improvement" blog; please be sure to retain this source.
