Hadoop Cluster Routine O&M
(1) Back up namenode metadata
The metadata on the namenode is critical: if it is lost or damaged, the entire filesystem becomes unusable. Back it up frequently, preferably to a remote machine.
1. Copy metadata to a remote site
(1) The following script copies the secondary namenode's metadata into a directory named after the current timestamp, and then sends it to another machine with scp.
#!/bin/bash
export dirname=/mnt/tmphadoop/dfs/namesecondary/current/`date +%y%m%d%H`
if [ ! -d ${dirname} ]
then
    mkdir ${dirname}
    cp /mnt/tmphadoop/dfs/namesecondary/current/*  ${dirname}
fi
scp -r ${dirname} slave1:/mnt/namenode_backup/
rm -r ${dirname}

(2) Configure crontab and execute this task regularly.
0 * * * * bash /mnt/scripts/namenode_backup_script.sh
2. Start a local namenode daemon on the remote site and try to load the backed-up files to verify that the backup was performed correctly.
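One way to perform this check is to import the backed-up checkpoint into a throwaway namenode using the -importCheckpoint startup option. The sketch below is only an outline; the directory layout is an assumption for illustration, and it presumes the backup machine has a Hadoop installation with its own hdfs-site.xml.

# On the backup machine (paths and settings are assumptions for illustration):
#   dfs.name.dir      -> an empty scratch directory
#   fs.checkpoint.dir -> the directory holding the backed-up checkpoint,
#                        laid out the way the secondary namenode writes it
hadoop namenode -importCheckpoint
# The namenode reports the number of files and blocks it loaded; if the image
# or edits are corrupt it fails with an exception. Stop it with Ctrl-C afterwards.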
(2) Data Backup
Important data must be backed up rather than relying entirely on HDFS replication. Note the following:
(1) Back up to a remote site whenever possible.
(2) If you use distcp to back up data to another HDFS cluster, preferably run a different Hadoop version on the backup cluster, so that a bug in Hadoop itself cannot corrupt both copies.
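For reference, a distcp backup between two clusters might look like the sketch below; the hostnames, ports, and paths are placeholders, not values from this cluster. When the clusters run different Hadoop versions, the job is typically run on the destination cluster and reads the source over the read-only HFTP interface.

# Same Hadoop version on both clusters (hostnames/ports are assumptions):
hadoop distcp hdfs://master:9000/important/data hdfs://backup-master:9000/backup/data

# Different versions: run on the destination cluster and read the source via HFTP,
# which is served on the source namenode's web port:
hadoop distcp hftp://master:50070/important/data hdfs://backup-master:9000/backup/data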
(3) File System check
Run HDFS's fsck tool over the entire filesystem on a regular basis to proactively look for missing or corrupt blocks.
We recommend that you run the command once a day.
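For example, a crontab entry along these lines could schedule a nightly check (the hadoop binary path and log path are assumptions; adjust them to your installation):

0 2 * * * /usr/bin/hadoop fsck / >> /mnt/logs/hdfs_fsck.log 2>&1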
[jediael@master ~]$ hadoop fsck /
.............................
(output omitted: one "." is printed per file checked; any errors are printed in full)
Status: HEALTHY
 Total size:    14466494870 B
 Total dirs:    502
 Total files:   1592 (Files currently being written: 2)
 Total blocks (validated):      1725 (avg. block size 8386373 B)
 Minimally replicated blocks:   1725 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       648 (37.565216 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              760 (22.028986 %)
 Number of data-nodes:          2
 Number of racks:               1
FSCK ended at Sun Mar 01 20:17:57 CST 2015 in 608 milliseconds

The filesystem under path '/' is HEALTHY
(1) If dfs.replication in hdfs-site.xml is set to 3 but there are only two datanodes, fsck reports errors like the following:
/hbase/mar01__webpage/logs/il/logs: Under replicated blk_-4711857142889323098_6221. Target Replicas is 3 but found 2 replica(s).
Note: dfs.replication was originally 3; later one datanode was decommissioned and dfs.replication was changed to 2, but the files that already existed still record a replication factor of 3. This is why the error above is reported and why the summary shows "Under-replicated blocks: 648 (37.565216 %)". It can be cleared by lowering the replication factor recorded on the existing files, as shown below.
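A sketch of one way to clear these warnings, assuming you really do want the existing files to carry a replication factor of 2 (the path and factor are illustrative):

# Recursively set the replication factor of existing files to 2;
# -w waits until the new factor has actually been applied to every block.
hadoop fs -setrep -R -w 2 /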
(2) The fsck tool can also be used to check which blocks are included in a file and where the blocks are located.
[jediael@master conf]$ hadoop fsck /hbase/Feb2621_webpage/c23aa183c7cb86af27f15d4c2aee2795/s/30bee5fb620b4cd184412c69f70d24a7 -files -blocks -racks
FSCK started by jediael from /10.171.29.191 for path /hbase/Feb2621_webpage/c23aa183c7cb86af27f15d4c2aee2795/s/30bee5fb620b4cd184412c69f70d24a7 at Sun Mar 01 20:39:35 CST 2015
/hbase/Feb2621_webpage/c23aa183c7cb86af27f15d4c2aee2795/s/30bee5fb620b4cd184412c69f70d24a7 21507169 bytes, 1 block(s):  Under replicated blk_7117944555454804881_3655. Target Replicas is 3 but found 2 replica(s).
0. blk_7117944555454804881_3655 len=21507169 repl=2 [/default-rack/10.171.94.155:50010, /default-rack/10.251.0.197:50010]

Status: HEALTHY
 Total size:    21507169 B
 Total dirs:    0
 Total files:   1
 Total blocks (validated):      1 (avg. block size 21507169 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       1 (100.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              1 (50.0 %)
 Number of data-nodes:          2
 Number of racks:               1
FSCK ended at Sun Mar 01 20:39:35 CST 2015 in 0 milliseconds

The filesystem under path '/hbase/Feb2621_webpage/c23aa183c7cb86af27f15d4c2aee2795/s/30bee5fb620b4cd184412c69f70d24a7' is HEALTHY
The command is used as follows:
[jediael@master ~]$ hadoop fsck -files
Usage: DFSck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
        <path>          start checking from this path
        -move           move corrupted files to /lost+found
        -delete         delete corrupted files
        -files          print out files being checked
        -openforwrite   print out files opened for write
        -blocks         print out block report
        -locations      print out locations for every block
        -racks          print out network topology for data-node locations
        By default fsck ignores files opened for write, use -openforwrite to report such files. They are usually tagged CORRUPT or HEALTHY depending on their block allocation status

Generic options supported are
-conf <configuration file>                       specify an application configuration file
-D <property=value>                              use value for given property
-fs <local|namenode:port>                        specify a namenode
-jt <local|jobtracker:port>                      specify a job tracker
-files <comma separated list of files>           specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>          specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>     specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
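As an illustration of the -move and -delete options listed above (a sketch; use them only after confirming the damage with a plain fsck run):

# Move files that have corrupt or missing blocks into /lost+found, keeping the healthy blocks:
hadoop fsck / -move
# Or remove the corrupted files entirely:
hadoop fsck / -delete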
For more information, see Hadoop: The Definitive Guide, p. 376.
(4) balancer
Over time, the block distribution across datanodes becomes increasingly unbalanced, which hurts MapReduce data locality and makes some datanodes busier than others.
The balancer is a Hadoop daemon that moves blocks from busy datanodes to idle ones, while still respecting the block replica placement policy (replicas spread across different machines and racks).
We recommend that you periodically execute the balancer, such as daily or weekly.
(1) Run the following command:
[jediael@master log]$ start-balancer.sh
starting balancer, logging to /var/log/hadoop/hadoop-jediael-balancer-master.out
View the log as follows:
[jediael@master hadoop]$ pwd
/var/log/hadoop
[jediael@master hadoop]$ ls
hadoop-jediael-balancer-master.log  hadoop-jediael-balancer-master.out
[jediael@master hadoop]$ cat hadoop-jediael-balancer-master.log
2015-03-01 21:08:08,027 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.251.0.197:50010
2015-03-01 21:08:08,028 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.171.94.155:50010
2015-03-01 21:08:08,028 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 over utilized nodes:
2015-03-01 21:08:08,028 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 under utilized nodes:
(2) The balancer moves blocks until the usage of each datanode is close to the usage of the cluster as a whole. This "closeness" is specified with the -threshold parameter, and the default value is 10% (see the example after item (3) below).
(3) The bandwidth used to copy data between nodes is limited to 1 MB/s by default; it can be changed through the dfs.balance.bandwidthPerSec property in hdfs-site.xml (in bytes per second).
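A sketch of both settings follows; the 5% threshold and 10 MB/s cap are illustrative values, not recommendations from the original text.

# Run the balancer so that every datanode ends up within 5% of the cluster average:
start-balancer.sh -threshold 5

And in hdfs-site.xml on the datanodes (value in bytes per second):

<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <value>10485760</value>  <!-- 10 MB/s instead of the default 1 MB/s -->
</property>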
(5) Datanode block scanner
Each datanode runs a block scanner, which periodically verifies all blocks stored on the node. If a bad block is found (for example, a checksum error), it is reported to the namenode, which then has the block re-replicated from a good copy and the corrupt replica removed.
The scan period is specified by dfs.datanode.scan.period.hours; the default is three weeks (504 hours).
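If you want a different period, a property sketch like the following in hdfs-site.xml would do it (the two-week value is purely illustrative):

<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>336</value>  <!-- scan each block roughly every two weeks instead of the 504-hour default -->
</property>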
View scan information at the address below:
(1) http://datanode:50075/blockScannerReport
Lists the overall scan status:
Total Blocks                 : 1919
Verified in last hour        : 4
Verified in last day         : 170
Verified in last week        : 535
Verified in last four weeks  : 535
Verified in SCAN_PERIOD      : 535
Not yet verified             : 1384
Verified since restart       : 559
Scans since restart          : 91
Scan errors since restart    : 0
Transient scan errors        : 0
Current scan rate limit KBps : 1024
Progress this period         : 113%
Time left in cur period      : 97.14%
(2) http://123.56.92.95:50075/blockScannerReport?listblocks
Lists every block and its most recent verification status:
blk_8482244195562050998_3796 : status : OK   type : none   scan time : 0   not yet verified
blk_3985450615149803606_7952 : status : OK   type : none   scan time : 0   not yet verified
The two blocks above have not yet been verified. For more information about each field, see Hadoop: The Definitive Guide, p. 379.