When using Flume, network problems, HDFS hiccups, and similar causes can leave anomalous files among the logs Flume writes to HDFS. They show up in two ways:
1. Files that were never closed: files ending in .tmp (the default in-progress suffix). A finished file on HDFS should be a .gz archive; a file still carrying the .tmp suffix cannot be used.
2. Zero-size files, e.g. a .gz file whose size is 0. Decompressing such a file by hand just loops forever, so it cannot be fed directly to MapReduce.
So far only these two cases have been observed; why they happen is still unclear. Both break normal Hive and MapReduce runs: case 2 fails the job outright, and case 1 may lose the corresponding data.
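Before treating a compressed file as case 2, you can confirm that a zero-byte .gz really is unusable: gzip's built-in integrity test rejects it. A minimal local sketch (the file name is made up; assumes gzip is installed):

```shell
#!/bin/sh
# Simulate case 2: a zero-byte "compressed" file like the ones left on HDFS.
workdir=$(mktemp -d)
: > "${workdir}/part-0000.gz"        # zero-byte file, hypothetical name

# gzip -t checks archive integrity without extracting anything.
if gzip -t "${workdir}/part-0000.gz" 2>/dev/null; then
    status=valid
else
    status=corrupt                   # a zero-byte .gz fails the test
fi
echo "${status}"

rm -rf "${workdir}"
```

The same check cannot run in place on HDFS, which is why the script below simply keys off the reported file size.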
Case 2 files can simply be deleted; for case 1 we found that just removing the .tmp suffix is enough. So I wrote a shell script, run on a schedule, that scans the HDFS files, strips the .tmp suffix for case 1, and deletes the files for case 2. The script is as follows:
#!/bin/sh

cd `dirname $0`

date=`date -d "1 day ago" +%Y/%m/%d`
echo "date is ${date}"
hadoop_home=/usr/lib/hadoop-0.20-mapreduce
datadir=/data/*/
echo "dir is ${datadir}"
echo "Checking whether the HDFS files are correct ..."

IFS=$'\n'
for name in `${hadoop_home}/bin/hadoop fs -ls ${datadir}${date}`
do
    # hadoop fs -ls columns: perms repl owner group size date time path
    size=`echo "${name}" | awk '{print $5}'`
    fileallname=`echo "${name}" | awk '{print $8}'`
    filenamenotmp=${fileallname%.tmp*}   # path with the .tmp suffix stripped
    tmp=${fileallname#*.gz}              # whatever follows ".gz" (".tmp" for unclosed files)
    if [ "${size}" = "0" ]; then
        echo "${fileallname}'s size is ${size} ..... delete it!"
        ${hadoop_home}/bin/hadoop fs -rmr ${fileallname}
    fi
    if [ "${tmp}" = ".tmp" ]; then
        ${hadoop_home}/bin/hadoop fs -mv ${fileallname} ${filenamenotmp}
        echo "${fileallname} has been renamed to ${filenamenotmp} ..."
    fi
done
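The suffix handling in the script uses shell parameter expansion rather than external tools. A standalone sketch, with a made-up path, shows what the two expansions produce:

```shell
#!/bin/sh
# Hypothetical unclosed file, as Flume would leave it on HDFS.
fileallname="/data/app1/2014/12/08/events.gz.tmp"

filenamenotmp=${fileallname%.tmp*}   # strip the shortest trailing ".tmp*" match
tmp=${fileallname#*.gz}              # drop everything up to and including ".gz"

echo "${filenamenotmp}"              # /data/app1/2014/12/08/events.gz
echo "${tmp}"                        # .tmp -> the file still needs renaming
```

For a properly closed file such as events.gz, the second expansion yields an empty string, so the rename branch is skipped.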
Note: the datadir=/data/*/ line works because HDFS paths support wildcards; with the date appended, the directory scanned here is /data/*/2014/12/08. Adjust it to match your own layout.
You can use crontab to run the check on a schedule.
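For example, a crontab entry like the following runs the check once a day, after the previous day's directory is complete (the script and log paths are assumptions, not from the original post):

```shell
# Run the HDFS file check daily at 01:30; append output to a log.
30 1 * * * /opt/scripts/check_hdfs_files.sh >> /var/log/check_hdfs_files.log 2>&1
```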
In short: a shell script that monitors the validity of the files Flume writes to HDFS.