Online websites generate log data every day. Suppose there is a requirement: starting at 24:00 (midnight) each day, the log files generated during the previous day must be uploaded to the HDFS cluster in near real time.
How can this be implemented? Once implemented, how can the upload be made to recur periodically? How is it scheduled?
Linux crontab:
crontab -e
0 0 * * * /shell/uploadfile2hdfs.sh    # every day at 12 midnight (00:00)
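To confirm that the job has been registered, the current user's cron table can be listed. As a minimal sketch, the entry can also be extended to append the script's output to a log file for troubleshooting; the log path below is an assumption made for illustration and is not part of the original setup.
crontab -l    # list the current user's cron entries
# variant of the entry that also captures stdout/stderr (log path is assumed):
#   0 0 * * * /shell/uploadfile2hdfs.sh >> /root/logs/upload_cron.log 2>&1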
Implementation process
How log files are rolled is generally determined by the business system, for example rolling once per hour or rolling once a certain file size is reached, so that no single log file grows too large and becomes inconvenient to work with.
For example, rolled files are named access.log.x, where x is a number, and the log file currently being written is named access.log. In that case, any file whose suffix is a number such as 1, 2, 3 is complete and satisfies the upload requirement; such files are moved to a staging (to-upload) working directory. Once files appear in that directory, they can be uploaded with the hadoop fs -put command, as sketched below.
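The selection rule can be sketched with a small shell fragment: only rolled files with a numeric suffix are moved to the staging directory, while access.log, which is still being written, is left untouched. The directory names match the ones created in the next step; this is an illustrative sketch, not the final script.
# move only completed (rolled) log files to the staging directory
for f in /root/logs/log/access.log.[0-9]*; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    mv "$f" /root/logs/toupload/     # access.log itself never matches this pattern
done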
Create directories on the server
# directory where the log files are stored
mkdir -p /root/logs/log/
# directory where files waiting to be uploaded are stored
mkdir -p /root/logs/toupload/
Write the shell script
vi uploadfile2hdfs.sh
#!/bin/bash

#set java env
export JAVA_HOME=/export/servers/jdk1.8.0_65
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH

#set hadoop env
export HADOOP_HOME=/export/servers/hadoop-2.7.4
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

# directory where the log files are stored
log_src_dir=/root/logs/log/

# directory where files waiting to be uploaded are stored
log_toupload_dir=/root/logs/toupload/

# root path on HDFS that the log files are uploaded to
date1=`date -d last-day +%Y_%m_%d`
hdfs_root_dir=/data/clickLog/$date1/

# print environment variable information
echo "envs: hadoop_home: $HADOOP_HOME"

# read the log directory and determine whether there are files that need to be uploaded
echo "log_src_dir: $log_src_dir"
ls $log_src_dir | while read fileName
do
    if [[ "$fileName" == access.log.* ]]; then
        # if [ "access.log" = "$fileName" ]; then
        date=`date +%Y_%m_%d_%H_%M_%S`
        # move the file to the to-upload (staging) directory and rename it
        # print information
        echo "moving $log_src_dir$fileName to $log_toupload_dir"xxxxx_click_log_$fileName"$date"
        mv $log_src_dir$fileName $log_toupload_dir"xxxxx_click_log_$fileName"$date
        # append the path of the file to be uploaded to a list file named willDoing
        echo $log_toupload_dir"xxxxx_click_log_$fileName"$date >> $log_toupload_dir"willDoing."$date
    fi
done

# find the list files named willDoing
ls $log_toupload_dir | grep will | grep -v "_COPY_" | grep -v "_DONE_" | while read line
do
    # print information
    echo "toupload is in file: "$line
    # rename the upload list file willDoing to willDoing_COPY_
    mv $log_toupload_dir$line $log_toupload_dir$line"_COPY_"
    # read the contents of the list file willDoing_COPY_ (one file path per line);
    # here line is the path of a single file to be uploaded
    cat $log_toupload_dir$line"_COPY_" | while read line
    do
        # print information
        echo "puting...$line to hdfs path.....$hdfs_root_dir"
        hadoop fs -mkdir -p $hdfs_root_dir
        hadoop fs -put $line $hdfs_root_dir
    done
    mv $log_toupload_dir$line"_COPY_" $log_toupload_dir$line"_DONE_"
done
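Before wiring the script into cron, it can be checked and exercised by hand; bash -n only parses the script and reports syntax errors, while bash -x traces each command as it runs.
bash -n uploadfile2hdfs.sh    # parse only, report syntax errors
bash -x uploadfile2hdfs.sh    # run once with command tracing for debugging
Note the renaming pattern in the script: a willDoing list file is renamed to ..._COPY_ while the files it lists are being uploaded, and to ..._DONE_ once the batch is finished, so a later run can tell which batches are in progress and which are already uploaded.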
Set execution permissions
chmod 777 uploadfile2hdfs.sh
Add a few test files in /root/logs/log/, then execute the script.
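A handful of rolled log files can be created by hand for a quick test; the file contents below are made up purely for illustration.
for i in 1 2 3; do
    echo "sample log line $i" > /root/logs/log/access.log.$i
done
Then run the script: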
./uploadfile2hdfs.sh
Observe the results in /root/logs/toupload/ and in the HDFS web UI.
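The same check can be done from the command line; the date component of the HDFS path depends on the day the script ran, so the listing starts one level above it.
ls /root/logs/toupload/           # list files should now end in _DONE_
hadoop fs -ls -R /data/clickLog/  # uploaded log files, grouped by date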
With the crontab entry in place, the shell script collects data to HDFS at timed intervals.