Optimization of the HDFS Small File Merging Problem: An Improvement of copyMerge

Source: Internet
Author: User

1. Problem analysis

Use the fsck command to check the total size, block count, and average block size of one day's logs in HDFS:

[hduser@da-master jar]$ hadoop fsck /wcc/da/kafka/report/2015-01-11
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/01/13 18:57:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://da-master:50070
FSCK started by hduser (auth:SIMPLE) from /172.21.101.30 for path /wcc/da/kafka/report/2015-01-11 at Tue Jan 13 18:57:24 CST 2015
...........................................................................
Status: HEALTHY
 Total size:                  9562516137 B
 Total dirs:                  1
 Total files:                 240
 Total symlinks:              0
 Total blocks (validated):    264 (avg. block size 36221652 B)
 Minimally replicated blocks: 264 (100.0 %)
 Over-replicated blocks:      0 (0.0 %)
 Under-replicated blocks:     0 (0.0 %)
 Mis-replicated blocks:       0 (0.0 %)
 Default replication factor:  2
 Average block replication:   2.0
 Corrupt blocks:              0
 Missing replicas:            0 (0.0 %)
 Number of data-nodes:        5
 Number of racks:             1
FSCK ended at Tue Jan 13 18:57:24 CST in ... milliseconds

The filesystem under path '/wcc/da/kafka/report/2015-01-11' is HEALTHY

The per-day results are summarized in the following table:

Date         Total (GB)   Total Blocks   Avg Block Size (MB)
2014/12/21   9.39         268            36
2014/12/20   9.5          268            36
2014/12/19   8.89         268            34
2014/11/5    8.6          266            33
2014/10/1    9.31         268            36
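These per-day figures can also be gathered programmatically instead of by running fsck for each date and reading its output by hand. Below is a minimal sketch (not from the original article; the class name and path are illustrative, and it assumes the per-day directory is flat, as the fsck output above suggests with "Total dirs: 1") that walks one day's directory with the FileSystem API and reports total size, block count, and average block size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockStats {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dayDir = new Path("/wcc/da/kafka/report/2015-01-11");  // one day's log directory

        long totalSize = 0, totalBlocks = 0, totalFiles = 0;
        for (FileStatus status : fs.listStatus(dayDir)) {
            if (!status.isFile()) {
                continue;
            }
            totalFiles++;
            totalSize += status.getLen();
            // Count the blocks actually used by this file
            totalBlocks += fs.getFileBlockLocations(status, 0, status.getLen()).length;
        }

        System.out.println("Files:  " + totalFiles);
        System.out.println("Size:   " + totalSize + " B");
        System.out.println("Blocks: " + totalBlocks);
        if (totalBlocks > 0) {
            System.out.println("Avg block size: " + (totalSize / totalBlocks / (1024 * 1024)) + " MB");
        }
    }
}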


Analysis of the problem: the table shows that each day's logs occupy about 268 blocks with an average block size of roughly 36 MB, far below the 128 MB block size. This points to an underlying problem: the logs produce a large number of small files, most of them well under 128 MB, which seriously hurts the scalability and performance of the cluster.

First, in HDFS every block, file, and directory is held in NameNode memory as an object of about 150 bytes. If there are 10,000,000 small files and each file occupies its own block, the NameNode needs roughly 2 GB of memory (10,000,000 objects at 150 bytes each is already about 1.5 GB); with 100 million files it needs about 20 GB. NameNode memory capacity therefore severely restricts how far the cluster can grow.

Second, accessing a large number of small files is much slower than accessing a few large files of the same total size. HDFS was designed for streaming access to large files; reading many small files forces the client to keep jumping from one DataNode to another, which severely impacts performance.

Finally, processing a large number of small files is much slower than processing large files of equal total size, because each small file occupies its own task slot. Starting and releasing tasks can take a large share, even most, of the total running time, and that overhead accumulates.

Our strategy is therefore to merge the small files first, e.g. into user_report.tsv, client_report.tsv, and applog_userdevice.tsv, and then run the job.


2. Solution

One candidate is the copyMerge utility method in the Hadoop API's FileUtil class:

copyMerge(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String addString);
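For reference, here is a minimal sketch (not from the original article) of how copyMerge would be invoked to merge one day's directory into a single file; the destination path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyMergeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path("/wcc/da/kafka/report/2015-01-11");       // one day's log directory
        Path dstFile = new Path("/wcc/da/kafka/merged/2015-01-11.tsv");  // hypothetical merged output
        boolean ok = FileUtil.copyMerge(fs, srcDir, fs, dstFile,
                false,   // keep the source files
                conf,
                null);   // no extra string appended after each file
        System.out.println("copyMerge returned " + ok);
    }
}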

However, this method does not fit our case, because each day's directory contains three types of logs that need to be merged separately into three files: user_report.tsv, client_report.tsv, and applog_userdevice.tsv. We therefore need to re-implement the copyMerge logic ourselves. First, let's analyze the copyMerge source:

/** Copy all files in a directory to one output file (merge). */
public static boolean copyMerge(FileSystem srcFS, Path srcDir,
                                FileSystem dstFS, Path dstFile,
                                boolean deleteSource,
                                Configuration conf, String addString) throws IOException {
  // Generate the merged target path dstFile; the file name is srcDir.getName(), i.e. the
  // name of the source directory, so the merged file name cannot be customized here -- very inconvenient.
  dstFile = checkDest(srcDir.getName(), dstFS, dstFile, false);

  // Make sure the source path is a directory
  if (!srcFS.getFileStatus(srcDir).isDirectory())
    return false;

  // Create the output stream for the target file
  OutputStream out = dstFS.create(dstFile);

  try {
    // List every file under the source directory
    FileStatus contents[] = srcFS.listStatus(srcDir);
    // Sort the entries
    Arrays.sort(contents);
    for (int i = 0; i < contents.length; i++) {
      if (contents[i].isFile()) {
        // Open an input stream on the source file
        InputStream in = srcFS.open(contents[i].getPath());
        try {
          // Copy its bytes into the target file
          IOUtils.copyBytes(in, out, conf, false);
          if (addString != null)
            out.write(addString.getBytes("UTF-8"));
        } finally {
          in.close();
        }
      }
    }
  } finally {
    out.close();
  }

  // If deleteSource is true, delete the source directory and everything under it
  if (deleteSource) {
    return srcFS.delete(srcDir, true);
  } else {
    return true;
  }
}

After the improvement (this version only needs to open and close the output stream three times, once per merged target file):

/** Copy corresponding files in a directory to related output files (merge). */
@SuppressWarnings("unchecked")
public static boolean merge(FileSystem hdfs, Path srcDir, Path dstFile,
                            boolean deleteSource, Configuration conf) throws IOException {
  if (!hdfs.getFileStatus(srcDir).isDirectory())
    return false;

  // List every file under the source directory
  FileStatus[] fileStatus = hdfs.listStatus(srcDir);

  // Collect the paths of the three different log types into separate lists
  ArrayList<Path> userPaths = new ArrayList<Path>();
  ArrayList<Path> clientPaths = new ArrayList<Path>();
  ArrayList<Path> appPaths = new ArrayList<Path>();
  for (FileStatus fileStatu : fileStatus) {
    Path filePath = fileStatu.getPath();
    if (filePath.getName().startsWith("user_report")) {
      userPaths.add(filePath);
    } else if (filePath.getName().startsWith("client_report")) {
      clientPaths.add(filePath);
    } else if (filePath.getName().startsWith("applog_userdevice")) {
      appPaths.add(filePath);
    }
  }

  // Write the merged target file user_report.tsv
  if (userPaths.size() > 0) {
    Path userDstFile = new Path(dstFile.toString() + "/user_report.tsv");
    OutputStream out = hdfs.create(userDstFile);
    Collections.sort(userPaths);
    try {
      Iterator<Path> iterator = userPaths.iterator();
      while (iterator.hasNext()) {
        InputStream in = hdfs.open(iterator.next());
        try {
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }

  // Write the merged target file client_report.tsv
  if (clientPaths.size() > 0) {
    Path clientDstFile = new Path(dstFile.toString() + "/client_report.tsv");
    OutputStream out = hdfs.create(clientDstFile);
    Collections.sort(clientPaths);
    try {
      Iterator<Path> iterator = clientPaths.iterator();
      while (iterator.hasNext()) {
        InputStream in = hdfs.open(iterator.next());
        try {
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }

  // Write the merged target file applog_userdevice.tsv
  if (appPaths.size() > 0) {
    Path appDstFile = new Path(dstFile.toString() + "/applog_userdevice.tsv");
    OutputStream out = hdfs.create(appDstFile);
    Collections.sort(appPaths);
    try {
      Iterator<Path> iterator = appPaths.iterator();
      while (iterator.hasNext()) {
        InputStream in = hdfs.open(iterator.next());
        try {
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }

  if (deleteSource) {
    return hdfs.delete(srcDir, true);
  }
  return true;
}
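A hypothetical invocation of the merge() helper above, for example in a driver right before submitting the MapReduce job; the destination directory is illustrative, and merge() is assumed to be defined in, or copied into, the same class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        // Source: one day's raw log directory; destination: a directory that will receive
        // the merged user_report.tsv, client_report.tsv and applog_userdevice.tsv.
        Path srcDir = new Path("/wcc/da/kafka/report/2015-01-11");
        Path dstDir = new Path("/wcc/da/kafka/merged/2015-01-11");
        // merge() is the helper shown above (assumed to live in this class)
        boolean merged = merge(hdfs, srcDir, dstDir, false, conf);  // false = keep the source files
        System.out.println("merge returned " + merged);
        // ...submit the MapReduce job over dstDir here...
    }
}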

Of course, you can also achieve the same result like this (this version opens and closes an output stream once per input file, appending to the corresponding target as it goes):

public static boolean mergeFiles(FileSystem hdfs, Path srcDir, Path dstFile,
                                 boolean deleteSource, Configuration conf) throws IOException {
  if (!hdfs.getFileStatus(srcDir).isDirectory())
    return false;

  // List every file under the source directory
  FileStatus[] fileStatus = hdfs.listStatus(srcDir);

  // Merge each of the three different log types into its own target file
  for (FileStatus fileStatu : fileStatus) {
    Path filePath = fileStatu.getPath();
    Path dstPath;
    if (filePath.getName().startsWith("user_report")) {
      dstPath = new Path(dstFile.toString() + "/user_report.tsv");
    } else if (filePath.getName().startsWith("client_report")) {
      dstPath = new Path(dstFile.toString() + "/client_report.tsv");
    } else if (filePath.getName().startsWith("applog_userdevice")) {
      dstPath = new Path(dstFile.toString() + "/applog_userdevice.tsv");
    } else {
      dstPath = new Path("/error.tsv");
    }
    // Append when the target already exists so that files merged earlier in this loop
    // are not lost (hdfs.create() would truncate the target on every iteration).
    OutputStream out = hdfs.exists(dstPath) ? hdfs.append(dstPath) : hdfs.create(dstPath);
    try {
      InputStream in = hdfs.open(filePath);
      try {
        IOUtils.copyBytes(in, out, conf, false);
      } finally {
        in.close();
      }
    } finally {
      out.close();
    }
  }
  if (deleteSource) {
    return hdfs.delete(srcDir, true);
  }
  return true;
}

3. Summary

Depending on the requirements of different business logic, you can customize your own implementation on top of the API's interface functions. If you have a better strategy for solving the small file merge problem, you are welcome to share it!


