Optimization of the HDFS Small File Merging Problem: An Improvement of copyMerge

Source: Internet
Author: User

1. Problem analysis

Use the fsck command to check the total size, block count, and average block size of one day's logs in HDFS:

[hduser@da-master jar]$ hadoop fsck /wcc/da/kafka/report/2015-01-11
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
15/01/13 18:57:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Connecting to namenode via http://da-master:50070
FSCK started by hduser (auth:SIMPLE) from /172.21.101.30 for path /wcc/da/kafka/report/2015-01-11 at Tue Jan 13 18:57:24 CST 2015
...........................................................................
Status: HEALTHY
 Total size:                  9562516137 B
 Total dirs:                  1
 Total files:                 240
 Total symlinks:              0
 Total blocks (validated):    264 (avg. block size 36221652 B)
 Minimally replicated blocks: 264 (100.0 %)
 Over-replicated blocks:      0 (0.0 %)
 Under-replicated blocks:     0 (0.0 %)
 Mis-replicated blocks:       0 (0.0 %)
 Default replication factor:  2
 Average block replication:   2.0
 Corrupt blocks:              0
 Missing replicas:            0 (0.0 %)
 Number of data-nodes:        5
 Number of racks:             1
FSCK ended at Tue Jan 13 18:57:24 CST in ... milliseconds

The filesystem under path '/wcc/da/kafka/report/2015-01-11' is HEALTHY

The per-day results are summarized in the following table:

Date         Total (GB)   Total Blocks   Avg Block Size (MB)
2014/12/21   9.39         268            36
2014/12/20   9.5          268            36
2014/12/19   8.89         268            34
2014/11/5    8.6          266            33
2014/10/1    9.31         268            36
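These per-day figures can also be gathered programmatically instead of by running fsck for each date and reading its output by hand. Below is a minimal sketch (not from the original article; the class name and path are illustrative, and it assumes the per-day directory is flat, as the fsck output above suggests with "Total dirs: 1") that walks one day's directory with the FileSystem API and reports total size, block count, and average block size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockStats {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dayDir = new Path("/wcc/da/kafka/report/2015-01-11");  // one day's log directory

        long totalSize = 0, totalBlocks = 0, totalFiles = 0;
        for (FileStatus status : fs.listStatus(dayDir)) {
            if (!status.isFile()) {
                continue;
            }
            totalFiles++;
            totalSize += status.getLen();
            // Count the blocks actually used by this file
            totalBlocks += fs.getFileBlockLocations(status, 0, status.getLen()).length;
        }

        System.out.println("Files:  " + totalFiles);
        System.out.println("Size:   " + totalSize + " B");
        System.out.println("Blocks: " + totalBlocks);
        if (totalBlocks > 0) {
            System.out.println("Avg block size: " + (totalSize / totalBlocks / (1024 * 1024)) + " MB");
        }
    }
}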


Analysis of the problem: the table shows that each day's logs occupy about 268 blocks with an average block size of roughly 36 MB, far below the 128 MB block size. This points to an underlying problem: the logs produce a large number of small files, most of them well under 128 MB, which seriously hurts the scalability and performance of the cluster.

First, in HDFS every block, file, and directory is held in NameNode memory as an object of about 150 bytes. If there are 10,000,000 small files and each file occupies its own block, the NameNode needs roughly 2 GB of memory (10,000,000 objects at 150 bytes each is already about 1.5 GB); with 100 million files it needs about 20 GB. NameNode memory capacity therefore severely restricts how far the cluster can grow.

Second, accessing a large number of small files is much slower than accessing a few large files of the same total size. HDFS was designed for streaming access to large files; reading many small files forces the client to keep jumping from one DataNode to another, which severely impacts performance.

Finally, processing a large number of small files is much slower than processing large files of equal total size, because each small file occupies its own task slot. Starting and releasing tasks can take a large share, even most, of the total running time, and that overhead accumulates.

Our strategy is therefore to merge the small files first, e.g. into user_report.tsv, client_report.tsv, and applog_userdevice.tsv, and then run the job.


2. Solution

One candidate is the copyMerge utility method in the Hadoop API's FileUtil class:

copyMerge(FileSystem srcFS, Path srcDir, FileSystem dstFS, Path dstFile, boolean deleteSource, Configuration conf, String addString);
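For reference, here is a minimal sketch (not from the original article) of how copyMerge would be invoked to merge one day's directory into a single file; the destination path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyMergeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path srcDir = new Path("/wcc/da/kafka/report/2015-01-11");       // one day's log directory
        Path dstFile = new Path("/wcc/da/kafka/merged/2015-01-11.tsv");  // hypothetical merged output
        boolean ok = FileUtil.copyMerge(fs, srcDir, fs, dstFile,
                false,   // keep the source files
                conf,
                null);   // no extra string appended after each file
        System.out.println("copyMerge returned " + ok);
    }
}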

However, this method does not fit our case, because each day's directory contains three types of logs that need to be merged separately into three files: user_report.tsv, client_report.tsv, and applog_userdevice.tsv. We therefore need to re-implement the copyMerge logic ourselves. First, let's analyze the copyMerge source:

/** Copy all files in a directory to one output file (merge). */
public static boolean copyMerge(FileSystem srcFS, Path srcDir,
                                FileSystem dstFS, Path dstFile,
                                boolean deleteSource,
                                Configuration conf, String addString) throws IOException {
  // Generate the merged target path dstFile; the file name is srcDir.getName(), i.e. the
  // name of the source directory, so the merged file name cannot be customized here -- very inconvenient.
  dstFile = checkDest(srcDir.getName(), dstFS, dstFile, false);

  // Make sure the source path is a directory
  if (!srcFS.getFileStatus(srcDir).isDirectory())
    return false;

  // Create the output stream for the target file
  OutputStream out = dstFS.create(dstFile);

  try {
    // List every file under the source directory
    FileStatus contents[] = srcFS.listStatus(srcDir);
    // Sort the entries
    Arrays.sort(contents);
    for (int i = 0; i < contents.length; i++) {
      if (contents[i].isFile()) {
        // Open an input stream on the source file
        InputStream in = srcFS.open(contents[i].getPath());
        try {
          // Copy its bytes into the target file
          IOUtils.copyBytes(in, out, conf, false);
          if (addString != null)
            out.write(addString.getBytes("UTF-8"));
        } finally {
          in.close();
        }
      }
    }
  } finally {
    out.close();
  }

  // If deleteSource is true, delete the source directory and everything under it
  if (deleteSource) {
    return srcFS.delete(srcDir, true);
  } else {
    return true;
  }
}

After the improvement (this version only needs to open and close the output stream three times, once per merged target file):

/** Copy corresponding files in a directory to related output files (merge). */
@SuppressWarnings("unchecked")
public static boolean merge(FileSystem hdfs, Path srcDir, Path dstFile,
                            boolean deleteSource, Configuration conf) throws IOException {
  if (!hdfs.getFileStatus(srcDir).isDirectory())
    return false;

  // List every file under the source directory
  FileStatus[] fileStatus = hdfs.listStatus(srcDir);

  // Collect the paths of the three different log types into separate lists
  ArrayList<Path> userPaths = new ArrayList<Path>();
  ArrayList<Path> clientPaths = new ArrayList<Path>();
  ArrayList<Path> appPaths = new ArrayList<Path>();
  for (FileStatus fileStatu : fileStatus) {
    Path filePath = fileStatu.getPath();
    if (filePath.getName().startsWith("user_report")) {
      userPaths.add(filePath);
    } else if (filePath.getName().startsWith("client_report")) {
      clientPaths.add(filePath);
    } else if (filePath.getName().startsWith("applog_userdevice")) {
      appPaths.add(filePath);
    }
  }

  // Write the merged target file user_report.tsv
  if (userPaths.size() > 0) {
    Path userDstFile = new Path(dstFile.toString() + "/user_report.tsv");
    OutputStream out = hdfs.create(userDstFile);
    Collections.sort(userPaths);
    try {
      Iterator<Path> iterator = userPaths.iterator();
      while (iterator.hasNext()) {
        InputStream in = hdfs.open(iterator.next());
        try {
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }

  // Write the merged target file client_report.tsv
  if (clientPaths.size() > 0) {
    Path clientDstFile = new Path(dstFile.toString() + "/client_report.tsv");
    OutputStream out = hdfs.create(clientDstFile);
    Collections.sort(clientPaths);
    try {
      Iterator<Path> iterator = clientPaths.iterator();
      while (iterator.hasNext()) {
        InputStream in = hdfs.open(iterator.next());
        try {
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }

  // Write the merged target file applog_userdevice.tsv
  if (appPaths.size() > 0) {
    Path appDstFile = new Path(dstFile.toString() + "/applog_userdevice.tsv");
    OutputStream out = hdfs.create(appDstFile);
    Collections.sort(appPaths);
    try {
      Iterator<Path> iterator = appPaths.iterator();
      while (iterator.hasNext()) {
        InputStream in = hdfs.open(iterator.next());
        try {
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
      }
    } finally {
      out.close();
    }
  }

  if (deleteSource) {
    return hdfs.delete(srcDir, true);
  }
  return true;
}
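A hypothetical invocation of the merge() helper above, for example in a driver right before submitting the MapReduce job; the destination directory is illustrative, and merge() is assumed to be defined in, or copied into, the same class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        // Source: one day's raw log directory; destination: a directory that will receive
        // the merged user_report.tsv, client_report.tsv and applog_userdevice.tsv.
        Path srcDir = new Path("/wcc/da/kafka/report/2015-01-11");
        Path dstDir = new Path("/wcc/da/kafka/merged/2015-01-11");
        // merge() is the helper shown above (assumed to live in this class)
        boolean merged = merge(hdfs, srcDir, dstDir, false, conf);  // false = keep the source files
        System.out.println("merge returned " + merged);
        // ...submit the MapReduce job over dstDir here...
    }
}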

Of course, you can also achieve the same result like this (this version opens and closes an output stream once per input file, appending to the corresponding target as it goes):

public static boolean mergeFiles(FileSystem hdfs, Path srcDir, Path dstFile,
                                 boolean deleteSource, Configuration conf) throws IOException {
  if (!hdfs.getFileStatus(srcDir).isDirectory())
    return false;

  // List every file under the source directory
  FileStatus[] fileStatus = hdfs.listStatus(srcDir);

  // Merge each of the three different log types into its own target file
  for (FileStatus fileStatu : fileStatus) {
    Path filePath = fileStatu.getPath();
    Path dstPath;
    if (filePath.getName().startsWith("user_report")) {
      dstPath = new Path(dstFile.toString() + "/user_report.tsv");
    } else if (filePath.getName().startsWith("client_report")) {
      dstPath = new Path(dstFile.toString() + "/client_report.tsv");
    } else if (filePath.getName().startsWith("applog_userdevice")) {
      dstPath = new Path(dstFile.toString() + "/applog_userdevice.tsv");
    } else {
      dstPath = new Path("/error.tsv");
    }
    // Append when the target already exists so that files merged earlier in this loop
    // are not lost (hdfs.create() would truncate the target on every iteration).
    OutputStream out = hdfs.exists(dstPath) ? hdfs.append(dstPath) : hdfs.create(dstPath);
    try {
      InputStream in = hdfs.open(filePath);
      try {
        IOUtils.copyBytes(in, out, conf, false);
      } finally {
        in.close();
      }
    } finally {
      out.close();
    }
  }
  if (deleteSource) {
    return hdfs.delete(srcDir, true);
  }
  return true;
}

3. Summary

Depending on the requirements of different business logic, you can customize your own implementation on top of the API's interface functions. If you have a better strategy for solving the small file merge problem, you are welcome to share it!


