1. Background
In real projects, the input data often consists of many small files, where a small file is one that is smaller than the HDFS block size (128 MB by default). Every file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each object typically takes about 150 bytes. With 10 million files, each occupying one block, that is roughly 20 million objects, or approximately 3 GB of memory. With 1 billion files, the memory requirement becomes impractical. So before the project starts, we choose a suitable solution to the small-file problem of this project.
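The memory figure above can be checked with a few lines of arithmetic. This is only a rough sketch: the class and method names are made up for illustration, it assumes each small file occupies exactly one block, and it ignores directory objects.

```java
// Rough NameNode memory estimate for small files (an illustration, not a sizing tool).
class NameNodeMemoryEstimate {
    // Approximate cost per file/block object, as cited in the text.
    static final long BYTES_PER_OBJECT = 150L;

    // Each file contributes one file object plus one object per block.
    static long estimateBytes(long fileCount, long blocksPerFile) {
        long objects = fileCount * (1 + blocksPerFile);
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long tenMillion = estimateBytes(10_000_000L, 1);    // 10 million single-block files
        long oneBillion = estimateBytes(1_000_000_000L, 1); // 1 billion single-block files
        System.out.println("10 million files: " + tenMillion + " bytes");
        System.out.println("1 billion files:  " + oneBillion + " bytes");
    }
}
```

Under these assumptions, 10 million files come to 3,000,000,000 bytes (about 3 GB) and 1 billion files to about 300 GB, which is why merging small files before upload matters.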
2. Introduction
The local D:\data directory holds a data set covering 7 days, from 2012-09-17 to 2012-09-23. We need to merge the files by date into 7 large files, one per day, and upload them to HDFS.
3. Data
All data is in the local D:\data directory, as shown at the data address.
4. Analysis
Based on the requirements of the project, we complete the following steps:
1. Get all the date paths under the D:\data directory, loop through them, and get the paths of all txt files in each one with the globStatus() method.
2. Use the IOUtils.copyBytes(in, out, 4096, false) method to merge each day's files into one large file, and upload it to HDFS.
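The two steps can be tried out locally before involving a cluster. The sketch below is a stand-in that uses only java.nio and writes the merged files to a local target directory instead of HDFS; the class name LocalMergeSketch and the method mergeByDate are hypothetical and do not come from the project.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class LocalMergeSketch {
    // Merges all *.txt files of each date directory under srcRoot into
    // destRoot/<yyyyMMdd>.txt and returns the merged files.
    static List<Path> mergeByDate(Path srcRoot, Path destRoot) throws IOException {
        Files.createDirectories(destRoot);
        List<Path> merged = new ArrayList<>();
        DirectoryStream.Filter<Path> isDir = p -> Files.isDirectory(p);
        // Step 1: loop over the date directories
        try (DirectoryStream<Path> dateDirs = Files.newDirectoryStream(srcRoot, isDir)) {
            for (Path dir : dateDirs) {
                // Collect the txt files of this date with a glob
                List<Path> parts = new ArrayList<>();
                try (DirectoryStream<Path> txt = Files.newDirectoryStream(dir, "*.txt")) {
                    for (Path p : txt) {
                        parts.add(p);
                    }
                }
                Collections.sort(parts); // fixed order so reruns give the same output
                // "2012-09-17" becomes "20120917.txt"
                String name = dir.getFileName().toString().replace("-", "") + ".txt";
                Path target = destRoot.resolve(name);
                // Step 2: append every part to one large file
                try (OutputStream out = Files.newOutputStream(target)) {
                    for (Path p : parts) {
                        Files.copy(p, out);
                    }
                }
                merged.add(target);
            }
        }
        return merged;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LocalMergeSketch <source root> <target directory>
        for (Path p : mergeByDate(Path.of(args[0]), Path.of(args[1]))) {
            System.out.println("merged: " + p);
        }
    }
}
```

Sorting the parts is a choice made for this sketch only; the steps above do not say in which order the files of one day should be concatenated.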
5. Implementation
Define a custom RegexAcceptPathFilter class that implements PathFilter, for example to accept only the txt files in the D:\data\2012-09-17 date directory.
/**
 * @ProjectName FileMerge
 * @PackageName com.buaa
 * @ClassName RegexAcceptPathFilter
 * @Description accept files in regex format
 * @Author Liu Jishu
 * @Date 2016-04-18 21:58:07
 */
public static class RegexAcceptPathFilter implements PathFilter {
    private final String regex;

    public RegexAcceptPathFilter(String regex) {
        this.regex = regex;
    }

    @Override
    public boolean accept(Path path) {
        // Accept the path only if its full string form matches the regex
        boolean flag = path.toString().matches(regex);
        return flag;
    }
}
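One detail is worth checking when choosing the pattern for such a filter: in a Java regex an unescaped dot matches any character, so a pattern meant to select .txt files should escape it. The sketch below shows the difference using plain strings; the class name RegexFilterSketch and the sample paths are made up for illustration, and it relies only on String.matches, the same call the filter uses.

```java
import java.util.function.Predicate;

class RegexFilterSketch {
    // Same idea as the accept method: keep a path only if the whole string matches.
    static Predicate<String> regexFilter(String regex) {
        return path -> path.matches(regex);
    }

    public static void main(String[] args) {
        Predicate<String> escaped = regexFilter("^.*\\.txt$");  // literal ".txt" suffix
        Predicate<String> unescaped = regexFilter("^.*.txt$");  // any character before "txt"
        String[] samples = {
            "D:/data/2012-09-17/a.txt",
            "D:/data/2012-09-17/a.log",
            "D:/data/2012-09-17/notes_txt"
        };
        for (String s : samples) {
            System.out.println(s + " escaped=" + escaped.test(s)
                    + " unescaped=" + unescaped.test(s));
        }
    }
}
```

Both patterns accept a.txt and reject a.log, but only the unescaped one also accepts notes_txt, which is probably not intended.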
Implement the merge method of the main program, which merges the data set and uploads it to HDFS.
/**
 * Merge
 *
 * @param srcPath  source directory
 * @param destPath target directory
 */
public static void merge(String srcPath, String destPath) {
    try {
        // Read the configuration of the Hadoop file system
        Configuration conf = new Configuration();

        // Get the remote file system (hdfsUri is a field of the enclosing class)
        URI uri = new URI(hdfsUri);
        FileSystem remote = FileSystem.get(uri, conf);

        // Get the local file system
        FileSystem local = FileSystem.getLocal(conf);

        // Get all date directories under the data directory
        Path[] dirs = FileUtil.stat2Paths(local.globStatus(new Path(srcPath)));

        FSDataOutputStream out = null;
        FSDataInputStream in = null;

        for (Path dir : dirs) {
            // File name, e.g. 2012-09-17 becomes 20120917
            String fileName = dir.getName().replace("-", "");
            // Only accept .txt files in the directory (the dot is escaped so it matches literally)
            FileStatus[] localStatus = local.globStatus(new Path(dir + "/*"), new RegexAcceptPathFilter("^.*\\.txt$"));
            // Get all the matching files in the directory
            Path[] listedPaths = FileUtil.stat2Paths(localStatus);
            // Output path
            Path block = new Path(destPath + "/" + fileName + ".txt");
            // Open the output stream
            out = remote.create(block);
            for (Path p : listedPaths) {
                // Open the input stream
                in = local.open(p);
                // Copy the data; false leaves both streams open after the copy
                IOUtils.copyBytes(in, out, 4096, false);
                // Close the input stream
                in.close();
            }
            if (out != null) {
                // Close the output stream
                out.close();
            }
        }
    } catch (Exception e) {
        logger.error("", e);
    }
}
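The copy step is what actually concatenates the files: each input is read in 4096-byte chunks and appended to the same output stream, and passing false for the last argument keeps the output open for the next file. The sketch below imitates that behaviour with plain java.io streams so it can run without Hadoop. It is an illustration only, not Hadoop's source; CopyBytesSketch and mergeParts are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class CopyBytesSketch {
    // Reads up to buffSize bytes at a time from in and appends them to out.
    // Both streams are closed at the end only when close is true.
    static void copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)
            throws IOException {
        try {
            byte[] buf = new byte[buffSize];
            int n;
            while ((n = in.read(buf)) >= 0) {
                out.write(buf, 0, n);
            }
        } finally {
            if (close) {
                in.close();
                out.close();
            }
        }
    }

    // Appends several in-memory "files" to one output, mirroring the merge loop.
    static String mergeParts(String... parts) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : parts) {
            InputStream in = new ByteArrayInputStream(part.getBytes(StandardCharsets.UTF_8));
            copyBytes(in, out, 4096, false); // keep out open for the next part
            in.close();
        }
        out.close();
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        System.out.print(mergeParts("part one\n", "part two\n"));
    }
}
```

If true were passed instead, the output would be closed after the first file, so the merge loop depends on false and on closing the output once per date directory.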
6. Running the code
/**
 * Main method
 *
 * @param args
 */
public static void main(String[] args) {
    merge("D:\\data\\*", "/buaa");
}
7. Results
If you are interested in the topics covered on this blog, please keep following my later posts. I am "Liu Chao ★ljc".
The copyright of this article belongs to the author and Blog Park. Reprinting is welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original must be placed in a prominent position on the article page; otherwise the author reserves the right to pursue legal liability.
Address: Download
Hadoop Small file Merge