Input InputFormat -- SequenceFileInputFormat

Last Update:2018-12-07 Source: Internet

Author: User


Inheritance: SequenceFileInputFormat extends FileInputFormat, which implements InputFormat.

The SequenceFileInputFormat code is as follows (it is actually very simple):

    /**
     * Overrides the FileInputFormat method. The FileStatus[] obtained from
     * FileInputFormat lists the files the map tasks will run over; each
     * FileStatus corresponds to one file.
     */
    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        FileStatus[] files = super.listStatus(job);
        /*
         * Call the listStatus method of the parent class, then post-process
         * the result: traverse the returned FileStatus[] and, whenever a
         * folder is encountered, treat it as a MapFile and replace it with
         * its data file, which is itself a SequenceFile. The code shown
         * performs no check of its own that the folder really is a MapFile,
         * and it does not filter out other folders.
         */
        for (int i = 0; i < files.length; i++) {
            FileStatus file = files[i];
            if (file.isDir()) { // it's a MapFile
                Path dataFile = new Path(file.getPath(), MapFile.DATA_FILE_NAME);
                FileSystem fs = file.getPath().getFileSystem(job);
                // use the data file
                files[i] = fs.getFileStatus(dataFile);
            }
        }
        return files;
    }

Let's take a look at the listStatus(JobConf job) method of FileInputFormat:

    protected FileStatus[] listStatus(JobConf job) throws IOException {
        // All input paths from the job configuration (a comma-separated list).
        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
            throw new IOException("No input paths specified in job");
        }

        // Get tokens for all the required FileSystems.
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);

        List<FileStatus> result = new ArrayList<FileStatus>();
        List<IOException> errors = new ArrayList<IOException>();

        // Creates a MultiPathFilter with the hiddenFileFilter and the
        // user provided one (if any).
        // These path filters let you exclude some files in the input folder.
        List<PathFilter> filters = new ArrayList<PathFilter>();
        filters.add(hiddenFileFilter);
        PathFilter jobFilter = getInputPathFilter(job);
        if (jobFilter != null) {
            filters.add(jobFilter);
        }
        PathFilter inputFilter = new MultiPathFilter(filters);

        // Traverse each input path.
        for (Path p : dirs) {
            FileSystem fs = p.getFileSystem(job);
            // Obtain all files (or folders) that match the input path.
            FileStatus[] matches = fs.globStatus(p, inputFilter);
            if (matches == null) {
                errors.add(new IOException("Input path does not exist: " + p));
            } else if (matches.length == 0) {
                errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
            } else {
                // Traverse each file (or folder) matched by the input path.
                for (FileStatus globStat : matches) {
                    if (globStat.isDir()) {
                        // Add all the files and folders inside this folder to the result.
                        // Note: files and folders alike are added as they are;
                        // there is no recursion into deeper levels.
                        for (FileStatus stat : fs.listStatus(globStat.getPath(), inputFilter)) {
                            result.add(stat);
                        }
                    } else {
                        // A plain file is added to the result directly. There is no check
                        // that the file is in the format the InputFormat requires.
                        result.add(globStat);
                    }
                }
            }
        }

        if (!errors.isEmpty()) {
            throw new InvalidInputException(errors);
        }
        LOG.info("Total input paths to process : " + result.size());
        return result.toArray(new FileStatus[result.size()]);
    }
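The filter handling in the listing can be illustrated without Hadoop on the classpath. The following is only a sketch under assumptions: it models paths as plain file names, it assumes that hiddenFileFilter rejects names starting with "_" or "." (the listing does not show its body), and it assumes that MultiPathFilter accepts a path only when every filter accepts it. The class name PathFilterSketch and the ".tmp" user filter are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class PathFilterSketch {
    // Assumed behaviour of hiddenFileFilter: reject names starting with "_" or ".".
    static final Predicate<String> HIDDEN_FILE_FILTER =
        name -> !name.startsWith("_") && !name.startsWith(".");

    // Assumed behaviour of MultiPathFilter: accept only if every filter accepts.
    static Predicate<String> allOf(List<Predicate<String>> filters) {
        return name -> {
            for (Predicate<String> f : filters) {
                if (!f.test(name)) {
                    return false;
                }
            }
            return true;
        };
    }

    // Keep only the names that pass the combined filter.
    static List<String> select(List<String> names, Predicate<String> filter) {
        List<String> out = new ArrayList<>();
        for (String n : names) {
            if (filter.test(n)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(HIDDEN_FILE_FILTER);
        filters.add(name -> !name.endsWith(".tmp")); // hypothetical user-provided filter
        List<String> names =
            Arrays.asList("part-00000", "_SUCCESS", ".part-00000.crc", "scratch.tmp");
        System.out.println(select(names, allOf(filters))); // prints [part-00000]
    }
}
```

If those assumptions hold, marker files such as _SUCCESS and checksum files such as .part-00000.crc never reach the result list, even though nothing else in listStatus inspects file contents.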

To summarize, these are the rules by which SequenceFileInputFormat selects its input files (assuming the input folder is /input):

1. Files in the input folder, that is, the /input/*** files.

2. Files in a subfolder of the input folder, /input.

3. The data file in the subfolder of a subfolder of the input folder, /input/***/data file, which is mainly for MapFile.
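The two listStatus steps can be replayed on a toy directory tree to see which files come out. This is only a sketch: the in-memory TREE map, the class name InputSelectionSketch, and the sample file names are all made up, and hidden names are assumed to be those starting with "_" or ".". The file name "data" stands in for MapFile.DATA_FILE_NAME.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InputSelectionSketch {
    // Toy file system (hypothetical): directory path -> names of its children.
    // Any path that appears as a key is a directory; everything else is a file.
    static final Map<String, List<String>> TREE = new TreeMap<>();
    static {
        TREE.put("/input", Arrays.asList("a.seq", "_logs", "map1"));
        TREE.put("/input/_logs", Collections.<String>emptyList());
        TREE.put("/input/map1", Arrays.asList("data", "index"));
    }

    static boolean isDir(String path) {
        return TREE.containsKey(path);
    }

    // Assumed hidden-file rule: names starting with "_" or "." are skipped.
    static boolean visible(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Step 1, FileInputFormat-like: list exactly one level below the input folder.
    static List<String> listOneLevel(String inputDir) {
        List<String> result = new ArrayList<>();
        for (String child : TREE.get(inputDir)) {
            if (visible(child)) {
                result.add(inputDir + "/" + child);
            }
        }
        return result;
    }

    // Step 2, SequenceFileInputFormat-like: replace each folder with its "data" file.
    static List<String> resolveMapFiles(List<String> entries) {
        List<String> result = new ArrayList<>();
        for (String entry : entries) {
            result.add(isDir(entry) ? entry + "/data" : entry);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [/input/a.seq, /input/map1/data]
        System.out.println(resolveMapFiles(listOneLevel("/input")));
    }
}
```

In this toy run, the plain file is kept as it is, the hidden folder is dropped by the filter, and the MapFile folder contributes only its data file; the index file is never selected.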

After obtaining the files, how are they mapped to InputSplits (one file may map to a single InputSplit or to several)? See the code:

    /**
     * Splits files returned by {@link #listStatus(JobConf)} when
     * they're too big.
     */
    @SuppressWarnings("deprecation")
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        FileStatus[] files = listStatus(job);

        // Save the number of input files in the job-conf
        job.setLong(NUM_INPUT_FILES, files.length);
        long totalSize = 0; // compute total size
        for (FileStatus file : files) { // check we have valid files
            if (file.isDir()) {
                throw new IOException("Not a file: " + file.getPath());
            }
            totalSize += file.getLen();
        }

        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
        long minSize = Math.max(job.getLong("mapred.min.split.size", 1), minSplitSize);

        // generate splits
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
        NetworkTopology clusterMap = new NetworkTopology();
        // Decide how many InputSplits each file is divided into.
        for (FileStatus file : files) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job);
            long length = file.getLen();
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
            // A splittable file can be cut into multiple InputSplits.
            if ((length != 0) && isSplitable(fs, path)) {
                long blockSize = file.getBlockSize();
                // Usually the split size ends up equal to blockSize, so the file
                // is cut block by block, which is the most suitable choice.
                long splitSize = computeSplitSize(goalSize, minSize, blockSize);

                long bytesRemaining = length;
                // Cut the file from offset 0 to offset length into roughly
                // (length / splitSize + 1) parts of splitSize bytes each. The parts
                // are not all the same size: the last one can be as large as
                // splitSize * SPLIT_SLOP.
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    String[] splitHosts = getSplitHosts(blkLocations,
                        length - bytesRemaining, splitSize, clusterMap);
                    splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                        splitHosts));
                    bytesRemaining -= splitSize;
                }

                // Whatever remains becomes the last split.
                if (bytesRemaining != 0) {
                    splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                        blkLocations[blkLocations.length - 1].getHosts()));
                }
            } else if (length != 0) {
                String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
                splits.add(new FileSplit(path, 0, length, splitHosts));
            } else {
                // Create empty hosts array for zero length files
                splits.add(new FileSplit(path, 0, length, new String[0]));
            }
        }
        LOG.debug("Total # of splits: " + splits.size());
        return splits.toArray(new FileSplit[splits.size()]);
    }
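The split arithmetic is easier to follow with concrete numbers. The sketch below reproduces only the size calculation and the while loop, with no Hadoop types. Two things are assumptions, because the listing does not show them: that computeSplitSize returns max(minSize, min(goalSize, blockSize)), and that SPLIT_SLOP is 1.1. The class name SplitSketch and the 130 MB file are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Assumed value of SPLIT_SLOP: the last split may be up to 10% oversized.
    static final double SPLIT_SLOP = 1.1;

    // Assumed formula for computeSplitSize.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Returns {offset, length} pairs for one splittable file of the given length.
    static List<long[]> split(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // The tail is kept as one split, even if it is larger than splitSize.
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 130 * mb; // hypothetical single input file
        long goalSize = fileLength; // totalSize / numSplits with numSplits = 1
        long splitSize = computeSplitSize(goalSize, 1, 64 * mb); // 64 MB
        for (long[] s : split(fileLength, splitSize)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

Under these assumptions, the 130 MB file yields two splits rather than three: one of 64 MB at offset 0 and one of 66 MB at offset 64 MB, because the remaining 66 MB is within 1.1 times the split size and is not cut again.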

Okay, that covers everything, but you may have a question after reading the code above: a SequenceFile stores key-value records one after another. If the file is split, will that damage the original data structure? In other words, what happens if a key-value record is divided between two FileSplits?

See the article: http://www.cnblogs.com/serendipity/articles/2112613.html

I personally feel that the MapReduce InputFormat machinery is messy; often, unless you read the source code, you will never know which files are selected.
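As a rough intuition for the answer, and not a substitute for the linked article: as far as I understand it, a SequenceFile contains periodic sync markers, and the reader of a split skips forward to the first sync marker at or after the start of its split, then keeps reading past the end of the split until it reaches the next sync marker. The sketch below simulates only that ownership rule on made-up offsets. The class name SyncMarkerSketch, the sync positions, and the exact boundary handling are assumptions; the real record reader differs in detail.

```java
class SyncMarkerSketch {
    // Returns {from, to}: the byte range a reader for split [start, end) would
    // actually consume under the simplified rule, or null if no sync marker
    // falls inside the split. syncPoints must be sorted in ascending order.
    static long[] readRange(long[] syncPoints, long fileLength, long start, long end) {
        long from = -1;
        for (long p : syncPoints) {
            if (p >= start) { // first sync marker at or after the split start
                from = p;
                break;
            }
        }
        if (from < 0 || from >= end) {
            return null; // nothing begins inside this split
        }
        long to = fileLength;
        for (long p : syncPoints) {
            if (p >= end) { // first sync marker at or after the split end
                to = p;
                break;
            }
        }
        return new long[] {from, to};
    }

    public static void main(String[] args) {
        long[] syncPoints = {0, 70, 150, 230}; // hypothetical sync marker offsets
        long fileLength = 260;
        long splitSize = 100;
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            long[] r = readRange(syncPoints, fileLength, start, end);
            String range = (r == null) ? "nothing" : "bytes [" + r[0] + "," + r[1] + ")";
            System.out.println("split [" + start + "," + end + ") reads " + range);
        }
    }
}
```

With these offsets the three splits read [0,150), [150,230) and [230,260): the ranges differ from the split boundaries, but they do not overlap and together cover the whole file, so a record that straddles a split boundary is read whole by exactly one reader.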

This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
