Hadoop InputFormat: SequenceFileInputFormat

Source: Internet
Author: User

Inheritance: SequenceFileInputFormat extends FileInputFormat, which implements InputFormat.

 

The SequenceFileInputFormat code is as follows (it is actually very simple):

    /**
     * Overrides the FileInputFormat method. Each FileStatus in the returned
     * array corresponds to one input file; these files are later cut into
     * splits, and each split is processed by one map task.
     */
    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        // Call the parent class's listStatus, then post-process the result:
        // traverse the FileStatus[] and, whenever a folder is encountered,
        // treat it as a MapFile and replace it with its data file, which is
        // itself a SequenceFile. Note that the code does not verify that the
        // folder really is a MapFile; if the data file is missing,
        // getFileStatus fails.
        FileStatus[] files = super.listStatus(job);
        for (int i = 0; i < files.length; i++) {
            FileStatus file = files[i];
            if (file.isDir()) { // it's a MapFile
                Path dataFile = new Path(file.getPath(), MapFile.DATA_FILE_NAME);
                FileSystem fs = file.getPath().getFileSystem(job);
                // use the data file
                files[i] = fs.getFileStatus(dataFile);
            }
        }
        return files;
    }

 

Let's take a look at the listStatus(JobConf job) method of FileInputFormat:

    protected FileStatus[] listStatus(JobConf job) throws IOException {
        // All input paths in the job configuration (comma-separated there).
        Path[] dirs = getInputPaths(job);
        if (dirs.length == 0) {
            throw new IOException("No input paths specified in job");
        }

        // Get tokens for all the required FileSystems.
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);

        List<FileStatus> result = new ArrayList<FileStatus>();
        List<IOException> errors = new ArrayList<IOException>();

        // Creates a MultiPathFilter with the hiddenFileFilter and the
        // user-provided one (if any). The path filter lets you exclude
        // some files in the input folder.
        List<PathFilter> filters = new ArrayList<PathFilter>();
        filters.add(hiddenFileFilter);
        PathFilter jobFilter = getInputPathFilter(job);
        if (jobFilter != null) {
            filters.add(jobFilter);
        }
        PathFilter inputFilter = new MultiPathFilter(filters);

        // Traverse each input path.
        for (Path p : dirs) {
            FileSystem fs = p.getFileSystem(job);
            // Get everything (files and folders) matching the input path.
            FileStatus[] matches = fs.globStatus(p, inputFilter);
            if (matches == null) {
                errors.add(new IOException("Input path does not exist: " + p));
            } else if (matches.length == 0) {
                errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
            } else {
                // Traverse each match (file or folder) for this input path.
                for (FileStatus globStat : matches) {
                    if (globStat.isDir()) {
                        // Add all files and folders inside the folder to result.
                        // Note: both files and folders are added as-is; there is
                        // no recursion into deeper levels.
                        for (FileStatus stat : fs.listStatus(globStat.getPath(), inputFilter)) {
                            result.add(stat);
                        }
                    } else {
                        // A file is added to result directly; nothing checks
                        // whether it is in the format the InputFormat expects.
                        result.add(globStat);
                    }
                }
            }
        }

        if (!errors.isEmpty()) {
            throw new InvalidInputException(errors);
        }
        LOG.info("Total input paths to process : " + result.size());
        return result.toArray(new FileStatus[result.size()]);
    }

 

To summarize, SequenceFileInputFormat selects its input files by the following rules (assuming the input folder is /input):

1. Files directly in the input folder, that is, the /input/*** files.

2. Files in a subfolder of the input folder, /input.

3. The data file in a subfolder of a subfolder of the input folder, the /input/***/data file, which is mainly for MapFile.
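The combined effect of the two listStatus methods can be sketched on a local directory. This is only an illustration, not Hadoop code: it uses java.nio instead of Hadoop's FileSystem, the class name InputSelectionSketch and the helpers selectInputs and demoTree are hypothetical, and it assumes that the hidden-file filter skips names starting with "_" or "." and that MapFile.DATA_FILE_NAME is "data".

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class InputSelectionSketch {
    // Assumption: same value as Hadoop's MapFile.DATA_FILE_NAME.
    static final String DATA_FILE_NAME = "data";

    // Assumption: mirrors hiddenFileFilter (names starting with "_" or "." are skipped).
    static boolean isHidden(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith("_") || name.startsWith(".");
    }

    // Lists one level of inputDir; files are kept, folders are replaced by their data file.
    static List<Path> selectInputs(Path inputDir) {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(inputDir)) {
            for (Path child : children) {
                if (isHidden(child)) {
                    continue;
                }
                if (Files.isDirectory(child)) {
                    // Every folder is treated as a MapFile, with no further check.
                    Path data = child.resolve(DATA_FILE_NAME);
                    if (!Files.isRegularFile(data)) {
                        throw new FileNotFoundException(data.toString());
                    }
                    result.add(data);
                } else {
                    result.add(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(result);
        return result;
    }

    // Builds a small temporary tree: one plain file, one hidden file, one MapFile-like folder.
    static Path demoTree() {
        try {
            Path input = Files.createTempDirectory("input");
            Files.createFile(input.resolve("part-00000"));
            Files.createFile(input.resolve("_SUCCESS"));
            Path mapFile = Files.createDirectory(input.resolve("part-00001"));
            Files.createFile(mapFile.resolve(DATA_FILE_NAME));
            Files.createFile(mapFile.resolve("index"));
            return input;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path input = demoTree();
        for (Path p : selectInputs(input)) {
            System.out.println(input.relativize(p));
        }
    }
}
```

For the demo tree, only part-00000 and the data file under part-00001 are selected: _SUCCESS is filtered out as hidden, and index is never listed because the folder is replaced by its data file rather than traversed.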

 

After obtaining the files, how are they mapped to InputSplits (one file may map to a single InputSplit or to several)? See the code:

 

    /**
     * Splits files returned by {@link #listStatus(JobConf)} when
     * they're too big.
     */
    @SuppressWarnings("deprecation")
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        FileStatus[] files = listStatus(job);

        // Save the number of input files in the job-conf
        job.setLong(NUM_INPUT_FILES, files.length);
        long totalSize = 0; // compute total size
        for (FileStatus file : files) { // check we have valid files
            if (file.isDir()) {
                throw new IOException("Not a file: " + file.getPath());
            }
            totalSize += file.getLen();
        }

        long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
        long minSize = Math.max(job.getLong("mapred.min.split.size", 1), minSplitSize);

        // generate splits
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
        NetworkTopology clusterMap = new NetworkTopology();
        // Decide how many InputSplits each file is divided into.
        for (FileStatus file : files) {
            Path path = file.getPath();
            FileSystem fs = path.getFileSystem(job);
            long length = file.getLen();
            BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
            // A splittable file can be cut into multiple InputSplits.
            if ((length != 0) && isSplitable(fs, path)) {
                long blockSize = file.getBlockSize();
                // The split size usually ends up equal to blockSize, so the file
                // is cut along block boundaries, which is the most suitable choice.
                long splitSize = computeSplitSize(goalSize, minSize, blockSize);

                long bytesRemaining = length;
                // Cut the file from offset 0 to offset length into roughly
                // (length / splitSize + 1) parts of splitSize bytes each. The
                // parts are not all equal: the last one can be as large as
                // splitSize * SPLIT_SLOP.
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    String[] splitHosts = getSplitHosts(blkLocations,
                        length - bytesRemaining, splitSize, clusterMap);
                    splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                        splitHosts));
                    bytesRemaining -= splitSize;
                }

                // Whatever remains becomes the last split.
                if (bytesRemaining != 0) {
                    splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                        blkLocations[blkLocations.length - 1].getHosts()));
                }
            } else if (length != 0) {
                String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
                splits.add(new FileSplit(path, 0, length, splitHosts));
            } else {
                // Create empty hosts array for zero length files
                splits.add(new FileSplit(path, 0, length, new String[0]));
            }
        }
        LOG.debug("Total # of splits: " + splits.size());
        return splits.toArray(new FileSplit[splits.size()]);
    }
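The split arithmetic in getSplits can be tried out in isolation. The following is a minimal sketch, not the Hadoop implementation: the class name SplitSketch and the method splitFile are hypothetical, host lookup is left out, and two details not shown in the code above are assumptions based on Hadoop 1.x, namely that computeSplitSize returns max(minSize, min(goalSize, blockSize)) and that SPLIT_SLOP is 1.1.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    // Assumption: Hadoop's FileInputFormat uses a 10% slop factor.
    static final double SPLIT_SLOP = 1.1;

    // Assumption: same formula as FileInputFormat.computeSplitSize in Hadoop 1.x.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Returns {offset, length} pairs for one splittable file of the given length.
    static List<long[]> splitFile(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Emit full-size splits while more than SPLIT_SLOP split sizes remain.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        // The tail (up to splitSize * SPLIT_SLOP bytes) becomes the last split.
        if (bytesRemaining != 0) {
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(150 * mb, 1, 128 * mb);
        for (long[] s : splitFile(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With a 300 MB file, a 128 MB block size and a 150 MB goal size, the sketch yields three splits of 128 MB, 128 MB and 44 MB. A 140 MB file would stay a single split, because 140/128 is below the 1.1 slop factor.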

Okay, that covers everything, but you may have a question after reading the code above: a SequenceFile stores key-value records one after another. If the file is split, will that damage the original data structure? In other words, what happens if a single key-value record is divided across two FileSplits?

See this article: http://www.cnblogs.com/serendipity/articles/2112613.html
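As a rough intuition for why a split boundary need not cut a record in half, here is a simplified model, not Hadoop code. It assumes, as I understand SequenceFile, that the writer places sync markers between some records, that the reader of a split starts at the first sync point at or after the split's start offset, and that it keeps reading past the split's end until it meets the next sync point. The class SyncSketch, its methods and the offsets are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class SyncSketch {
    // Smallest sync point >= pos, or fileLength if there is none.
    static long nextSync(TreeSet<Long> syncPoints, long pos, long fileLength) {
        Long s = syncPoints.ceiling(pos);
        return s == null ? fileLength : s;
    }

    // Start offsets of the records read by the reader of split [start, end).
    static List<Long> recordsForSplit(long[] recordStarts, TreeSet<Long> syncPoints,
                                      long start, long end, long fileLength) {
        long from = nextSync(syncPoints, start, fileLength); // where this reader begins
        long to = nextSync(syncPoints, end, fileLength);     // where the next reader begins
        List<Long> owned = new ArrayList<>();
        for (long r : recordStarts) {
            if (r >= from && r < to) {
                owned.add(r);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        long fileLength = 300;
        long[] recordStarts = new long[10];
        for (int i = 0; i < recordStarts.length; i++) {
            recordStarts[i] = 30L * i; // ten 30-byte records
        }
        // Offset 0 stands for the start of the data and is treated as a sync point.
        TreeSet<Long> syncPoints = new TreeSet<>(Arrays.asList(0L, 90L, 180L, 270L));
        long[][] splits = {{0, 100}, {100, 200}, {200, 300}};
        for (long[] s : splits) {
            System.out.println("split [" + s[0] + ", " + s[1] + ") reads "
                + recordsForSplit(recordStarts, syncPoints, s[0], s[1], fileLength));
        }
    }
}
```

In this model the record that starts at offset 90 and straddles the boundary at 100 is read whole by the first reader, which continues through the records at 120 and 150 and stops at the sync point at 180, where the second reader begins. Every record is read exactly once.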

 

 

I personally feel that the InputFormat part of MapReduce is messy: without reading the source code, you often cannot tell which files will be selected.
