Inheritance: SequenceFileInputFormat extends FileInputFormat, which implements InputFormat.
The code of SequenceFileInputFormat is as follows (it is actually very simple):
/**
 * Overrides the method of FileInputFormat. The FileStatus[] obtained from
 * FileInputFormat lists the files the maps will run over; each FileStatus
 * corresponds to one file.
 */
@Override
protected FileStatus[] listStatus(JobConf job) throws IOException {
  // Call listStatus of the parent class, then post-process the result.
  FileStatus[] files = super.listStatus(job);
  // Traverse the FileStatus[]. A directory is treated as a MapFile and is
  // replaced by its data file, which is itself a SequenceFile.
  // Note: the code does not check that the directory really is a MapFile,
  // and it does not filter out other directories.
  for (int i = 0; i < files.length; i++) {
    FileStatus file = files[i];
    if (file.isDir()) { // it's a MapFile
      Path dataFile = new Path(file.getPath(), MapFile.DATA_FILE_NAME);
      FileSystem fs = file.getPath().getFileSystem(job);
      // use the data file
      files[i] = fs.getFileStatus(dataFile);
    }
  }
  return files;
}
Let's take a look at the listStatus(JobConf job) method of FileInputFormat:
protected FileStatus[] listStatus(JobConf job) throws IOException {
  // All input paths in the job configuration, separated by commas.
  Path[] dirs = getInputPaths(job);
  if (dirs.length == 0) {
    throw new IOException("No input paths specified in job");
  }

  // Get tokens for all the required FileSystems.
  TokenCache.obtainTokensForNamenodes(job.getCredentials(), dirs, job);

  List<FileStatus> result = new ArrayList<FileStatus>();
  List<IOException> errors = new ArrayList<IOException>();

  // Creates a MultiPathFilter with the hiddenFileFilter and the
  // user-provided one (if any).
  // These path filters let you exclude some files in the input folder.
  List<PathFilter> filters = new ArrayList<PathFilter>();
  filters.add(hiddenFileFilter);
  PathFilter jobFilter = getInputPathFilter(job);
  if (jobFilter != null) {
    filters.add(jobFilter);
  }
  PathFilter inputFilter = new MultiPathFilter(filters);

  // Traverse each input path.
  for (Path p : dirs) {
    FileSystem fs = p.getFileSystem(job);
    // Get everything (files and folders) that matches the input path.
    FileStatus[] matches = fs.globStatus(p, inputFilter);
    if (matches == null) {
      errors.add(new IOException("Input path does not exist: " + p));
    } else if (matches.length == 0) {
      errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
    } else {
      // Traverse each match (file or folder).
      for (FileStatus globStat : matches) {
        if (globStat.isDir()) {
          // Add all files and folders inside this folder to the result.
          // Note: both files and folders are added as they are; the
          // listing does not recurse into deeper levels.
          for (FileStatus stat : fs.listStatus(globStat.getPath(), inputFilter)) {
            result.add(stat);
          }
        } else {
          // A plain file is added directly. Nothing checks whether the
          // file is in the format this InputFormat expects.
          result.add(globStat);
        }
      }
    }
  }

  if (!errors.isEmpty()) {
    throw new InvalidInputException(errors);
  }
  LOG.info("Total input paths to process : " + result.size());
  return result.toArray(new FileStatus[result.size()]);
}
To summarize the rules by which SequenceFileInputFormat selects its input files (assuming the input folder is /input):
1. Files directly in the input folder, that is, /input/***.
2. Files in a subfolder of the input folder, /input.
3. The data file in a subfolder of the input folder, /input/***/data, which mainly serves MapFile.
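The selection rules above can be imitated on a local directory tree. The following is only a minimal sketch using java.nio.file rather than the Hadoop FileSystem API; the class InputSelectionSketch, its methods, and the file names are hypothetical, and the hidden-file rule (names starting with "_" or ".") is my assumption about what hiddenFileFilter does, since the listing above does not show it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical local-filesystem imitation of the two listStatus methods.
final class InputSelectionSketch {

    // Assumption: hidden files are those whose name starts with '_' or '.'.
    static boolean isHidden(Path p) {
        String name = p.getFileName().toString();
        return name.startsWith("_") || name.startsWith(".");
    }

    // One level of listing, then every directory is replaced by <dir>/data.
    static List<Path> select(Path inputDir) {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(inputDir)) {
            for (Path child : children) {
                if (isHidden(child)) {
                    continue;
                }
                if (Files.isDirectory(child)) {
                    // Treated as a MapFile: only its data file is used.
                    // Unlike the real code, this sketch does not fail when
                    // the data file is missing.
                    result.add(child.resolve("data"));
                } else {
                    result.add(child);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(result);
        return result;
    }

    // Builds a small throwaway tree: one plain file, one hidden file,
    // and one MapFile-like directory containing data and index.
    static Path demoTree() {
        try {
            Path input = Files.createTempDirectory("input");
            Files.createFile(input.resolve("part-00000"));
            Files.createFile(input.resolve("_SUCCESS"));
            Path mapFile = Files.createDirectory(input.resolve("mapfile-0"));
            Files.createFile(mapFile.resolve("data"));
            Files.createFile(mapFile.resolve("index"));
            return input;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path input = demoTree();
        for (Path p : select(input)) {
            System.out.println(input.relativize(p));
        }
    }
}
```

For the demo tree, the selection contains mapfile-0/data and part-00000: the hidden file is skipped, and the index file is never looked at because the whole directory is replaced by its data file.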
After obtaining the files, how are they mapped to InputSplits (one file may map to a single InputSplit or to several)? See the code:
/** Splits files returned by {@link #listStatus(JobConf)} when
 * they're too big. */
@SuppressWarnings("deprecation")
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
  FileStatus[] files = listStatus(job);

  // Save the number of input files in the job-conf
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0; // compute total size
  for (FileStatus file : files) {
    // check we have valid files
    if (file.isDir()) {
      throw new IOException("Not a file: " + file.getPath());
    }
    totalSize += file.getLen();
  }

  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  long minSize = Math.max(job.getLong("mapred.min.split.size", 1), minSplitSize);

  // generate splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  // How many InputSplits should each file be divided into?
  for (FileStatus file : files) {
    Path path = file.getPath();
    FileSystem fs = path.getFileSystem(job);
    long length = file.getLen();
    BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
    // A splittable file can be cut into several InputSplits.
    if ((length != 0) && isSplitable(fs, path)) {
      long blockSize = file.getBlockSize();
      // The split size usually comes out equal to blockSize, so that
      // splits line up with blocks, which is the most suitable choice.
      long splitSize = computeSplitSize(goalSize, minSize, blockSize);

      long bytesRemaining = length;
      // Cut the file from offset 0 to offset length into roughly
      // (length / splitSize + 1) pieces of splitSize bytes each. The pieces
      // are not all the same size: the last one can be as large as
      // splitSize * SPLIT_SLOP.
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        String[] splitHosts = getSplitHosts(blkLocations,
            length - bytesRemaining, splitSize, clusterMap);
        splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
            splitHosts));
        bytesRemaining -= splitSize;
      }

      // Whatever is left becomes the last split.
      if (bytesRemaining != 0) {
        splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
            blkLocations[blkLocations.length - 1].getHosts()));
      }
    } else if (length != 0) {
      // Not splittable: the whole file is one split.
      String[] splitHosts = getSplitHosts(blkLocations, 0, length, clusterMap);
      splits.add(new FileSplit(path, 0, length, splitHosts));
    } else {
      // Create empty hosts array for zero length files
      splits.add(new FileSplit(path, 0, length, new String[0]));
    }
  }
  LOG.debug("Total # of splits: " + splits.size());
  return splits.toArray(new FileSplit[splits.size()]);
}
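To make the splitting arithmetic in getSplits concrete, here is a small standalone sketch that works on plain numbers instead of files. The class SplitSketch and its method split are hypothetical. The body of computeSplitSize and the value 1.1 for SPLIT_SLOP are not shown in the listing above; they are written from my recollection of Hadoop's FileInputFormat and should be treated as assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Hadoop-free model of the split loop in getSplits.
final class SplitSketch {

    // Assumption: Hadoop uses a 10% slop for the last split.
    static final double SPLIT_SLOP = 1.1;

    // Assumption: the split size is the goal size capped by the block
    // size, but never below the configured minimum.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Returns {offset, length} pairs for a splittable file of the given length.
    static List<long[]> split(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        if (length == 0) {
            // Zero-length files still produce one empty split.
            splits.add(new long[] {0, 0});
            return splits;
        }
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // The remainder may be up to splitSize * SPLIT_SLOP bytes.
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long length : new long[] {250, 205}) {
            System.out.println("length " + length + ":");
            for (long[] s : split(length, 100)) {
                System.out.println("  offset=" + s[0] + " length=" + s[1]);
            }
        }
    }
}
```

With a split size of 100, a 250-byte file yields three splits (100, 100, 50), while a 205-byte file yields only two (100, 105): the trailing 5 bytes are within the slop, so they are folded into the last split rather than getting a tiny split of their own.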
Okay, that covers everything, but you may have a question after reading the code above: a SequenceFile stores key-value records one after another. If the file is split, will that damage the original data structure? In other words, what happens if a key-value record is cut across two FileSplits?
See this article: http://www.cnblogs.com/serendipity/articles/2112613.html
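As a rough illustration of the usual answer to that question, the sketch below models a reader that relies on sync markers: each split's reader skips forward to the first sync point at or after its start offset, and keeps reading past its end offset until it reaches the next sync point. This is a toy model under my own assumptions about how SequenceFile readers behave, not Hadoop's actual reader; the class SyncSketch, the Rec type, and the offsets are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of sync-marker based reading of one split.
final class SyncSketch {

    // A record: the offset where it starts, whether a sync marker
    // precedes it, and its key.
    static final class Rec {
        final long pos;
        final boolean sync;
        final String key;

        Rec(long pos, boolean sync, String key) {
            this.pos = pos;
            this.sync = sync;
            this.key = key;
        }
    }

    // Returns the keys that the reader of split [start, end) would emit.
    static List<String> read(List<Rec> file, long start, long end) {
        List<String> out = new ArrayList<>();
        int i = 0;
        // Skip to the first sync point at or after the split start.
        while (i < file.size() && !(file.get(i).sync && file.get(i).pos >= start)) {
            i++;
        }
        for (; i < file.size(); i++) {
            Rec r = file.get(i);
            // Stop at the first sync point at or beyond the split end;
            // the next split's reader starts exactly there.
            if (r.pos >= end && r.sync) {
                break;
            }
            out.add(r.key);
        }
        return out;
    }

    // Eight records of 40 bytes each, with a sync marker every third record.
    static List<Rec> demoFile() {
        return Arrays.asList(
            new Rec(0, true, "a"), new Rec(40, false, "b"), new Rec(80, false, "c"),
            new Rec(120, true, "d"), new Rec(160, false, "e"), new Rec(200, false, "f"),
            new Rec(240, true, "g"), new Rec(280, false, "h"));
    }

    public static void main(String[] args) {
        List<Rec> file = demoFile();
        System.out.println(read(file, 0, 100));
        System.out.println(read(file, 100, 200));
        System.out.println(read(file, 200, 320));
    }
}
```

In the demo file, record c spans the boundary at offset 100 and record f starts exactly at the boundary at offset 200. The three readers emit [a, b, c], [d, e, f] and [g, h], so in this model every record is read whole and exactly once even though the split boundaries ignore record boundaries.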
Personally, I find MapReduce's InputFormat machinery rather messy: unless you read the source code, you will often never know which files actually get selected.