Hadoop FileInputFormat implementation principle and source code analysis

FileInputFormat (org.apache.hadoop.mapreduce.lib.input.FileInputFormat) is an abstract class designed for file-based data sources. It provides two things: (1) static methods for defining a job's input files; (2) a generic implementation that generates input splits for those files. How the data inside a split is converted into individual "records" is left to concrete subclasses, which implement it differently for different file types.
FileInputFormat input paths

FileInputFormat provides four static methods for defining the input file paths of a job:

    public static void addInputPath(Job job, Path path)
    public static void addInputPaths(Job job, String commaSeparatedPaths)
    public static void setInputPaths(Job job, Path... inputPaths)
    public static void setInputPaths(Job job, String commaSeparatedPaths)

addInputPath() and addInputPaths() add one (or a batch of) input paths and may be called repeatedly. In the source, each call to addInputPath() converts the Path to its string form, splices it onto the existing value with a "," separator (the previous value is not overwritten), and saves the result in the INPUT_DIR property (mapreduce.input.fileinputformat.inputdir) of the job configuration; addInputPaths() simply calls addInputPath() in a loop. setInputPaths() comes as two overloads that "set" one (or a batch of) input paths in a single call; each call overwrites the previous value of INPUT_DIR.

An input path can denote a file or a directory (in which case all files under that directory are used as input data); paths may contain wildcards, and multiple paths can be joined with commas. Note that the contents of a directory (its subdirectories) are not processed recursively. In fact, a directory should contain only files: if it contains subdirectories, those subdirectories are treated as files, which leads to exceptions. If recursion is not needed, a file pattern or a filter (see below) can tell FileInputFormat to select only the files in the given directory; if recursive processing is needed, set the property mapreduce.input.fileinputformat.input.dir.recursive to true.

Sometimes we also need to "filter" certain files out of the input paths, which is done by giving FileInputFormat a filter via setInputPathFilter(). In the source, this merely records the name of a concrete PathFilter implementation class (the details of PathFilter are outside our scope here) in the job configuration property PATHFILTER_CLASS (mapreduce.input.pathFilter.class). Even if no filter is set, FileInputFormat applies a default filter that excludes hidden files in the directories; if a filter is set, it does not replace the default one but forms a filter chain with it, in which the default filter comes first and is executed first. To sum up, both the input paths and the filter of FileInputFormat can also be set directly through the corresponding configuration properties when required.
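As a usage illustration, the following driver fragment is a minimal sketch of the three mechanisms just described. The job name, the example paths, and the LogFileFilter class (with its ".log" rule) are invented for this sketch; the FileInputFormat static methods and the property names are the real Hadoop API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class InputPathExample {

        // Hypothetical filter for the example: accept only "*.log" files.
        // It is chained after FileInputFormat's default hidden-file filter.
        public static class LogFileFilter implements PathFilter {
            @Override
            public boolean accept(Path path) {
                return path.getName().endsWith(".log");
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "input-path-demo");

            // Appends to mapreduce.input.fileinputformat.inputdir,
            // joined with ","; may be called repeatedly.
            FileInputFormat.addInputPath(job, new Path("/data/2023"));
            FileInputFormat.addInputPaths(job, "/data/2024,/data/2025");

            // One-shot "set" semantics: overwrites everything added above.
            FileInputFormat.setInputPaths(job,
                    new Path("/data/a"), new Path("/data/b"));

            // Recurse into subdirectories instead of failing on them.
            job.getConfiguration().setBoolean(
                    "mapreduce.input.fileinputformat.input.dir.recursive", true);

            // Records the filter class in mapreduce.input.pathFilter.class.
            FileInputFormat.setInputPathFilter(job, LogFileFilter.class);
        }
    }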
FileInputFormat input splits

FileInputFormat generates splits in its getSplits() method. The core logic is as follows.

1. Determine the minimum and maximum split sizes.

Minimum: the larger of getFormatMinSplitSize() and getMinSplitSize(). getFormatMinSplitSize() is an instance method of FileInputFormat whose default return value is 1, that is, 1 byte. getMinSplitSize() is determined by the property mapreduce.input.fileinputformat.split.minsize, whose default is also 1 byte. Unless there is a special need, the minimum is therefore 1 byte. Some file formats impose a lower bound on the split size, such as SequenceFile (see the SequenceFile documentation for details); in that case the FileInputFormat subclass overrides getFormatMinSplitSize() to meet the format's needs.

Maximum: the return value of getMaxSplitSize(), determined by the property mapreduce.input.fileinputformat.split.maxsize, whose default is Long.MAX_VALUE.

2. Obtain the status information of all files in the input paths.

3. Iterate over the files in the input paths and generate splits for each file.

For each file, split generation can be roughly summarized in these steps: obtain the file's path and length (1); if the length is 0, produce one "empty" split (5); if the length is not 0, obtain the file's block locations (2); if the file format cannot be split, produce a single split covering the entire file (4); if the file format can be split, generate splits for the file (3).

Whether a file format supports splitting is decided by FileInputFormat's isSplitable() method, whose default return value is true, i.e. splitting is supported by default. Depending on the actual application scenario, a subclass of FileInputFormat can override this method to return false and prevent splitting, so that each map task processes the complete data of one file.

Before looking at the splitting step in detail, we need to introduce a class, FileSplit, which represents one file split. Its fields are:

file: the path (name) of the file the split refers to;
start: the split's starting offset within the file;
length: the split's size;
hosts: from the starting offset, the split size, and the file's block information, we can work out which blocks the split covers (a split can be larger than an HDFS block); hosts stores the replica locations (host names) of the first of those blocks, and by default a block has 3 replicas. MapReduce schedules map tasks based on these locations;
hostInfos: compared with the plain host names in hosts, it additionally records whether a replica is cached in the host's memory.

Next, the process of forming the splits of one file, in roughly five steps.

Step 1: obtain the file's block size, blockSize (note that different files can have different block sizes). Then compute the file's split size (splitSize) from blockSize, minSize, and maxSize with the formula:

splitSize = max(minSize, min(maxSize, blockSize))
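As a quick self-contained illustration of this formula (the demo class name and the example numbers are made up; computeSplitSize() mirrors the helper of the same name in the Hadoop source):

    // Sketch of the split-size arithmetic described above.
    public class SplitSizeDemo {

        // splitSize = max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // e.g. a 128 MB HDFS block
            long minSize = 1L;                   // default minimum: 1 byte
            long maxSize = Long.MAX_VALUE;       // default maximum

            // With the defaults, the split size equals the block size.
            System.out.println(computeSplitSize(blockSize, minSize, maxSize));
        }
    }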
Step 2: check whether the remaining (not yet sliced) part of the file still satisfies the splitting condition, i.e. whether ((double) bytesRemaining) / splitSize > SPLIT_SLOP holds. The initial value of bytesRemaining is the file length; SPLIT_SLOP is fixed at 1.1 and cannot be modified. In other words, slicing only continues while the remainder of the file is more than 1.1 times the split size.

Step 3: obtain the block corresponding to the split. Depending on the split size, a split may span several blocks; the replica locations of the first of these blocks are used as the split's storage locations. The starting offset of the nth split within the file is offset = (n - 1) * splitSize. getBlockIndex() finds the block of the file whose byte range contains exactly this offset and returns that block's index in the block list (blkLocations).

Step 4: construct a FileSplit from the block information at that index. As the FileSplit fields show, a FileSplit stores no data; it merely refers to data through the file name, starting offset, and size, while the replica locations of the corresponding block serve as the split's storage locations for map task scheduling.

Steps 2, 3, and 4 are executed in a loop until the remaining part of the file no longer satisfies the splitting condition.

Step 5: build one final FileSplit for the remaining part of the file.
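Assembled, steps 2 through 5 look roughly like the fragment below, a simplified reconstruction of the per-file loop in the Hadoop 2.x getSplits() source (the zero-length and non-splittable branches are omitted; file, path, blkLocations, minSize, maxSize, SPLIT_SLOP, and splits are set up by the surrounding method):

    long length = file.getLen();
    long blockSize = file.getBlockSize();
    long splitSize = computeSplitSize(blockSize, minSize, maxSize); // step 1

    long bytesRemaining = length;
    // Step 2: keep slicing while the remainder exceeds SPLIT_SLOP (1.1)
    // times the split size.
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        // Step 3: index of the block containing this split's start offset.
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        // Step 4: a FileSplit holds only (path, offset, length, hosts), no data.
        splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                blkLocations[blkIndex].getHosts()));
        bytesRemaining -= splitSize;
    }
    // Step 5: one final split for the remaining tail of the file.
    if (bytesRemaining != 0) {
        int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
        splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                blkLocations[blkIndex].getHosts()));
    }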