Custom InputFormat
InputFormat mainly includes two components: InputSplit and RecordReader.
- InputSplit defines the number of map tasks and determines the most appropriate execution nodes (locations).
- RecordReader reads data records from the input file and hands them to the Mapper for processing.
A custom split must inherit from the abstract class InputSplit and define the length and locations of the input. The locations of a split tell the scheduler where to place the split's executor (that is, how to select an appropriate TaskTracker).
JobTracker's split-scheduling algorithm is roughly as follows:
- Learn about available map slots through TaskTracker heartbeats.
- From the splits waiting to be processed, find one whose data is local to an available node.
- Submit the split to that TaskTracker for execution.
Depending on the storage mechanism and execution policy, the size and locations of a split differ. On HDFS, for example, a split corresponds to a physical data block, and the split's locations are the set of hosts where that block is physically stored.
Here is how FileInputFormat works:
- FileSplit inherits from InputSplit and records the file information, such as the start position and length of each chunk within the file.
- FileInputFormat obtains the set of data blocks for each input file.
- It then creates the splits: a split's length equals the computed split size, and its locations are the locations of the underlying block. Each split also carries the file path, the offset within the file, and the length.
The code for creating a part in FileInputFormat is as follows:
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
    Stopwatch sw = new Stopwatch().start();
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
            BlockLocation[] blkLocations;
            if (file instanceof LocatedFileStatus) {
                blkLocations = ((LocatedFileStatus) file).getBlockLocations();
            } else {
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                blkLocations = fs.getFileBlockLocations(file, 0, length);
            }
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                    bytesRemaining -= splitSize;
                }

                if (bytesRemaining != 0) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                }
            } else { // not splitable
                splits.add(makeSplit(path, 0, length,
                        blkLocations[0].getHosts(),
                        blkLocations[0].getCachedHosts()));
            }
        } else {
            // Create empty hosts array for zero length files
            splits.add(makeSplit(path, 0, length, new String[0]));
        }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
        LOG.debug("Total # of splits generated by getSplits: " + splits.size()
                + ", TimeTaken: " + sw.elapsedMillis());
    }
    return splits;
}
Creating the splits mainly involves the following steps: 1. Obtain the status information (FileStatus) of each input file from the Job object. 2. Obtain the block locations of each file. 3. If a file is splittable, split it according to the computed split size; otherwise create a single split covering the whole file.
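The split size used in step 3 is the block size clamped between the configured minimum and maximum. The following is a minimal standalone sketch of that formula (mirroring the logic of Hadoop's computeSplitSize, without any Hadoop dependencies):

```java
// Minimal sketch of how FileInputFormat derives the split size:
// clamp the HDFS block size between the configured min and max split sizes.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block
        // With the default min (1) and an unbounded max, the split equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
        // Raising minSize above the block size forces larger (and thus fewer) splits.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
    }
}
```

Lowering maxSize below the block size has the opposite effect: it produces more, smaller splits and therefore more map tasks.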
Case study: a computing-intensive InputFormat
A common kind of MapReduce program is the computing-intensive program.
In a computing-intensive MR program, each input key-value pair requires a complex computing algorithm. The defining feature is that each map invocation spends far longer (at least an order of magnitude) processing a record than reading it. A facial-recognition program is one example.
If the default FileInputFormat is used for this type of application, the CPU load on some machines becomes too high while others remain relatively idle. (Ganglia can be used to monitor and analyze this.)
By default, FileInputFormat creates splits based on the locality of the data. For computing-intensive programs, however, data locality may not be the right criterion. So what should we change? We should allocate and create splits according to the computing power of all available servers.
Therefore, we need to override the split-creation function.
Next we extend SequenceFileInputFormat to demonstrate how to implement this requirement. ComputeIntensiveSequenceFileInputFormat inherits from SequenceFileInputFormat and overrides the getSplits() function:
// Override the split-creation function
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
    String[] servers = getActiveServersList(job);
    if (servers == null)
        return null;
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);
    int currentServer = 0;
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if ((length != 0) && isSplitable(job, path)) {
            long blockSize = file.getBlockSize();
            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            long bytesRemaining = length;
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                        new String[] { servers[currentServer] }));
                currentServer = getNextServer(currentServer, servers.length);
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining != 0) {
                // Assign the remaining tail of the file as its own split
                splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                        new String[] { servers[currentServer] }));
                currentServer = getNextServer(currentServer, servers.length);
            }
        } else if (length != 0) {
            splits.add(new FileSplit(path, 0, length,
                    new String[] { servers[currentServer] }));
            currentServer = getNextServer(currentServer, servers.length);
        }
    }
    // Save the number of input files in the job-conf
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    return splits;
}

// Obtain the list of active servers in the cluster
private String[] getActiveServersList(JobContext context) {
    String[] servers = null;
    try {
        JobClient jc = new JobClient((JobConf) context.getConfiguration());
        ClusterStatus status = jc.getClusterStatus(true);
        Collection<String> atc = status.getActiveTrackerNames();
        servers = new String[atc.size()];
        int s = 0;
        for (String serverInfo : atc) {
            // Tracker names look like "tracker_hostname:port"; extract the hostname
            StringTokenizer st = new StringTokenizer(serverInfo, ":");
            String trackerName = st.nextToken();
            StringTokenizer st1 = new StringTokenizer(trackerName, "_");
            st1.nextToken();
            servers[s++] = st1.nextToken();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return servers;
}

// Select the next server (round-robin)
private static int getNextServer(int current, int max) {
    current++;
    if (current >= max)
        current = 0;
    return current;
}
This class inherits from SequenceFileInputFormat and overrides the getSplits() function. The split computation is the same as in FileInputFormat, but the physical locality of the data is replaced by available server resources. There are two main helper functions:
- getActiveServersList() queries the cluster status and builds the list of active server names.
- getNextServer() selects the next server in round-robin order.
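The round-robin cycling performed by getNextServer() can be sketched in isolation. The following standalone snippet (plain Java, no Hadoop dependencies; the class and method names are illustrative) shows how splits end up evenly distributed across servers:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the round-robin cycling performed by getNextServer():
// advance the index and wrap around at the end of the server list.
public class RoundRobinDemo {
    static int getNextServer(int current, int max) {
        current++;
        if (current >= max)
            current = 0;
        return current;
    }

    // Assign a number of splits to servers in round-robin order.
    static List<String> assign(int numSplits, String[] servers) {
        List<String> owners = new ArrayList<>();
        int current = 0;
        for (int i = 0; i < numSplits; i++) {
            owners.add(servers[current]);
            current = getNextServer(current, servers.length);
        }
        return owners;
    }

    public static void main(String[] args) {
        System.out.println(assign(5, new String[] { "node1", "node2", "node3" }));
        // → [node1, node2, node3, node1, node2]
    }
}
```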
Optimization:
Is the above solution perfect? The answer is no.
By discarding the physical locality of the data, the scheme above causes more data to be transferred over the network, putting pressure on network I/O and hurting performance.
So we can combine the two strategies: place as many tasks as possible on nodes local to their data, and distribute the remaining tasks across the other nodes.
The following is the program that implements this scheme: ComputeIntensiveLocalizedSequenceFileInputFormat:
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> originalSplits = super.getSplits(job);
    String[] servers = getActiveServersList(job);
    if (servers == null)
        return null;
    List<InputSplit> splits = new ArrayList<InputSplit>(originalSplits.size());
    int numSplits = originalSplits.size();
    int currentServer = 0;
    for (int i = 0; i < numSplits; i++,
            currentServer = getNextServer(currentServer, servers.length)) {
        String server = servers[currentServer]; // server under consideration
        boolean replaced = false;
        // First, try to give this server a split whose data is local to it
        for (InputSplit split : originalSplits) {
            FileSplit fs = (FileSplit) split;
            for (String l : fs.getLocations()) {
                if (l.equals(server)) {
                    splits.add(new FileSplit(fs.getPath(), fs.getStart(),
                            fs.getLength(), new String[] { server }));
                    originalSplits.remove(split);
                    replaced = true;
                    break;
                }
            }
            if (replaced)
                break;
        }
        // No local split found: hand the server an arbitrary remaining split
        if (!replaced) {
            FileSplit fs = (FileSplit) originalSplits.get(0);
            splits.add(new FileSplit(fs.getPath(), fs.getStart(),
                    fs.getLength(), new String[] { server }));
            originalSplits.remove(0);
        }
    }
    return splits;
}
Here, the first step is to call the parent class's getSplits() (ultimately FileInputFormat's implementation) to obtain splits with data-local locations. Then, for each server, it first tries to assign a split whose data is local to that server; splits for which no local server remains are assigned in round-robin order.
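The local-first assignment logic can be illustrated with a Hadoop-free sketch. Here the map from split names to replica hosts is a hypothetical stand-in for FileSplit.getLocations(), and all names are illustrative:

```java
import java.util.*;

// Simplified, Hadoop-free sketch of the local-first strategy described above:
// each server in turn first takes a split stored locally; splits with no
// matching server left are handed out in round-robin order.
public class LocalFirstAssignment {
    static Map<String, String> assign(Map<String, List<String>> splitLocations,
                                      String[] servers) {
        Map<String, String> result = new LinkedHashMap<>();
        List<String> remaining = new ArrayList<>(splitLocations.keySet());
        int current = 0;
        while (!remaining.isEmpty()) {
            String server = servers[current];
            String chosen = null;
            for (String split : remaining) { // prefer a split local to this server
                if (splitLocations.get(split).contains(server)) {
                    chosen = split;
                    break;
                }
            }
            if (chosen == null)
                chosen = remaining.get(0);   // fall back: any remaining split
            remaining.remove(chosen);
            result.put(chosen, server);
            current = (current + 1) % servers.length; // round-robin over servers
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> locs = new LinkedHashMap<>();
        locs.put("split0", Arrays.asList("node1"));
        locs.put("split1", Arrays.asList("node2"));
        locs.put("split2", Arrays.asList("node9")); // replica host not in cluster
        System.out.println(assign(locs, new String[] { "node1", "node2" }));
        // → {split0=node1, split1=node2, split2=node1}
    }
}
```

In this run, split0 and split1 are placed locally, while split2 (whose only replica host is unavailable) falls back to the round-robin choice and incurs a remote read.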
Summary:
Customizing MapReduce input mainly involves two components: InputFormat and RecordReader. This article described how InputFormat works and how to override it, choosing a split strategy based on the characteristics of the workload.
Original article: MR Summary (3) - MapReduce component customization. Thanks to the original author for sharing.