MR Summary (III): MapReduce Component Customization


A custom InputFormat mainly includes two components: InputSplit and RecordReader.

  • InputSplit defines the number of map tasks and determines the most suitable execution nodes (locations).

  • RecordReader reads data records from the input file and hands them to the Mapper for processing (see the sketch below).
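The article does not show a RecordReader, so here is a minimal sketch of that side, as an illustration only: a hypothetical WholeFileRecordReader that emits each file as a single record, with the file path as key and the raw bytes as value.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Minimal sketch (assumption): each file becomes exactly one record.
public class WholeFileRecordReader extends RecordReader<Text, BytesWritable> {
  private FileSplit split;
  private Configuration conf;
  private boolean processed = false;          // one record per file
  private final Text key = new Text();
  private final BytesWritable value = new BytesWritable();

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context) {
    this.split = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    if (processed) return false;              // already emitted the record
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    byte[] contents = new byte[(int) split.getLength()];
    try (FSDataInputStream in = fs.open(path)) {
      IOUtils.readFully(in, contents, 0, contents.length);
    }
    key.set(path.toString());
    value.set(contents, 0, contents.length);
    processed = true;
    return true;
  }

  @Override public Text getCurrentKey() { return key; }
  @Override public BytesWritable getCurrentValue() { return value; }
  @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
  @Override public void close() { /* nothing to release */ }
}
```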

A custom split must inherit from the abstract class InputSplit and define the length and locations of its input. A split's locations tell the scheduler where to place its executor (that is, which TaskTracker to select).
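A minimal custom split might look like the following sketch. The class name SimpleSplit and its fields are illustrative assumptions; getLength() and getLocations() are the two abstract methods, and concrete splits such as FileSplit also implement Writable so they can be serialized to tasks.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;

// Minimal sketch (assumption): a split covering a byte range of one file.
public class SimpleSplit extends InputSplit implements Writable {
  private String path;      // file backing this split
  private long start;       // byte offset where the split begins
  private long length;      // number of bytes in the split
  private String[] hosts;   // preferred execution nodes

  public SimpleSplit() {}   // required for deserialization

  public SimpleSplit(String path, long start, long length, String[] hosts) {
    this.path = path; this.start = start;
    this.length = length; this.hosts = hosts;
  }

  @Override
  public long getLength() { return length; }        // used to size/sort splits

  @Override
  public String[] getLocations() { return hosts; }  // used for task placement

  @Override
  public void write(DataOutput out) throws IOException {
    Text.writeString(out, path);
    out.writeLong(start);
    out.writeLong(length);
    // hosts are deliberately not serialized; locations are only consulted
    // on the scheduler side, which is also how FileSplit behaves
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    path = Text.readString(in);
    start = in.readLong();
    length = in.readLong();
    hosts = new String[0];
  }
}
```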

JobTracker's split-scheduling algorithm is roughly as follows:

  1. Obtain available map slots through TaskTracker heartbeats.
  2. Among the waiting splits, find those whose data is local to the available nodes.
  3. Submit the splits to the chosen TaskTrackers.
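Schematically, the locality matching in step 2 looks something like the following; this is illustrative pseudocode in Java, not actual JobTracker source, and all names are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not actual JobTracker code.
public class LocalityFirstAssigner {
  /** Pair each node with a free map slot with a waiting split
      that is local to it, if one exists. */
  static Map<String, String> assign(List<String> freeNodes,
                                    Map<String, List<String>> splitLocations) {
    Map<String, String> placement = new HashMap<>(); // split -> node
    List<String> waiting = new ArrayList<>(splitLocations.keySet());
    for (String node : freeNodes) {                  // step 1: free slots
      for (Iterator<String> it = waiting.iterator(); it.hasNext(); ) {
        String split = it.next();
        if (splitLocations.get(split).contains(node)) { // step 2: local match
          placement.put(split, node);                   // step 3: submit
          it.remove();
          break;                                        // slot consumed
        }
      }
    }
    return placement; // splits still in 'waiting' must run remotely
  }
}
```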

Depending on the storage mechanism and execution strategy, splits differ in size and location. On HDFS, for example, a split corresponds to a physical data block, and the split's locations are the set of hosts that store replicas of that block.

Below is how FileInputFormat works:

  1. FileSplit inherits from InputSplit and records file information such as the start offset and length of a data block within the file.
  2. It obtains the set of data blocks for each input file.
  3. It creates one split per block: the split's length equals the block size, and its locations are the block's hosts. The split also records the file path, the offset within the file, and the length.

The split-creation code in FileInputFormat is as follows:

```java
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
  Stopwatch sw = new Stopwatch().start();
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);

  // generate splits
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        FileSystem fs = path.getFileSystem(job.getConfiguration());
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                      blkLocations[blkIndex].getHosts(),
                      blkLocations[blkIndex].getCachedHosts()));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                     blkLocations[blkIndex].getHosts(),
                     blkLocations[blkIndex].getCachedHosts()));
        }
      } else { // not splitable
        splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                    blkLocations[0].getCachedHosts()));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  // Save the number of input files for metrics/loadgen
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits;
}
```

Creating splits thus involves three steps: first, obtain the FileStatus of each input file from the Job object; second, obtain the block information of each file; third, if the file is splitable, cut it into splits based on the computed split size, otherwise emit a single split for the whole file.
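The split size used in step 3 clamps the block size between the configured minimum and maximum. In Hadoop's FileInputFormat it is computed as:

```java
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  // Use the block size unless it falls outside [minSize, maxSize]
  return Math.max(minSize, Math.min(maxSize, blockSize));
}
```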

Case study: a compute-intensive InputFormat

A common kind of MapReduce program is the compute-intensive program.

In a compute-intensive MR program, every input key-value pair goes through a complex computation. The defining characteristic is that the map function spends far longer processing each record than fetching it, by at least an order of magnitude; a facial-recognition program is a typical example.

If the default FileInputFormat is used for this type of application, some machines end up with excessive CPU load while others sit relatively idle. (Ganglia can be used to monitor and confirm this.)

By default, FileInputFormat creates splits based on data locality. For a compute-intensive program, however, data locality may not be the right criterion: splits should instead be allocated according to the compute capacity of all available servers.

Therefore, we need to override the split-creation function.

Next, we override SequenceFileInputFormat to demonstrate how to implement this requirement.

ComputeIntensiveSequenceFileInputFormat extends SequenceFileInputFormat and overrides the getSplits() function:

```java
// Override the split-creation function
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
  String[] servers = getActiveServersList(job);
  if (servers == null)
    return null;
  long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
  long maxSize = getMaxSplitSize(job);
  List<InputSplit> splits = new ArrayList<InputSplit>();
  List<FileStatus> files = listStatus(job);
  int currentServer = 0;
  for (FileStatus file : files) {
    Path path = file.getPath();
    long length = file.getLen();
    if ((length != 0) && isSplitable(job, path)) {
      long blockSize = file.getBlockSize();
      long splitSize = computeSplitSize(blockSize, minSize, maxSize);
      long bytesRemaining = length;
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        // Assign each split to the next server in round-robin order
        splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
            new String[] { servers[currentServer] }));
        currentServer = getNextServer(currentServer, servers.length);
        bytesRemaining -= splitSize;
      }
      if (bytesRemaining != 0) {
        splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
            new String[] { servers[currentServer] }));
        currentServer = getNextServer(currentServer, servers.length);
      }
    } else if (length != 0) {
      splits.add(new FileSplit(path, 0, length,
          new String[] { servers[currentServer] }));
      currentServer = getNextServer(currentServer, servers.length);
    }
  }
  // Save the number of input files in the job-conf
  job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
  return splits;
}

// Obtain the list of active server names from the cluster status
private String[] getActiveServersList(JobContext context) {
  String[] servers = null;
  try {
    JobClient jc = new JobClient(new JobConf(context.getConfiguration()));
    ClusterStatus status = jc.getClusterStatus(true);
    Collection<String> atc = status.getActiveTrackerNames();
    servers = new String[atc.size()];
    int s = 0;
    for (String serverInfo : atc) {
      // Tracker names look like "tracker_host:port"; extract the host name
      StringTokenizer st = new StringTokenizer(serverInfo, ":");
      String trackerName = st.nextToken();
      StringTokenizer st1 = new StringTokenizer(trackerName, "_");
      st1.nextToken();
      servers[s++] = st1.nextToken();
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  return servers;
}

// Advance to the next server, wrapping around at the end of the list
private static int getNextServer(int current, int max) {
  current++;
  if (current >= max)
    current = 0;
  return current;
}
```

This class inherits from SequenceFileInputFormat and overrides the getSplits() function. The split computation itself is the same as in FileInputFormat, but the physical locality of the data is replaced by the list of available servers. Two helper functions do the work:
getActiveServersList() queries the cluster status and builds the list of active server names.
getNextServer() returns the next server in round-robin order.
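Wiring the new format into a job is then a one-line change in the driver. A minimal sketch, assuming a hypothetical FaceRecognitionDriver (mapper and reducer setup omitted):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FaceRecognitionDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "compute-intensive job");
    job.setJarByClass(FaceRecognitionDriver.class);
    // Use the capacity-based format instead of the locality-based default
    job.setInputFormatClass(ComputeIntensiveSequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```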

Optimization:
Is the above solution perfect? The answer is no.
By discarding data locality entirely, it forces more data to be moved across the network, which puts pressure on network I/O and hurts overall performance.

So we combine the two strategies: place as many tasks as possible on nodes that hold their data, and distribute the remaining tasks across the other nodes.
The following class implements this scheme: ComputeIntensiveLocalizedSequenceFileInputFormat:

```java
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
  List<InputSplit> originalSplits = super.getSplits(job);
  String[] servers = getActiveServersList(job);
  if (servers == null)
    return null;
  List<InputSplit> splits = new ArrayList<InputSplit>(originalSplits.size());
  int numSplits = originalSplits.size();
  int currentServer = 0;
  for (int i = 0; i < numSplits; i++, currentServer = getNextServer(
      currentServer, servers.length)) {
    String server = servers[currentServer]; // server under consideration
    boolean replaced = false;
    // First, try to give this server a split whose data it holds locally
    for (InputSplit split : originalSplits) {
      FileSplit fs = (FileSplit) split;
      for (String l : fs.getLocations()) {
        if (l.equals(server)) {
          splits.add(new FileSplit(fs.getPath(), fs.getStart(),
              fs.getLength(), new String[] { server }));
          originalSplits.remove(split);
          replaced = true;
          break;
        }
      }
      if (replaced)
        break;
    }
    // No local split found: hand this server an arbitrary remaining split
    if (!replaced) {
      FileSplit fs = (FileSplit) originalSplits.get(0);
      splits.add(new FileSplit(fs.getPath(), fs.getStart(),
          fs.getLength(), new String[] { server }));
      originalSplits.remove(0);
    }
  }
  return splits;
}
```

Here, the first step calls the parent class (FileInputFormat) to obtain splits, which preserves data locality. For each server, we first try to assign it a split that is local to it; splits with no local server are then handed out to the remaining servers in turn.

Summary:

Customizing MapReduce input formats mainly involves two components: InputFormat and RecordReader. This article described how InputFormat works and how to override it; choose the split strategy according to the characteristics of your workload.

Original article: MR Summary (III): MapReduce Component Customization. Thanks to the original author for sharing.
