Custom InputFormat
InputFormat mainly includes two components: InputSplit and RecordReader.
- InputSplit defines the number of map tasks and determines the most appropriate execution nodes (locations).
- RecordReader reads data records from the input file and hands them to the Mapper for processing.
A custom split must inherit from the abstract class InputSplit and define the length and locations of the input. The locations of a split tell the scheduler where to place the split's executor (that is, how to select an appropriate TaskTracker).
JobTracker's split-scheduling algorithm is roughly as follows:
- Learn about available map slots through TaskTracker heartbeats.
- From the splits waiting to be processed, find one whose data is local to an available node.
- Submit the split to that TaskTracker for execution.
Depending on the storage mechanism and execution policy, the size and locations of a split differ. On HDFS, for example, a split corresponds to a physical data block, and the split's locations are the set of hosts where that block is physically stored.
Here is how FileInputFormat works:
- FileSplit inherits from InputSplit and records the file information, such as the start position and length of each chunk within the file.
- FileInputFormat obtains the set of data blocks for each input file.
- It then creates the splits: a split's length equals the computed split size, and its locations are the locations of the underlying block. Each split also carries the file path, the offset within the file, and the length.
The code for creating a part in FileInputFormat is as follows:
/**
 * Generate the list of files and make them into FileSplits.
 * @param job the job context
 * @throws IOException
 */
public List<InputSplit> getSplits(JobContext job) throws IOException {
    Stopwatch sw = new Stopwatch().start();
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if (length != 0) {
            BlockLocation[] blkLocations;
            if (file instanceof LocatedFileStatus) {
                blkLocations = ((LocatedFileStatus) file).getBlockLocations();
            } else {
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                blkLocations = fs.getFileBlockLocations(file, 0, length);
            }
            if (isSplitable(job, path)) {
                long blockSize = file.getBlockSize();
                long splitSize = computeSplitSize(blockSize, minSize, maxSize);

                long bytesRemaining = length;
                while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, splitSize,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                    bytesRemaining -= splitSize;
                }

                if (bytesRemaining != 0) {
                    int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
                    splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                            blkLocations[blkIndex].getHosts(),
                            blkLocations[blkIndex].getCachedHosts()));
                }
            } else { // not splitable
                splits.add(makeSplit(path, 0, length,
                        blkLocations[0].getHosts(),
                        blkLocations[0].getCachedHosts()));
            }
        } else {
            // Create empty hosts array for zero length files
            splits.add(makeSplit(path, 0, length, new String[0]));
        }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
        LOG.debug("Total # of splits generated by getSplits: " + splits.size()
                + ", TimeTaken: " + sw.elapsedMillis());
    }
    return splits;
}
Creating the splits mainly involves the following steps: 1. Obtain the status information (FileStatus) of each input file from the Job object. 2. Obtain the block locations of each file. 3. If a file is splittable, split it according to the computed split size; otherwise create a single split covering the whole file.
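The split size used in step 3 is the block size clamped between the configured minimum and maximum. The following is a minimal standalone sketch of that formula (mirroring the logic of Hadoop's computeSplitSize, without any Hadoop dependencies):

```java
// Minimal sketch of how FileInputFormat derives the split size:
// clamp the HDFS block size between the configured min and max split sizes.
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block
        // With the default min (1) and an unbounded max, the split equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
        // Raising minSize above the block size forces larger (and thus fewer) splits.
        System.out.println(computeSplitSize(blockSize, 256L * 1024 * 1024, Long.MAX_VALUE));
    }
}
```

Lowering maxSize below the block size has the opposite effect: it produces more, smaller splits and therefore more map tasks.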
Case study: a computing-intensive InputFormat
A common kind of MapReduce program is the computing-intensive program.
In a computing-intensive MR program, each input key-value pair requires a complex computing algorithm. The defining feature is that each map invocation spends far longer (at least an order of magnitude) processing a record than reading it. A facial-recognition program is one example.
If the default FileInputFormat is used for this type of application, the CPU load on some machines becomes too high while others remain relatively idle. (Ganglia can be used to monitor and analyze this.)
By default, FileInputFormat creates splits based on the locality of the data. For computing-intensive programs, however, data locality may not be the right criterion. So what should we change? We should allocate and create splits according to the computing power of all available servers.
Therefore, we need to override the split-creation function.
Next we extend SequenceFileInputFormat to demonstrate how to implement this requirement. ComputeIntensiveSequenceFileInputFormat inherits from SequenceFileInputFormat and overrides the getSplits() function:
// Override the split-creation function
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
    String[] servers = getActiveServersList(job);
    if (servers == null)
        return null;
    List<InputSplit> splits = new ArrayList<InputSplit>();
    List<FileStatus> files = listStatus(job);
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);
    int currentServer = 0;
    for (FileStatus file : files) {
        Path path = file.getPath();
        long length = file.getLen();
        if ((length != 0) && isSplitable(job, path)) {
            long blockSize = file.getBlockSize();
            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            long bytesRemaining = length;
            while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
                splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
                        new String[] { servers[currentServer] }));
                currentServer = getNextServer(currentServer, servers.length);
                bytesRemaining -= splitSize;
            }
            if (bytesRemaining != 0) {
                // Assign the remaining tail of the file as its own split
                splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
                        new String[] { servers[currentServer] }));
                currentServer = getNextServer(currentServer, servers.length);
            }
        } else if (length != 0) {
            splits.add(new FileSplit(path, 0, length,
                    new String[] { servers[currentServer] }));
            currentServer = getNextServer(currentServer, servers.length);
        }
    }
    // Save the number of input files in the job-conf
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    return splits;
}

// Obtain the list of active servers in the cluster
private String[] getActiveServersList(JobContext context) {
    String[] servers = null;
    try {
        JobClient jc = new JobClient((JobConf) context.getConfiguration());
        ClusterStatus status = jc.getClusterStatus(true);
        Collection<String> atc = status.getActiveTrackerNames();
        servers = new String[atc.size()];
        int s = 0;
        for (String serverInfo : atc) {
            // Tracker names look like "tracker_hostname:port"; extract the hostname
            StringTokenizer st = new StringTokenizer(serverInfo, ":");
            String trackerName = st.nextToken();
            StringTokenizer st1 = new StringTokenizer(trackerName, "_");
            st1.nextToken();
            servers[s++] = st1.nextToken();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return servers;
}

// Select the next server (round-robin)
private static int getNextServer(int current, int max) {
    current++;
    if (current >= max)
        current = 0;
    return current;
}
This class inherits from SequenceFileInputFormat and overrides the getSplits() function. The split computation is the same as in FileInputFormat, but the physical locality of the data is replaced by available server resources. There are two main helper functions:
- getActiveServersList() queries the cluster status and builds the list of active server names.
- getNextServer() selects the next server in round-robin order.
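The round-robin cycling performed by getNextServer() can be sketched in isolation. The following standalone snippet (plain Java, no Hadoop dependencies; the class and method names are illustrative) shows how splits end up evenly distributed across servers:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the round-robin cycling performed by getNextServer():
// advance the index and wrap around at the end of the server list.
public class RoundRobinDemo {
    static int getNextServer(int current, int max) {
        current++;
        if (current >= max)
            current = 0;
        return current;
    }

    // Assign a number of splits to servers in round-robin order.
    static List<String> assign(int numSplits, String[] servers) {
        List<String> owners = new ArrayList<>();
        int current = 0;
        for (int i = 0; i < numSplits; i++) {
            owners.add(servers[current]);
            current = getNextServer(current, servers.length);
        }
        return owners;
    }

    public static void main(String[] args) {
        System.out.println(assign(5, new String[] { "node1", "node2", "node3" }));
        // → [node1, node2, node3, node1, node2]
    }
}
```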
Optimization:
Is the above solution perfect? The answer is no.
By discarding the physical locality of the data, the scheme above causes more data to be transferred over the network, putting pressure on network I/O and hurting performance.
So we can combine the two strategies: place as many tasks as possible on nodes local to their data, and distribute the remaining tasks across the other nodes.
The following is the program that implements this scheme: ComputeIntensiveLocalizedSequenceFileInputFormat:
@Override
public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> originalSplits = super.getSplits(job);
    String[] servers = getActiveServersList(job);
    if (servers == null)
        return null;
    List<InputSplit> splits = new ArrayList<InputSplit>(originalSplits.size());
    int numSplits = originalSplits.size();
    int currentServer = 0;
    for (int i = 0; i < numSplits; i++,
            currentServer = getNextServer(currentServer, servers.length)) {
        String server = servers[currentServer]; // server under consideration
        boolean replaced = false;
        // First, try to give this server a split whose data is local to it
        for (InputSplit split : originalSplits) {
            FileSplit fs = (FileSplit) split;
            for (String l : fs.getLocations()) {
                if (l.equals(server)) {
                    splits.add(new FileSplit(fs.getPath(), fs.getStart(),
                            fs.getLength(), new String[] { server }));
                    originalSplits.remove(split);
                    replaced = true;
                    break;
                }
            }
            if (replaced)
                break;
        }
        // No local split found: hand the server an arbitrary remaining split
        if (!replaced) {
            FileSplit fs = (FileSplit) originalSplits.get(0);
            splits.add(new FileSplit(fs.getPath(), fs.getStart(),
                    fs.getLength(), new String[] { server }));
            originalSplits.remove(0);
        }
    }
    return splits;
}
Here, the first step is to call the parent class's getSplits() (ultimately FileInputFormat's implementation) to obtain splits with data-local locations. Then, for each server, it first tries to assign a split whose data is local to that server; splits for which no local server remains are assigned in round-robin order.
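The local-first assignment logic can be illustrated with a Hadoop-free sketch. Here the map from split names to replica hosts is a hypothetical stand-in for FileSplit.getLocations(), and all names are illustrative:

```java
import java.util.*;

// Simplified, Hadoop-free sketch of the local-first strategy described above:
// each server in turn first takes a split stored locally; splits with no
// matching server left are handed out in round-robin order.
public class LocalFirstAssignment {
    static Map<String, String> assign(Map<String, List<String>> splitLocations,
                                      String[] servers) {
        Map<String, String> result = new LinkedHashMap<>();
        List<String> remaining = new ArrayList<>(splitLocations.keySet());
        int current = 0;
        while (!remaining.isEmpty()) {
            String server = servers[current];
            String chosen = null;
            for (String split : remaining) { // prefer a split local to this server
                if (splitLocations.get(split).contains(server)) {
                    chosen = split;
                    break;
                }
            }
            if (chosen == null)
                chosen = remaining.get(0);   // fall back: any remaining split
            remaining.remove(chosen);
            result.put(chosen, server);
            current = (current + 1) % servers.length; // round-robin over servers
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> locs = new LinkedHashMap<>();
        locs.put("split0", Arrays.asList("node1"));
        locs.put("split1", Arrays.asList("node2"));
        locs.put("split2", Arrays.asList("node9")); // replica host not in cluster
        System.out.println(assign(locs, new String[] { "node1", "node2" }));
        // → {split0=node1, split1=node2, split2=node1}
    }
}
```

In this run, split0 and split1 are placed locally, while split2 (whose only replica host is unavailable) falls back to the round-robin choice and incurs a remote read.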
Summary:
Customizing MapReduce input mainly involves two components: InputFormat and RecordReader. This article described how InputFormat works and how to override it, choosing a split strategy based on the characteristics of the workload.
Original article: MR Summary (3) - MapReduce component customization. Thanks to the original author for sharing.