MapReduce Source Code Analysis: JobSplitWriter

Source: Internet
Author: User

JobSplitWriter is used by the job client to write the split-related files: the split data file job.split and the split metadata file job.splitmetainfo. It has two static member variables, as follows:

  // Split version, currently 1
  private static final int splitVersion = JobSplit.META_SPLIT_VERSION;
  // Split file header: the byte array of the string "SPL" in UTF-8 encoding
  private static final byte[] SPLIT_FILE_HEADER;
A static initializer block is also provided to initialize SPLIT_FILE_HEADER, with the following code:

  // Static initializer: encodes the string "SPL" as UTF-8 bytes and assigns them to SPLIT_FILE_HEADER
  static {
    try {
      SPLIT_FILE_HEADER = "SPL".getBytes("UTF-8");
    } catch (UnsupportedEncodingException u) {
      throw new RuntimeException(u);
    }
  }
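
The header bytes are easy to inspect with plain Java. The following is a minimal sketch, not Hadoop code: the class name SplitHeaderDemo and its header() method are made up for illustration. It uses StandardCharsets.UTF_8 instead of the charset name "UTF-8", which sidesteps the checked UnsupportedEncodingException that the initializer above has to catch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Hadoop.
final class SplitHeaderDemo {

    // Same content as the split file header: the UTF-8 bytes of "SPL".
    private static final byte[] HEADER = "SPL".getBytes(StandardCharsets.UTF_8);

    // Returns a copy so callers cannot modify the shared array.
    static byte[] header() {
        return HEADER.clone();
    }

    public static void main(String[] args) {
        // 'S' = 83, 'P' = 80, 'L' = 76
        System.out.println(Arrays.toString(header())); // [83, 80, 76]
    }
}
```
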
JobSplitWriter does its work in the createSplitFiles() method, which has three overloads. We first look at public static <T extends InputSplit> void createSplitFiles(Path jobSubmitDir, Configuration conf, FileSystem fs, T[] splits). The code is as follows:

  // Create the split files
  public static <T extends InputSplit> void createSplitFiles(Path jobSubmitDir,
      Configuration conf, FileSystem fs, T[] splits)
  throws IOException, InterruptedException {
    // Call createFile() to create the split file and obtain the FSDataOutputStream instance out.
    // The path is jobSubmitDir/job.split, where jobSubmitDir is
    // {path given by yarn.app.mapreduce.am.staging-dir}/{user the job belongs to}/.staging/{job ID}
    FSDataOutputStream out = createFile(fs,
        JobSubmissionFiles.getJobSplitFile(jobSubmitDir), conf);
    // Call writeNewSplits() to write the split data to the split file
    // and obtain the split metadata, the SplitMetaInfo array info
    SplitMetaInfo[] info = writeNewSplits(conf, splits, out);
    // Close the output stream
    out.close();
    // Call writeJobSplitMetaInfo() to write the split metadata to the split metadata file
    writeJobSplitMetaInfo(fs, JobSubmissionFiles.getJobSplitMetaFile(jobSubmitDir),
        new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION), splitVersion,
        info);
  }
The logic of the createSplitFiles() method is clear, broadly as follows:

1. Call the createFile() method to create the split file and obtain the file system data output stream, the FSDataOutputStream instance out. The corresponding path is jobSubmitDir/job.split, where jobSubmitDir is {path given by the parameter yarn.app.mapreduce.am.staging-dir}/{user the job belongs to}/.staging/{job ID};

2. Call the writeNewSplits() method to write the split data to the split file and obtain the split metadata, the SplitMetaInfo array info;

3. Close the output stream out;

4. Call the writeJobSplitMetaInfo() method to write the split metadata to the split metadata file.
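
The directory layout described in step 1 can be sketched as plain string assembly. This is an illustration only: the class SubmitDirDemo, its method names, and the sample staging directory, user, and job ID are all assumptions and do not come from Hadoop or from a real cluster.

```java
// Hypothetical demo class; every literal value in main() is made up.
final class SubmitDirDemo {

    // {staging dir}/{user}/.staging/{job ID}
    static String jobSubmitDir(String stagingDir, String user, String jobId) {
        return stagingDir + "/" + user + "/.staging/" + jobId;
    }

    // Split data file inside the submit directory.
    static String splitFile(String jobSubmitDir) {
        return jobSubmitDir + "/job.split";
    }

    // Split metadata file inside the submit directory.
    static String splitMetaFile(String jobSubmitDir) {
        return jobSubmitDir + "/job.splitmetainfo";
    }

    public static void main(String[] args) {
        String dir = jobSubmitDir("/staging", "alice", "job_0001");
        System.out.println(splitFile(dir));     // /staging/alice/.staging/job_0001/job.split
        System.out.println(splitMetaFile(dir)); // /staging/alice/.staging/job_0001/job.splitmetainfo
    }
}
```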

Let's take a look at the createFile() method, with the following code:

  private static FSDataOutputStream createFile(FileSystem fs, Path splitFile,
      Configuration job) throws IOException {
    // Call FileSystem.create() to obtain the FSDataOutputStream instance out.
    // The permission is JobSubmissionFiles.JOB_FILE_PERMISSION, i.e. 0644, rw-r--r--
    FSDataOutputStream out = FileSystem.create(fs, splitFile,
        new FsPermission(JobSubmissionFiles.JOB_FILE_PERMISSION));
    // Get the replication factor from the parameter mapreduce.client.submit.file.replication;
    // the default is 10 if the parameter is not configured
    int replication = job.getInt(Job.SUBMIT_REPLICATION, 10);
    // Set the replication factor of splitFile via fs.setReplication()
    fs.setReplication(splitFile, (short) replication);
    // Call writeSplitHeader() to write the split file header
    writeSplitHeader(out);
    // Return the output stream out
    return out;
  }
First, call the create() method of the HDFS file system class FileSystem to obtain the file system data output stream, the FSDataOutputStream instance out, with the permission JobSubmissionFiles.JOB_FILE_PERMISSION, namely 0644, rw-r--r--;

Second, get the replication factor replication from the parameter mapreduce.client.submit.file.replication, which defaults to 10 if it is not configured;

Third, call the setReplication() method of the FileSystem instance fs to set the replication factor of splitFile to that value (10 by default);

Fourth, call the writeSplitHeader() method to write the split file header;

Finally, return the file system data output stream out.
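
The lookup-with-default in the second step behaves like the sketch below, which uses java.util.Properties as a stand-in for Hadoop's Configuration. The class ReplicationLookupDemo and its method are hypothetical; only the parameter name and the default of 10 come from the text above.

```java
import java.util.Properties;

// Hypothetical demo class; Properties stands in for Hadoop's Configuration.
final class ReplicationLookupDemo {

    static final String KEY = "mapreduce.client.submit.file.replication";

    // Returns the configured value, or 10 when the key is absent,
    // narrowed to short as setReplication() expects.
    static short submitReplication(Properties conf) {
        int replication = Integer.parseInt(conf.getProperty(KEY, "10"));
        return (short) replication;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(submitReplication(conf)); // 10

        conf.setProperty(KEY, "3");
        System.out.println(submitReplication(conf)); // 3
    }
}
```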

The writeSplitHeader() method writes the header information to the split file, as follows:

  private static void writeSplitHeader(FSDataOutputStream out)
  throws IOException {
    // Write a byte[] to out: the string "SPL" in UTF-8 encoding
    out.write(SPLIT_FILE_HEADER);
    // Write an int to out: the split version number, currently 1
    out.writeInt(splitVersion);
  }
This is very simple: the output stream out first writes a byte[] containing "SPL" in UTF-8 encoding, and then writes an int, the split version number, which is currently 1.
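
To see what those two writes produce, the sketch below repeats them against an in-memory java.io.DataOutputStream, the class that FSDataOutputStream extends. The class SplitFileHeaderDemo is hypothetical and exists only for illustration; it shows that the header occupies 7 bytes, 3 for "SPL" and 4 for the big-endian int.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
final class SplitFileHeaderDemo {

    // Writes "SPL" followed by a 4-byte big-endian int, in the same order as writeSplitHeader().
    static byte[] writeHeader(int version) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write("SPL".getBytes(StandardCharsets.UTF_8));
            out.writeInt(version);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : writeHeader(1)) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim()); // 53 50 4C 00 00 00 01
    }
}
```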

Next, we look at the writeNewSplits() method, which writes the split data to the split file and returns the split metadata as a SplitMetaInfo array info. The code is as follows:

  @SuppressWarnings("unchecked")
  private static <T extends InputSplit>
  SplitMetaInfo[] writeNewSplits(Configuration conf,
      T[] array, FSDataOutputStream out)
  throws IOException, InterruptedException {
    // Construct a SplitMetaInfo array info of the same size as array;
    // array is the split array passed in
    SplitMetaInfo[] info = new SplitMetaInfo[array.length];
    if (array.length != 0) { // if array contains data
      // Create the serialization factory, a SerializationFactory instance
      SerializationFactory factory = new SerializationFactory(conf);
      int i = 0;
      // Get the maximum number of block locations from the parameter
      // mapreduce.job.max.split.locations; the default is 10 if it is not configured
      int maxBlockLocations = conf.getInt(MRConfig.MAX_BLOCK_LOCATIONS_KEY,
          MRConfig.MAX_BLOCK_LOCATIONS_DEFAULT);
      // Get the current position of the output stream out via getPos()
      long offset = out.getPos();
      // Iterate over each split in array
      for (T split : array) {
        // Position of out before this split is written
        long prevCount = out.getPos();
        // Write the class name of split to out as a string
        Text.writeString(out, split.getClass().getName());
        // Get the serializer for the split's class
        Serializer<T> serializer =
            factory.getSerializer((Class<T>) split.getClass());
        // Open the serializer on the output stream out
        serializer.open(out);
        // Serialize split to out
        serializer.serialize(split);
        // Position of out after this split is written
        long currCount = out.getPos();
        // Get the location information via split.getLocations()
        String[] locations = split.getLocations();
        if (locations.length > maxBlockLocations) {
          LOG.warn("Max block location exceeded for split: "
              + split + " splitsize: " + locations.length +
              " maxsize: " + maxBlockLocations);
          locations = Arrays.copyOf(locations, maxBlockLocations);
        }
        // Build the split's metadata and store it in info:
        // offset is the start of this split in the split file, the data length
        // is split.getLength(), and the location information is locations
        info[i++] =
            new JobSplit.SplitMetaInfo(
                locations, offset,
                split.getLength());
        // Advance offset by the number of bytes written for this split
        offset += currCount - prevCount;
      }
    }
    // Return the split metadata, the SplitMetaInfo array info
    return info;
  }
The logic of the writeNewSplits() method is relatively clear, broadly as follows:

1. Construct a SplitMetaInfo array info of the same size as array, where array is the split array passed in;

2. If array contains data:

2.1. Create the serialization factory, a SerializationFactory instance factory;

2.2. Get the maximum number of block locations maxBlockLocations from the parameter mapreduce.job.max.split.locations, which defaults to 10 if it is not configured;

2.3. Get the current position offset of the output stream out via its getPos() method;

2.4. Iterate over each split in array:

2.4.1. Get the current position prevCount of the output stream out via its getPos() method;

2.4.2. Write a string to the output stream out whose content is the class name of split;

2.4.3. Get the serializer instance serializer for the split's class;

2.4.4. Open serializer on the output stream out;

2.4.5. Serialize split to the output stream out;

2.4.6. Get the current position currCount of the output stream out via its getPos() method;

2.4.7. Get the location information locations via the getLocations() method of split;

2.4.8. Make sure the length of locations does not exceed maxBlockLocations, truncating it if it does;

2.4.9. Construct the metadata for split and store it at the next position in info: offset is the starting position of the current split in the split file, the data length is split.getLength(), and the location information is locations;

2.4.10. Increase offset by the number of bytes just written for the current split;

3. Return the split metadata, the SplitMetaInfo array info.
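
The offset bookkeeping in steps 2.3 through 2.4.10 can be tried out with plain java.io classes. The sketch below is not the Hadoop implementation: the class SplitOffsetDemo is hypothetical, DataOutputStream.size() stands in for FSDataOutputStream.getPos(), and writeUTF() stands in for the class name plus serialized split (its on-disk encoding differs from what Hadoop writes). The 7 placeholder bytes imitate the header that is already in the stream when writeNewSplits() starts.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class, not part of Hadoop.
final class SplitOffsetDemo {

    // Writes each record and returns the offset at which each one starts.
    static long[] writeRecords(DataOutputStream out, String[] records) {
        long[] startOffsets = new long[records.length];
        try {
            int i = 0;
            long offset = out.size();            // stand-in for out.getPos()
            for (String record : records) {
                long prevCount = out.size();     // position before the record
                out.writeUTF(record);            // stand-in for class name + serialized split
                long currCount = out.size();     // position after the record
                startOffsets[i++] = offset;      // the record starts at the running offset
                offset += currCount - prevCount; // advance by the bytes just written
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return startOffsets;
    }

    static long[] demo() {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.write(new byte[7]); // placeholder for the 7-byte file header
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // writeUTF emits a 2-byte length plus the characters: 6, 4 and 3 bytes here.
        return writeRecords(out, new String[] {"aaaa", "bb", "c"});
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // [7, 13, 17]
    }
}
```

Because the header is written before the loop starts, the first recorded offset is 7 rather than 0, which matches the fact that createFile() writes the header before writeNewSplits() is called.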

To see how a split object is serialized, we take FileSplit as an example. Its write() method is as follows:

  @Override
  public void write(DataOutput out) throws IOException {
    // Write the full file path
    Text.writeString(out, file.toString());
    // Write the starting position of the split in the file
    out.writeLong(start);
    // Write the length of the split in the file
    out.writeLong(length);
  }
This is relatively simple: it writes three pieces of information, namely the full file path, the starting position of the split in the file, and the length of the split in the file.
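
The same three-field layout can be written and read back with standard java.io streams. The sketch below is an approximation under stated assumptions: FileSplitRecordDemo is a hypothetical class, writeUTF()/readUTF() stand in for Hadoop's Text.writeString()/Text.readString() (which use a different length prefix), and the sample path and sizes are made up. It illustrates that a reader must consume the fields in exactly the order they were written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop.
final class FileSplitRecordDemo {
    final String file;
    final long start;
    final long length;

    FileSplitRecordDemo(String file, long start, long length) {
        this.file = file;
        this.start = start;
        this.length = length;
    }

    // Field order: path, start, length.
    void write(DataOutput out) throws IOException {
        out.writeUTF(file);
        out.writeLong(start);
        out.writeLong(length);
    }

    // Reads the fields back in the same order.
    static FileSplitRecordDemo read(DataInput in) throws IOException {
        String file = in.readUTF();
        long start = in.readLong();
        long length = in.readLong();
        return new FileSplitRecordDemo(file, start, length);
    }

    static FileSplitRecordDemo roundTrip(FileSplitRecordDemo original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            return read(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample values: a 128 MB split at the start of a file.
        FileSplitRecordDemo copy = roundTrip(new FileSplitRecordDemo("/data/input.txt", 0L, 134217728L));
        System.out.println(copy.file + " " + copy.start + " " + copy.length); // /data/input.txt 0 134217728
    }
}
```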

In summary, the contents of the split file job.split are:

1. File header: "SPL" + int version number 1;

2. Split class information: the class name of the split, as a String;

3. Split data (for a FileSplit): the full file path as a String + the starting position of the split in the file as a long + the length of the split in the file as a long.

Items 2 and 3 are repeated for every split.

Finally, when the split metadata is constructed, an object of JobSplit's static inner class SplitMetaInfo is created. It holds the split's location information locations, the starting position offset of the split within the split file, and the split length split.getLength().
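
Before the SplitMetaInfo object is built, writeNewSplits() caps the locations array at maxBlockLocations. A minimal sketch of that cap follows; the class LocationCapDemo and the host names are hypothetical, and only the Arrays.copyOf() call mirrors the code shown earlier.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of Hadoop.
final class LocationCapDemo {

    // Keeps at most maxBlockLocations entries, dropping the rest.
    static String[] capLocations(String[] locations, int maxBlockLocations) {
        if (locations.length > maxBlockLocations) {
            return Arrays.copyOf(locations, maxBlockLocations);
        }
        return locations;
    }

    public static void main(String[] args) {
        // Twelve made-up host names, capped at 10.
        String[] hosts = new String[12];
        for (int i = 0; i < hosts.length; i++) {
            hosts[i] = "host" + (i + 1);
        }
        String[] capped = capLocations(hosts, 10);
        System.out.println(capped.length);            // 10
        System.out.println(capped[capped.length - 1]); // host10
    }
}
```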

Next, let's look at how the split metadata file is generated by examining the writeJobSplitMetaInfo() method. The code is as follows:

  // Write the job's split metadata
  private static void writeJobSplitMetaInfo(FileSystem fs, Path filename,
      FsPermission p, int splitMetaInfoVersion,
      JobSplit.SplitMetaInfo[] allSplitMetaInfo)
  throws IOException {
    // write the splits meta-info to a file for the job tracker
    // Call FileSystem.create() to create the split metadata file and obtain the FSDataOutputStream instance out.
    // The file path is jobSubmitDir/job.splitmetainfo, where jobSubmitDir is
    // {path given by yarn.app.mapreduce.am.staging-dir}/{user the job belongs to}/.staging/{job ID}.
    // The permission is JobSubmissionFiles.JOB_FILE_PERMISSION, i.e. 0644, rw-r--r--
    FSDataOutputStream out =
        FileSystem.create(fs, filename, p);
    // Write the metadata file header: the byte[] of the string "META-SPL" in UTF-8 encoding
    out.write(JobSplit.META_SPLIT_FILE_HEADER);
    // Write the split metadata version number splitMetaInfoVersion, currently 1
    WritableUtils.writeVInt(out, splitMetaInfoVersion);
    // Write the number of split metadata entries, allSplitMetaInfo.length
    WritableUtils.writeVInt(out, allSplitMetaInfo.length);
    // Iterate over allSplitMetaInfo and write each SplitMetaInfo to the output stream
    for (JobSplit.SplitMetaInfo splitMetaInfo : allSplitMetaInfo) {
      splitMetaInfo.write(out);
    }
    // Close the output stream out
    out.close();
  }
The main logic of the writeJobSplitMetaInfo() method is also very clear, broadly as follows:

1. Call the create() method of the HDFS file system class FileSystem to create the split metadata file and obtain the file system data output stream, the FSDataOutputStream instance out. The corresponding file path is jobSubmitDir/job.splitmetainfo, where jobSubmitDir is {path given by the parameter yarn.app.mapreduce.am.staging-dir}/{user the job belongs to}/.staging/{job ID}. The corresponding permission is JobSubmissionFiles.JOB_FILE_PERMISSION, namely 0644, rw-r--r--;

2. Write the split metadata file header, the byte[] of the string "META-SPL" in UTF-8 encoding;

3. Write the split metadata version number splitMetaInfoVersion, currently 1;

4. Write the number of split metadata entries, which is the length of the SplitMetaInfo array, allSplitMetaInfo.length;

5. Iterate over the SplitMetaInfo array allSplitMetaInfo and write each SplitMetaInfo to the output stream;

6. Close the output stream out.
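
Steps 2 through 4 can be sketched without Hadoop on the classpath. In the code below, the class MetaInfoHeaderDemo is hypothetical, and writeVLong() is a re-implementation of the variable-length integer scheme as the author of this sketch understands Hadoop's WritableUtils to work (values from -112 to 127 take one byte, larger ones a marker byte plus the significant bytes). Treat that encoding as an assumption and check it against the Hadoop source before relying on it.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
final class MetaInfoHeaderDemo {

    // Assumed variable-length encoding: one byte for small values,
    // otherwise a marker byte followed by the significant bytes, high byte first.
    static void writeVLong(DataOutputStream out, long value) throws IOException {
        if (value >= -112 && value <= 127) {
            out.writeByte((int) value);
            return;
        }
        int len = -112;
        if (value < 0) {
            value = ~value; // one's complement, so the loop below sees a non-negative number
            len = -120;
        }
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            len--;
        }
        out.writeByte(len);
        int byteCount = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = byteCount; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.writeByte((int) ((value >> shift) & 0xFF));
        }
    }

    // Header "META-SPL", then the version, then the number of entries.
    static byte[] writeMetaHeader(int version, int splitCount) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write("META-SPL".getBytes(StandardCharsets.UTF_8));
            writeVLong(out, version);
            writeVLong(out, splitCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(writeMetaHeader(1, 3).length);   // 10 = 8 + 1 + 1
        System.out.println(writeMetaHeader(1, 300).length); // 12 = 8 + 1 + 3
    }
}
```
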
Now let's look at how JobSplit.SplitMetaInfo is serialized and written to the file. The write() method of JobSplit.SplitMetaInfo is as follows:

    public void write(DataOutput out) throws IOException {
      // Write the number of split locations to the split metadata file
      WritableUtils.writeVInt(out, locations.length);
      // Iterate over the locations and write each one to the split metadata file
      for (int i = 0; i < locations.length; i++) {
        Text.writeString(out, locations[i]);
      }
      // Write the starting position of the split in the split file
      WritableUtils.writeVLong(out, startOffset);
      // Write the split size
      WritableUtils.writeVLong(out, inputDataLength);
    }
The metadata for each split therefore consists of the number of split locations, the split locations themselves, the starting position of the split in the split file job.split, and the split size.

Summary

JobSplitWriter is used by the job client to write the split-related files: the split data file job.split and the split metadata file job.splitmetainfo. The split data file job.split stores, for each split, the corresponding HDFS file path and the split's starting position and length in that HDFS file. The split metadata file job.splitmetainfo stores, for each split, information such as its locations, its starting position in the split data file job.split, and its size.

job.split file contents: file header + split + split + ... + split

File header: "SPL" + version number 1

Split: split class + split data, where split class = the class name of the split as a String, and split data = the full HDFS file path as a String + the starting position of the split in the HDFS file as a long + the length of the split in the HDFS file as a long

job.splitmetainfo file contents: file header + number of split metadata entries + split metadata + split metadata + ... + split metadata

File header: "META-SPL" + version number 1

Number of split metadata entries: the number of split metadata entries that follow

Split metadata: number of split locations + split locations + starting position in the split file job.split + split size
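
One plausible use of the starting position stored in the metadata is to jump straight to a single split's record in the split data file. The sketch below illustrates that idea with in-memory streams only. It is not how Hadoop reads these files: the class SplitSeekDemo, its methods, and the record contents are hypothetical, and writeUTF()/readUTF() stand in for Hadoop's own string and split serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Hadoop.
final class SplitSeekDemo {

    // Builds header + records and stores each record's start offset in startOffsets.
    static byte[] buildSplitFile(String[] records, long[] startOffsets) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.write("SPL".getBytes(StandardCharsets.UTF_8));
            out.writeInt(1);
            for (int i = 0; i < records.length; i++) {
                startOffsets[i] = out.size(); // where this record begins
                out.writeUTF(records[i]);     // stand-in for a serialized split
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads exactly one record, starting at the given offset.
    static String readRecordAt(byte[] splitFile, long startOffset) {
        int from = (int) startOffset;
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(splitFile, from, splitFile.length - from))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String demo() {
        String[] records = {"split-A", "split-B", "split-C"};
        long[] offsets = new long[records.length];
        byte[] file = buildSplitFile(records, offsets);
        return readRecordAt(file, offsets[1]); // offsets are 7, 16, 25
    }

    public static void main(String[] args) {
        System.out.println(demo()); // split-B
    }
}
```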






