Hadoop Source Analysis (MapTask helper class, II)

Source: Internet
Author: User

With the in-memory structure of the Mapper output and its on-disk storage structure covered above, we can now analyze MapOutputBuffer in detail. We start with the member variables. The first to be initialized are the job configuration, job, and the statistics reporter, reporter. From the configuration, MapOutputBuffer obtains the local file systems (localFs and rfs), the number of reducers, and the partitioner.
SpillRecord is the in-memory abstraction of the file spill{spill number}.out.index (the in-memory data and the file data differ only by the final checksum). It maintains a series of IndexRecord entries, for example:


IndexRecord has 3 fields: startOffset, the record offset; rawLength, the original length; and partLength, the actual length (which may reflect compression).

SpillRecord maintains a series of IndexRecord entries and provides methods to add a record (there is no delete operation, because none is required), get a record, write to a file, and read from a file (through the constructor).
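
A simplified sketch may make this pair of classes more concrete. It is not Hadoop's actual code: the class names SimpleIndexRecord and SimpleSpillRecord are hypothetical, only the three field names come from the text, indexing the entries by partition number is an assumption, and the checksum and the file read/write paths are left out.

```java
import java.nio.ByteBuffer;
import java.nio.LongBuffer;

/** One index entry: where a block of output starts and how long it is. */
final class SimpleIndexRecord {
    final long startOffset; // offset of the data in the spill file
    final long rawLength;   // length before compression
    final long partLength;  // length actually written (may be compressed)

    SimpleIndexRecord(long startOffset, long rawLength, long partLength) {
        this.startOffset = startOffset;
        this.rawLength = rawLength;
        this.partLength = partLength;
    }
}

/** Fixed-size table of index entries; entries can be added and read, never deleted. */
final class SimpleSpillRecord {
    private static final int LONGS_PER_RECORD = 3;
    private final LongBuffer entries;

    /** Assumption: one entry per partition. */
    SimpleSpillRecord(int numPartitions) {
        entries = ByteBuffer
            .allocate(numPartitions * LONGS_PER_RECORD * Long.BYTES)
            .asLongBuffer();
    }

    int size() {
        return entries.capacity() / LONGS_PER_RECORD;
    }

    void putIndex(SimpleIndexRecord rec, int partition) {
        int pos = partition * LONGS_PER_RECORD;
        entries.put(pos, rec.startOffset);
        entries.put(pos + 1, rec.rawLength);
        entries.put(pos + 2, rec.partLength);
    }

    SimpleIndexRecord getIndex(int partition) {
        int pos = partition * LONGS_PER_RECORD;
        return new SimpleIndexRecord(
            entries.get(pos), entries.get(pos + 1), entries.get(pos + 2));
    }
}
```

Keeping the entries in one flat buffer of longs means the whole table can be written out in a single pass, which is presumably why the file form needs nothing extra beyond the checksum.
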
Next comes the processing related to the output buffer kvbuffer, the buffer record index kvindices, and kvoffsets, the working array used to sort the record index. The following figure helps to illustrate this piece of code.





This part relies on 3 configuration parameters. io.sort.mb is the total size of kvbuffer, kvindices and kvoffsets in MB. The default is 100, that is 100 MB, which is the most memory MapOutputBuffer will use for storage.

io.sort.record.percent is the fraction of that space occupied by kvindices and kvoffsets (the default is 0.05).

From the previous analysis we already know how kvindices and kvoffsets are laid out: for N records they occupy 4N*4 bytes. From that relationship and the value of io.sort.record.percent, we can work out how many records kvindices and kvoffsets can hold and allocate the appropriate space. io.sort.spill.percent is the occupancy threshold: when the output buffer, or the record count of kvindices and kvoffsets, reaches that fraction, a spill is started and the buffered records are written to disk. softBufferLimit and softRecordLimit are the corresponding limits, in bytes and in records respectively.
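
The arithmetic can be sketched as follows. This is an illustration of the relationships described above, not a copy of the Hadoop source: the class name SortBufferSizing and its field layout are hypothetical, and the spill percentage of 0.80 used in main is an assumed value, since the text does not give the default.

```java
/** Sketch: derive buffer sizes and spill thresholds from the three parameters. */
final class SortBufferSizing {
    // Accounting bytes per record: 4 ints of 4 bytes each, as stated in the text.
    static final int RECSIZE = 4 * 4;

    final int kvbufferBytes;    // bytes left for serialized keys and values
    final int recordCapacity;   // records that kvindices/kvoffsets can hold
    final int softBufferLimit;  // byte threshold that starts a spill
    final int softRecordLimit;  // record-count threshold that starts a spill

    SortBufferSizing(int sortMb, double recordPercent, double spillPercent) {
        int maxMemUsage = sortMb << 20;                       // MB to bytes
        int accountingBytes = (int) (maxMemUsage * recordPercent);
        accountingBytes -= accountingBytes % RECSIZE;         // whole records only
        kvbufferBytes = maxMemUsage - accountingBytes;
        recordCapacity = accountingBytes / RECSIZE;
        softBufferLimit = (int) (kvbufferBytes * spillPercent);
        softRecordLimit = (int) (recordCapacity * spillPercent);
    }

    public static void main(String[] args) {
        // 100 MB and 0.05 are the defaults quoted in the text; 0.80 is assumed.
        SortBufferSizing s = new SortBufferSizing(100, 0.05, 0.80);
        System.out.println("kvbuffer bytes:    " + s.kvbufferBytes);
        System.out.println("record capacity:   " + s.recordCapacity);
        System.out.println("soft buffer limit: " + s.softBufferLimit);
        System.out.println("soft record limit: " + s.softRecordLimit);
    }
}
```

With the two defaults from the text, 5% of 100 MB is 5,242,880 bytes of accounting space, which at 16 bytes per record leaves room for 327,680 records.
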


The <key, value> pairs written to the buffer are serialized by a serializer; its initialization follows that of the output buffer above. Next come the initialization of some counters, of the compression codec if one is configured, and of the combiner, if any, along with some configuration for the combiner's work.
Finally, SpillThread is started. This thread checks the in-memory output buffer and spills its contents to disk when certain conditions are met. This is a standard producer-consumer model: MapTask's collect method is the producer and SpillThread is the consumer. Synchronization between them is done through spillLock (a ReentrantLock) and the two condition variables on spillLock (spillDone and spillReady).


Look at the producer first. The main steps of MapOutputBuffer.collect are:
  • Report progress and check the arguments (<K,V> must conform to the Mapper's output conventions);
  • spillLock.lock(), entering the critical section;
  • If the spill condition is reached, set the variables, notify SpillThread with spillReady.signal(), and wait for the spill to end with spillDone.await();
  • spillLock.unlock();
  • Output the key and value and update kvindices and kvoffsets (note that collect is synchronized; the key and value are output separately, and they also occupy a contiguous region of the output buffer).
The three variables kvstart, kvend and kvindex are important in deciding whether a spill is needed and whether a spill has finished. kvstart is the index of the first valid record. kvindex is the position of the next record. kvend plays a special role: normally kvstart==kvend, but when a spill starts kvend is assigned the value of kvindex, and when the spill ends its value is assigned to kvstart, so that kvstart==kvend again.

In other words, if kvstart is not equal to kvend, the system is spilling; otherwise kvstart==kvend and the system is in its normal working state.

In fact, the code contains many tests of kvstart==kvend.
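
The handshake between collect and the spill thread can be sketched with a ReentrantLock and two conditions. This is a minimal illustration under stated assumptions, not the Hadoop implementation: the class SpillHandshake, its capacity parameter, the closed flag, the demo method and the "no free slot left" spill condition are all hypothetical, only the index bookkeeping is modeled, and the spill runs while holding the lock for simplicity.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of the collect/spill handshake; index bookkeeping only, no real data. */
final class SpillHandshake {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillReady = spillLock.newCondition();
    private final Condition spillDone = spillLock.newCondition();

    private final int capacity; // number of record slots (hypothetical)
    private int kvstart = 0;    // first valid record
    private int kvend = 0;      // equals kvstart unless a spill is in progress
    private int kvindex = 0;    // next slot to write
    private int spills = 0;
    private boolean closed = false;

    SpillHandshake(int capacity) {
        if (capacity < 2) throw new IllegalArgumentException("capacity must be >= 2");
        this.capacity = capacity;
    }

    /** Producer: called once per record. */
    void collect() {
        spillLock.lock();
        try {
            int kvnext = (kvindex + 1) % capacity;
            if (kvnext == kvstart) {          // assumed spill condition: no free slot left
                kvend = kvindex;              // kvstart != kvend now means "spilling"
                spillReady.signal();
                while (kvstart != kvend) {    // wait until the spill thread is done
                    spillDone.awaitUninterruptibly();
                }
            }
            kvindex = kvnext;                 // the record itself would be written here
        } finally {
            spillLock.unlock();
        }
    }

    /** Consumer: body of the spill thread. */
    void spillLoop() {
        spillLock.lock();
        try {
            while (true) {
                while (kvstart == kvend && !closed) {
                    spillReady.awaitUninterruptibly();
                }
                if (kvstart == kvend) return; // closed and nothing pending
                spills++;                     // records [kvstart, kvend) would go to disk here
                kvstart = kvend;              // back to the normal state
                spillDone.signalAll();
            }
        } finally {
            spillLock.unlock();
        }
    }

    void close() {
        spillLock.lock();
        try {
            closed = true;
            spillReady.signal();
        } finally {
            spillLock.unlock();
        }
    }

    /** Runs producer and consumer together; returns the number of spills. */
    static int demo(int capacity, int records) {
        SpillHandshake h = new SpillHandshake(capacity);
        Thread spillThread = new Thread(h::spillLoop);
        spillThread.start();
        for (int i = 0; i < records; i++) h.collect();
        h.close();
        try {
            spillThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return h.spills;
    }
}
```

Both sides re-check kvstart and kvend in a loop after waking, so a signal sent before the other thread starts waiting is not lost. Calling SpillHandshake.demo(4, 10) pushes ten records through four slots and triggers three spills.
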


Below we discuss how kvstart, kvend and kvindex cooperate in the following situations. At initialization, they are all assigned the value 0.

 



Note the relationship between kvindex and kvnext: the modulo operation implements the circular buffer.

 

As before, kvnext is computed first. Most of the time, kvend==kvstart at this point (not pictured).

If the spill condition is met, the value of kvindex is assigned to kvend (so kvend is no longer equal to kvstart). From the relative order of kvstart and kvend we can tell which part of the array holds the records (the left side shows the case kvstart < kvend, the right side the other case). When the spill ends, the value of kvend is assigned to kvstart, and kvend==kvstart holds again. Note also that kvindex does not change during this process: the new record is still written at the position kvindex points to, and then kvindex = kvnext moves kvindex to the next available position.
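
The two cases, kvstart < kvend and the wrapped one, can be handled by the same modulo step. The sketch below is illustrative only: the class RingIndex and its methods are hypothetical helpers, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: index arithmetic for a circular record array. */
final class RingIndex {
    /** Slot that follows kvindex in a ring of the given length. */
    static int next(int kvindex, int length) {
        return (kvindex + 1) % length;
    }

    /** Slots covered by a spill: kvstart inclusive to kvend exclusive, wrapping if needed. */
    static List<Integer> spillSlots(int kvstart, int kvend, int length) {
        List<Integer> slots = new ArrayList<>();
        for (int i = kvstart; i != kvend; i = next(i, length)) {
            slots.add(i);
        }
        return slots;
    }
}
```

In a ring of 6 slots, spillSlots(1, 4, 6) covers slots 1, 2, 3, while spillSlots(4, 2, 6) wraps around and covers 4, 5, 0, 1.
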
Once the process above is understood, especially the roles of kvstart, kvend and kvindex, note that the buffer used for the <key, value> output goes through a similar process.

While handling the <key, value> output, collect also deals with MapBufferTooSmallException. It is raised when the serialized value is too large to be placed in the buffer in one piece; in that case we need to call spillSingleRecord for special handling.
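
The fallback path can be pictured roughly as below. This is a sketch under stated assumptions, not Hadoop's code: the class TooLargeRecordDemo, the stand-in exception BufferTooSmallException, and the directSpills counter are hypothetical, and a normal spill is reduced to resetting the write position.

```java
/** Sketch: a record that can never fit in the buffer bypasses it entirely. */
final class TooLargeRecordDemo {
    /** Hypothetical stand-in for Hadoop's MapBufferTooSmallException. */
    static final class BufferTooSmallException extends RuntimeException {
        BufferTooSmallException(String message) {
            super(message);
        }
    }

    private final byte[] buffer;
    private int used = 0;
    int directSpills = 0; // records written without going through the buffer

    TooLargeRecordDemo(int bufferSize) {
        buffer = new byte[bufferSize];
    }

    /** Copies a serialized record into the buffer, or throws if it can never fit. */
    private void write(byte[] serialized) {
        if (serialized.length > buffer.length) {
            throw new BufferTooSmallException(
                serialized.length + " bytes exceed buffer of " + buffer.length);
        }
        if (used + serialized.length > buffer.length) {
            used = 0; // stand-in for a normal spill that frees the buffer
        }
        System.arraycopy(serialized, 0, buffer, used, serialized.length);
        used += serialized.length;
    }

    /** Producer entry point: buffer the record, or fall back to a direct spill. */
    void collect(byte[] serialized) {
        try {
            write(serialized);
        } catch (BufferTooSmallException e) {
            spillSingleRecord(serialized);
        }
    }

    private void spillSingleRecord(byte[] serialized) {
        // A real implementation would write this one record straight to a spill file.
        directSpills++;
    }
}
```

The point of the separate path is that no amount of ordinary spilling frees enough room for such a record, so waiting for the buffer to drain would never succeed.
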
