Hadoop in-depth learning: MapTask in detail

Last Update:2015-04-03 Source: Internet

Author: User


This article looks mainly at the internal implementation of MapTask.

         Overall Execution Process

As shown in the figure, the entire processing flow of MapTask is divided into five phases:
         Read phase: RecordReader parses the data in the InputSplit into key/value pairs.
         Map phase: Each key/value pair parsed by RecordReader is handed to the map() method, which generates new key/value pairs.
         Collect phase: The key/value pairs newly generated in map() are written to the in-memory ring buffer through OutputCollector.collect().
         Spill phase: When usage of the ring buffer reaches a certain threshold, the data is written to the local disk as a spill file. The data is sorted locally before it is written and, if the configuration requires it, compressed.
         Combine phase: Once all the data has been processed, all temporary spill files are merged once into a single data file.

Next we look more closely at the three most important phases of the process: collect, spill, and combine.
         Collect Process
After a new key/value pair is generated by map() in the previous phase, OutputCollector.collect(key, value) is called. Inside that method, Partitioner.getPartition() is called to obtain the partition number of the record. The triple <key,value,partition> is then passed to MapOutputBuffer.collect() for further processing.
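How a partition number can be derived from a key is easiest to see in code. The following is a minimal sketch, assuming the job uses plain hash partitioning (the approach taken by Hadoop's default HashPartitioner). The class name HashPartitionSketch is hypothetical and the method is a standalone function, not the real Partitioner interface.

```java
// Hypothetical standalone helper; it mirrors hash partitioning but is not Hadoop's Partitioner API.
final class HashPartitionSketch {
    // Maps a key to a partition in [0, numPartitions). The mask clears the sign bit
    // so that a negative hashCode() cannot produce a negative partition number.
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry"};
        for (String k : keys) {
            System.out.println(k + " -> partition " + getPartition(k, 4));
        }
    }
}
```

Because the result depends only on the key and the number of partitions, every record with the same key ends up in the same partition, which is what later allows the data to be sorted and merged partition by partition.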
MapOutputBuffer uses an internal ring buffer to temporarily hold the user's output data. When buffer usage reaches a certain threshold, the SpillThread thread spills the data in the buffer to the local disk, and once all the data has been processed, all the spill files are merged so that only one file is finally generated. The design of this data buffer directly affects MapTask's write efficiency.
The ring buffer allows the collect phase and the spill phase to run in parallel.
Internally, MapOutputBuffer uses a two-level index structure involving three ring-shaped memory buffers: kvoffsets, kvindices, and kvbuffer. The size of this ring buffer can be set with io.sort.mb; the default is 100 MB. The three structures are:

         kvoffsets: the offset index array, which holds the offset of each key/value pair in kvindices. Each key/value pair occupies one int in the kvoffsets array and three ints in the kvindices array (the partition number, the start position of the key, and the start position of the value).
         When kvoffsets usage exceeds io.sort.spill.percent (80% by default), the SpillThread thread is triggered to spill the data to disk.
         kvindices: the position index array, which holds the start positions of each key and value in the data buffer kvbuffer.
         kvbuffer: the data buffer, which actually stores the key/value pairs. By default it can use 95% of io.sort.mb. When its usage exceeds io.sort.spill.percent, the SpillThread thread is triggered to spill the data to disk.
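The arithmetic behind these figures can be sketched as follows. This is a rough illustration under stated assumptions, not the actual MapOutputBuffer code: the class SortBufferSizing and its plan() method are hypothetical, and the assumption that the 5% of io.sort.mb not used by kvbuffer is shared by kvoffsets and kvindices is inferred from the 95% figure above. Percentages are passed as whole numbers to keep the arithmetic exact.

```java
// Hypothetical helper that reproduces the sizing arithmetic described above.
final class SortBufferSizing {
    // One int in kvoffsets plus three ints in kvindices per key/value pair.
    static final int INTS_PER_RECORD = 4;
    static final int INDEX_BYTES_PER_RECORD = INTS_PER_RECORD * 4;

    // Returns {kvbuffer bytes, record capacity, kvbuffer spill threshold in bytes,
    //          kvoffsets spill threshold in records}.
    static long[] plan(int sortMb, int indexPercent, int spillPercent) {
        long total = (long) sortMb << 20;                   // io.sort.mb in bytes
        long indexBytes = total * indexPercent / 100;       // share assumed for the index arrays
        indexBytes -= indexBytes % INDEX_BYTES_PER_RECORD;  // keep whole records only
        long kvbufferBytes = total - indexBytes;
        long recordCapacity = indexBytes / INDEX_BYTES_PER_RECORD;
        long bufferSpillAt = kvbufferBytes * spillPercent / 100;
        long recordSpillAt = recordCapacity * spillPercent / 100;
        return new long[] {kvbufferBytes, recordCapacity, bufferSpillAt, recordSpillAt};
    }

    public static void main(String[] args) {
        long[] p = plan(100, 5, 80);
        System.out.println("kvbuffer bytes:      " + p[0]);
        System.out.println("record capacity:     " + p[1]);
        System.out.println("spill at bytes:      " + p[2]);
        System.out.println("spill at records:    " + p[3]);
    }
}
```

Under these assumptions, the default settings give a kvbuffer of 99,614,720 bytes with room for 327,680 index entries, and a spill starts once either 79,691,776 bytes or 262,144 records are in use. A job that emits many small records would therefore hit the record threshold long before the byte threshold.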

         Spill Process
During the collect phase, when the data in the in-memory ring buffer reaches a certain threshold, a spill operation is triggered and part of the data is spilled to the local disk. The SpillThread thread is in effect the consumer of the kvbuffer buffer. Its main code is as follows:

Java code

spillLock.lock();
try {
    while (true) {
        // Tell the collecting thread that the previous spill has finished
        spillDone.signal();
        // Wait until there is data to spill
        while (kvstart == kvend) {
            spillReady.await();
        }
        // Release the lock so that collect() can keep writing during the spill
        spillLock.unlock();
        // Sort the data in kvbuffer and spill it to the local disk
        sortAndSpill();
        spillLock.lock();
        // Reset the pointers and prepare for the next spill
        if (bufend < bufindex && bufindex < bufstart) {
            bufvoid = kvbuffer.length;
        }
        kvstart = kvend;
        bufstart = bufend;
    }
} finally {
    spillLock.unlock();
}
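The loop above depends on fields of MapOutputBuffer and cannot be run on its own. To show the same lock-and-condition handshake in a runnable form, here is a simplified sketch. It is an analogue, not Hadoop's implementation: the class SpillHandshakeSketch is hypothetical, it copies each batch into a list instead of moving pointers around a ring buffer, and it "spills" by sorting the batch into an in-memory list rather than writing a file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Simplified stand-in for the collect/spill handshake; not Hadoop's MapOutputBuffer.
final class SpillHandshakeSketch {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillReady = spillLock.newCondition(); // a batch is waiting
    private final Condition spillDone = spillLock.newCondition();  // the last batch is finished

    private final int threshold;
    private final List<Integer> buffer = new ArrayList<>(); // stands in for kvbuffer
    private List<Integer> pending;                           // batch handed to the spill thread
    private boolean closed;
    final List<List<Integer>> spills = new ArrayList<>();    // stands in for the spill files

    SpillHandshakeSketch(int threshold) { this.threshold = threshold; }

    // Producer side: called for every record emitted by map().
    void collect(int record) throws InterruptedException {
        spillLock.lock();
        try {
            buffer.add(record);
            if (buffer.size() >= threshold) handOff();
        } finally {
            spillLock.unlock();
        }
    }

    // Caller must hold spillLock. Blocks only if the previous spill is still running.
    private void handOff() throws InterruptedException {
        while (pending != null) spillDone.await();
        pending = new ArrayList<>(buffer);
        buffer.clear();
        spillReady.signal();
    }

    // Flushes the remaining records and tells the spill thread to exit.
    void close() throws InterruptedException {
        spillLock.lock();
        try {
            if (!buffer.isEmpty()) handOff();
            while (pending != null) spillDone.await();
            closed = true;
            spillReady.signal();
        } finally {
            spillLock.unlock();
        }
    }

    // Consumer side: same lock/unlock shape as the SpillThread loop.
    void runSpillThread() throws InterruptedException {
        spillLock.lock();
        try {
            while (true) {
                while (pending == null && !closed) spillReady.await();
                if (pending == null) return;  // closed and nothing left to spill
                List<Integer> batch = pending;
                spillLock.unlock();           // let collect() keep running during the sort
                try {
                    Collections.sort(batch);
                } finally {
                    spillLock.lock();
                }
                spills.add(batch);
                pending = null;
                spillDone.signal();
            }
        } finally {
            spillLock.unlock();
        }
    }

    // Runs one producer and one spill thread over the given records.
    static List<List<Integer>> demo(int threshold, int... records) {
        SpillHandshakeSketch s = new SpillHandshakeSketch(threshold);
        Thread spillThread = new Thread(() -> {
            try {
                s.runSpillThread();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        spillThread.start();
        try {
            for (int r : records) s.collect(r);
            s.close();
            spillThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return s.spills;
    }
}
```

The point to notice is that the spill thread releases the lock while it sorts, so collect() blocks only when a second batch is ready before the first one has finished. That is the sense in which the collect and spill phases run in parallel.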

The internal flow of the sortAndSpill() method is as follows:
The first step is to sort the data in kvbuffer[bufstart, bufend) with the quicksort algorithm: records are ordered first by partition number and then by key. After sorting, the data is grouped by partition, and the data within each partition is ordered by key.
In the second step, the data of each partition is written, in ascending order of partition number, to a temporary file in the task's working directory. If the user has set a combiner, the data in each partition is aggregated once before it is written to the file; for example, <key1,val1> and <key1,val2> are merged into <key1,<val1,val2>>.
The third step is to write the meta-information of the partition data into the in-memory index data structure SpillRecord. The meta-information for a partition includes its offset in the temporary file, the size of the data before compression, and the size of the data after compression.
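The first two steps can be illustrated with a small sketch. The names here (SortAndSpillSketch, Rec, sortAndGroup) are hypothetical, records are held as plain objects rather than as offsets into kvbuffer, and the library's List.sort is used in place of quicksort. The grouping step only mimics the <key1,<val1,val2>> example above; a real combiner is user-defined and may aggregate differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "sort by partition, then by key" followed by grouping.
final class SortAndSpillSketch {
    static final class Rec {
        final int partition;
        final String key;
        final int value;

        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Orders records by partition number first and by key second.
    static final Comparator<Rec> BY_PARTITION_THEN_KEY = (a, b) -> {
        int c = Integer.compare(a.partition, b.partition);
        return c != 0 ? c : a.key.compareTo(b.key);
    };

    // Sorts a copy of the input and groups the values of each (partition, key) pair.
    // Each output line has the form "partition:key=[v1, v2, ...]".
    static List<String> sortAndGroup(List<Rec> input) {
        List<Rec> recs = new ArrayList<>(input);
        recs.sort(BY_PARTITION_THEN_KEY);
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < recs.size()) {
            Rec first = recs.get(i);
            List<Integer> values = new ArrayList<>();
            // After sorting, equal (partition, key) pairs are adjacent.
            while (i < recs.size()
                    && recs.get(i).partition == first.partition
                    && recs.get(i).key.equals(first.key)) {
                values.add(recs.get(i).value);
                i++;
            }
            out.add(first.partition + ":" + first.key + "=" + values);
        }
        return out;
    }
}
```

Sorting on the pair (partition, key) is what makes the single pass in the grouping loop sufficient: once the data is ordered this way, all values for a key sit next to each other and each partition forms one contiguous run that can be written out in order.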

Combine Process
When all the data for a task has been processed, MapTask merges all of that task's temporary files into one large file and generates the corresponding index file. The merge is carried out partition by partition.
Having each task produce a single file in the end avoids the overhead of opening a large number of files at the same time and of the random reads caused by many small files.
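A partition-by-partition merge of sorted spill files can be sketched as a k-way merge. This is a minimal illustration under assumptions: the class SpillMergeSketch is hypothetical, each spill is modelled as an in-memory list of sorted key lists (one per partition) rather than a file with an index, and only keys are merged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge over in-memory "spill files".
final class SpillMergeSketch {
    // Tracks the read position inside one sorted run.
    private static final class Cursor {
        final List<String> run;
        int pos;

        Cursor(List<String> run) { this.run = run; }

        String head() { return run.get(pos); }
    }

    // Merges several sorted runs of the same partition into one sorted list.
    static List<String> mergePartition(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) heap.add(new Cursor(run));
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();      // run whose next key is smallest
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) heap.add(c);
        }
        return merged;
    }

    // spills.get(s).get(p) holds the sorted keys of partition p in spill s.
    static List<List<String>> mergeSpills(List<List<List<String>>> spills, int numPartitions) {
        List<List<String>> out = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            List<List<String>> runs = new ArrayList<>();
            for (List<List<String>> spill : spills) {
                runs.add(spill.get(p));
            }
            out.add(mergePartition(runs));
        }
        return out;
    }
}
```

Because every spill is already sorted within each partition, the merge only ever compares the current head of each run, so the final file keeps the same layout as a single spill: partitions in ascending order, keys sorted inside each partition.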

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.