Hadoop in-depth learning: MapTask in detail

Source: Internet
Author: User

In this article we look at the internal implementation of MapTask.
         
         Overall Execution Process
 
The entire processing flow of MapTask is divided into five stages:
         Read stage: the RecordReader parses the data of the InputSplit into key/value pairs.
         Map stage: each key/value pair parsed by the RecordReader is handed to the map() method, which generates new key/value pairs.
         Collect stage: the key/value pairs newly generated in map() are written to the in-memory ring buffer through OutputCollector.collect().
         Spill stage: when usage of the ring buffer reaches a certain threshold, the data is written to the local disk and a spill file is generated. The data is sorted locally before it is written to the file, and it is compressed if the configuration requires it.
         Combine stage: once all the data has been processed, all temporary spill files are merged once, resulting in a single data file.
     
Next we look more closely at the three most important stages of the process: Collect, Spill, and Combine.
         Collect Process 
After map() generates a new key/value pair in the previous stage, OutputCollector.collect(key, value) is called. Inside this method, Partitioner.getPartition() is called to get the partition number of the record. The triple <key, value, partition> is then passed to MapOutputBuffer.collect() for further processing.
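To make the partitioning step concrete, here is a minimal sketch of how a partition number can be derived from a key. It follows the hash-and-modulo logic of Hadoop's default HashPartitioner, but the class name PartitionSketch and the method partitionFor are hypothetical, and plain Java strings stand in for Writable keys.

```java
// Hypothetical stand-in for Partitioner.getPartition(); not Hadoop code.
class PartitionSketch {
    // Mask off the sign bit so the result is never negative, then take the modulo.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, 4));
        }
    }
}
```

The same key always lands in the same partition, which is what lets every record for a given key reach the same reducer.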
MapOutputBuffer uses an internal ring buffer to temporarily hold the user's output data. When buffer usage reaches a certain threshold, the SpillThread thread spills the data in the buffer to the local disk. When all the data has been processed, all the spill files are merged so that only one file remains. The design of this data buffer directly affects the write efficiency of MapTask.
The ring buffer allows the collect phase and the spill phase to be processed in parallel.
MapOutputBuffer internally uses a two-level index structure involving three ring-shaped memory buffers: kvoffsets, kvindices, and kvbuffer. The total size of these buffers is set by io.sort.mb, which defaults to 100 MB. The three buffers are described below:

         kvoffsets: the offset index array, which holds the offset of each key/value pair in kvindices. Each key/value pair occupies one int in the kvoffsets array and three ints in the kvindices array (the partition number, the start position of the key, and the start position of the value).
         When kvoffsets usage exceeds io.sort.spill.percent (80% by default), the SpillThread thread is triggered to spill the data to disk.
         kvindices: the position index array, which holds the actual position of each key/value pair in the data buffer kvbuffer.
         kvbuffer: the data buffer, which actually stores the key/value data. By default it can use 95% of io.sort.mb. When usage of this buffer exceeds io.sort.spill.percent, the SpillThread thread is triggered to spill the data to disk.
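The relationship between the three arrays can be sketched as follows. This is a simplified model and not the Hadoop implementation: it is not circular, it does no bounds checking, and the class TwoLevelIndexSketch with its methods collect, keyOf, and partitionOf is hypothetical. It only illustrates how a record is found by following kvoffsets to kvindices to kvbuffer, and how either structure crossing the spill threshold can signal a spill.

```java
import java.nio.charset.StandardCharsets;

// Simplified, non-circular model of the two-level index; not the Hadoop implementation.
class TwoLevelIndexSketch {
    static final int ACCTSIZE = 3;   // ints per record in kvindices
    final byte[] kvbuffer;           // raw key/value bytes
    final int[] kvindices;           // per record: partition, key start, value start
    final int[] kvoffsets;           // per record: position of its entry in kvindices
    final double spillPercent;       // plays the role of io.sort.spill.percent
    int recordCount = 0;
    int bufindex = 0;                // next free byte in kvbuffer

    TwoLevelIndexSketch(int bufferBytes, int maxRecords, double spillPercent) {
        this.kvbuffer = new byte[bufferBytes];
        this.kvindices = new int[maxRecords * ACCTSIZE];
        this.kvoffsets = new int[maxRecords];
        this.spillPercent = spillPercent;
    }

    // Appends one record; returns true once either structure passes the threshold.
    boolean collect(int partition, String key, String value) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        int keyStart = bufindex;
        System.arraycopy(k, 0, kvbuffer, bufindex, k.length);
        bufindex += k.length;
        int valueStart = bufindex;
        System.arraycopy(v, 0, kvbuffer, bufindex, v.length);
        bufindex += v.length;

        int ind = recordCount * ACCTSIZE;
        kvindices[ind] = partition;
        kvindices[ind + 1] = keyStart;
        kvindices[ind + 2] = valueStart;
        kvoffsets[recordCount] = ind;
        recordCount++;

        return recordCount > kvoffsets.length * spillPercent
            || bufindex > kvbuffer.length * spillPercent;
    }

    // Follows kvoffsets -> kvindices -> kvbuffer to recover the key of record i.
    String keyOf(int i) {
        int ind = kvoffsets[i];
        int keyStart = kvindices[ind + 1];
        int valueStart = kvindices[ind + 2];
        return new String(kvbuffer, keyStart, valueStart - keyStart, StandardCharsets.UTF_8);
    }

    int partitionOf(int i) {
        return kvindices[kvoffsets[i]];
    }

    public static void main(String[] args) {
        TwoLevelIndexSketch buf = new TwoLevelIndexSketch(100, 10, 0.8);
        buf.collect(1, "k1", "v1");
        buf.collect(0, "k2", "v2");
        System.out.println(buf.keyOf(1) + " is in partition " + buf.partitionOf(1));
    }
}
```

One consequence of this layout is that reordering records only needs to rearrange the one-int entries in kvoffsets; the entries in kvindices and the bytes in kvbuffer can stay where they are.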

         Spill Process
During the Collect stage, when the data in the in-memory ring buffer reaches a certain threshold, a spill operation is triggered and part of the data is spilled to the local disk. The SpillThread thread is in effect the consumer of the kvbuffer buffer. Its main code is as follows:

Java code

    spillLock.lock();
    try {
        while (true) {
            // Tell the collecting thread that the previous spill has finished
            spillDone.signal();
            // Wait until there is a region of the buffer ready to be spilled
            while (kvstart == kvend) {
                spillReady.await();
            }
            // Release the lock so that collect can continue during the spill
            spillLock.unlock();
            try {
                // Sort and spill the data in the buffer kvbuffer to the local disk
                sortAndSpill();
            } finally {
                spillLock.lock();
            }
            // Reset individual pointers and prepare for next spill
            if (bufend < bufindex && bufindex < bufstart) {
                bufvoid = kvbuffer.length;
            }
            kvstart = kvend;
            bufstart = bufend;
        }
    } finally {
        spillLock.unlock();
    }
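The loop above is a fragment that depends on MapOutputBuffer's fields, so it cannot be run on its own. The following self-contained toy shows the same lock-and-two-conditions handshake between a collecting thread and a spill thread. The names spillLock, spillReady, and spillDone follow the text; everything else (pending, handOff, finish, run, and the use of string batches instead of a byte buffer) is a hypothetical simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Toy model of the collect/spill handshake; not the Hadoop implementation.
class SpillHandshakeSketch {
    private final ReentrantLock spillLock = new ReentrantLock();
    private final Condition spillReady = spillLock.newCondition();
    private final Condition spillDone = spillLock.newCondition();
    private List<String> pending = null;   // batch waiting to be spilled
    private boolean finished = false;      // set once the producer has no more data
    final List<List<String>> spills = new ArrayList<>(); // stands in for spill files

    // Consumer side, shaped like the SpillThread loop.
    private void spillLoop() {
        spillLock.lock();
        try {
            while (true) {
                spillDone.signal();                    // previous spill (if any) is done
                while (pending == null && !finished) {
                    spillReady.awaitUninterruptibly(); // wait for work
                }
                if (pending == null) {
                    return;                            // finished, nothing left to spill
                }
                List<String> batch = pending;
                spillLock.unlock();                    // let the producer keep collecting
                try {
                    Collections.sort(batch);           // stand-in for sortAndSpill()
                } finally {
                    spillLock.lock();
                }
                spills.add(batch);
                pending = null;                        // ready for the next spill
            }
        } finally {
            spillLock.unlock();
        }
    }

    // Producer side: hand a full batch to the spill thread.
    private void handOff(List<String> batch) {
        spillLock.lock();
        try {
            while (pending != null) {
                spillDone.awaitUninterruptibly();      // previous spill still running
            }
            pending = batch;
            spillReady.signal();
        } finally {
            spillLock.unlock();
        }
    }

    // Producer side: wait for the last spill, then tell the spill thread to exit.
    private void finish() {
        spillLock.lock();
        try {
            while (pending != null) {
                spillDone.awaitUninterruptibly();
            }
            finished = true;
            spillReady.signal();
        } finally {
            spillLock.unlock();
        }
    }

    static List<List<String>> run(List<String> records, int batchSize) {
        SpillHandshakeSketch s = new SpillHandshakeSketch();
        Thread spillThread = new Thread(s::spillLoop, "SpillThread");
        spillThread.start();
        List<String> batch = new ArrayList<>();
        for (String r : records) {
            batch.add(r);                              // "collect" continues in parallel
            if (batch.size() == batchSize) {
                s.handOff(batch);
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) {
            s.handOff(batch);
        }
        s.finish();
        try {
            spillThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return s.spills;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("d", "c", "b", "a", "e"), 2));
    }
}
```

As in the original loop, the lock is released while the expensive sort-and-write work runs, so the collecting side blocks only when it wants to hand over a new batch before the previous spill has finished.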


The internal flow of the sortAndSpill() method is as follows:
In the first step, the data in the range kvbuffer[bufstart, bufend) is sorted with the quicksort algorithm, first by partition number and then by key. After these two levels of sorting, the data is grouped by partition, and the data within each partition is ordered by key.
In the second step, the data of each partition is written, in ascending order of partition number, to a temporary file in the working directory of the task. If the user has set a Combiner, the data in each partition is aggregated once before it is written to the file, for example <key1,val1> and <key1,val2> are merged into <key1,<val1,val2>>.
In the third step, the meta-information of the partition data is written to the in-memory index data structure SpillRecord. The meta-information of a partition includes its offset in the temporary file, the size of the data before compression, and the size of the data after compression.
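The three steps can be sketched in a few lines. This is an illustration under several assumptions and not the Hadoop code: it uses List.sort (a merge sort) rather than quicksort, a StringBuilder stands in for the temporary file, the combiner is a hypothetical word-count style sum over integer values, and the types Rec and IndexEntry are invented for the example. Compression is left out, so each index entry records only one length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of sort, combine, write, and index; not the Hadoop implementation.
class SortAndSpillSketch {
    static final class Rec {
        final int partition;
        final String key;
        final int value;
        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical per-partition metadata: where the partition starts and how long it is.
    static final class IndexEntry {
        final int partition;
        final int startOffset;
        final int length;
        IndexEntry(int partition, int startOffset, int length) {
            this.partition = partition;
            this.startOffset = startOffset;
            this.length = length;
        }
        @Override
        public String toString() {
            return partition + ":" + startOffset + "+" + length;
        }
    }

    final StringBuilder spillFile = new StringBuilder();      // stands in for the temp file
    final List<IndexEntry> spillRecord = new ArrayList<>();   // in-memory index

    void sortAndSpill(List<Rec> buffer, int numPartitions) {
        // Step 1: order by partition number first, then by key.
        List<Rec> sorted = new ArrayList<>(buffer);
        sorted.sort((a, b) -> a.partition != b.partition
                ? Integer.compare(a.partition, b.partition)
                : a.key.compareTo(b.key));

        int i = 0;
        for (int p = 0; p < numPartitions; p++) {
            int start = spillFile.length();
            // Step 2: write partition p, combining equal keys by summing their values.
            while (i < sorted.size() && sorted.get(i).partition == p) {
                String key = sorted.get(i).key;
                int sum = 0;
                while (i < sorted.size() && sorted.get(i).partition == p
                        && sorted.get(i).key.equals(key)) {
                    sum += sorted.get(i).value;
                    i++;
                }
                spillFile.append(key).append('=').append(sum).append('\n');
            }
            // Step 3: record where this partition lives in the file.
            spillRecord.add(new IndexEntry(p, start, spillFile.length() - start));
        }
    }

    public static void main(String[] args) {
        SortAndSpillSketch s = new SortAndSpillSketch();
        s.sortAndSpill(Arrays.asList(
                new Rec(1, "b", 1), new Rec(0, "a", 1), new Rec(1, "a", 1),
                new Rec(0, "a", 1), new Rec(1, "b", 1)), 2);
        System.out.print(s.spillFile);
        System.out.println(s.spillRecord);
    }
}
```

Writing partitions contiguously and keeping an offset for each one is what later allows a single partition to be read from the file without scanning the others.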

         Combine Process 
When all the data for a task has been processed, MapTask merges all of that task's temporary files into one large file and generates the corresponding index file. The merge is carried out partition by partition.
Having each task end up with a single file avoids the overhead of opening a large number of files at the same time and of the random reads caused by many small files.
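A partition-by-partition merge of already sorted spill data can be sketched with a priority queue. This is a minimal in-memory illustration, not the Hadoop merge code: each spill is modelled as a list of sorted key lists (one per partition), and the names MergeSketch, Cursor, and merge are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// In-memory sketch of a per-partition k-way merge; not the Hadoop implementation.
class MergeSketch {
    // Cursor over one spill's sorted run for the partition being merged.
    static final class Cursor {
        final List<String> run;
        int pos = 0;
        Cursor(List<String> run) {
            this.run = run;
        }
        String head() {
            return run.get(pos);
        }
    }

    // spills.get(s).get(p) is the sorted key list of partition p in spill s.
    static List<List<String>> merge(List<List<List<String>>> spills, int numPartitions) {
        List<List<String>> finalFile = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            // The heap always exposes the smallest unread key across all spills.
            PriorityQueue<Cursor> heap =
                    new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
            for (List<List<String>> spill : spills) {
                List<String> run = spill.get(p);
                if (!run.isEmpty()) {
                    heap.add(new Cursor(run));
                }
            }
            List<String> merged = new ArrayList<>();
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                merged.add(c.head());
                c.pos++;
                if (c.pos < c.run.size()) {
                    heap.add(c);   // re-insert with its next key as the new head
                }
            }
            finalFile.add(merged);
        }
        return finalFile;
    }

    public static void main(String[] args) {
        List<List<List<String>>> spills = Arrays.asList(
                Arrays.asList(Arrays.asList("a", "c"), Arrays.asList("x")),
                Arrays.asList(Arrays.asList("b"), Arrays.asList("w", "y")));
        System.out.println(merge(spills, 2));
    }
}
```

Because every spill is already sorted within each partition, the merge only compares the current head of each run, so the output of each partition stays sorted without a full re-sort.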

