Job Flow: Mapper Class Analysis


This article picks up the job flow from the previous piece, which covered the factors that determine the number of map tasks: once job submission is complete, processing is handled by the Mapper class.

1) The setup() and cleanup() methods in the Mapper class are responsible for the initialization and cleanup of the map task (both default to empty implementations).
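A minimal sketch of what overriding these two hooks looks like; the class name and the word-count logic are illustrative, not taken from the original article:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Called once before the first map() call; the default implementation is empty.
            // Typical uses: open side files, read parameters from the configuration.
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, one);
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Called once after the last map() call; also empty by default.
            // Typical uses: flush buffered state, close resources.
        }
    }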

2) The run() method in the Mapper class is responsible for invoking the user-defined map() method; its core is a while loop. The Context class is an inner class of Mapper that implements the MapContext interface and, through it, the TaskInputOutputContext interface. It declares an abstract nextKeyValue() method that parses the InputSplit into key-value pairs; a concrete implementation is provided in the subclass MapContextImpl.
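Lightly simplified from the Hadoop 2.x source of org.apache.hadoop.mapreduce.Mapper, the run() method is essentially:

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);                              // one-time initialization
        try {
            while (context.nextKeyValue()) {         // parse the next key-value pair
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);                        // one-time cleanup
        }
    }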

In MapContextImpl, the reader field is an instance of the RecordReader class, so parsing the InputSplit into key-value pairs ultimately comes down to invoking the RecordReader's nextKeyValue() method.
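The delegation, lightly simplified from org.apache.hadoop.mapreduce.task.MapContextImpl in the Hadoop 2.x source:

    private RecordReader<KEYIN, VALUEIN> reader;

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return reader.nextKeyValue();    // parsing is delegated to the RecordReader
    }

    @Override
    public KEYIN getCurrentKey() throws IOException, InterruptedException {
        return reader.getCurrentKey();
    }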

3) The map output is first written to memory, specifically to the MapOutputBuffer class. The first stage of instantiating this class is its init() method, which initializes the memory buffer from the configuration (a configuration sketch follows this list):

    • partition: the number of partitions, taken from the number of reduce tasks configured for the job; defaults to 1.
    • sortmb: the size of the memory buffer; defaults to 100 MB.
    • spillper: the spill threshold of the memory buffer; defaults to 0.8, i.e. 100 MB * 0.8 = 80 MB.
    • indexCacheMemoryLimit: the size of the in-memory index cache; defaults to 1024 * 1024 bytes.
    • sorter: sorts the map output by key; defaults to QuickSort.
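A minimal sketch of tuning these parameters through the job configuration; the property names are the Hadoop 2.x ones, so verify them against your version:

    import org.apache.hadoop.conf.Configuration;

    public class BufferConfigDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // partition: follows the number of reduce tasks (default 1)
            conf.setInt("mapreduce.job.reduces", 4);
            // sortmb: in-memory buffer size in MB (default 100)
            conf.setInt("mapreduce.task.io.sort.mb", 200);
            // spillper: spill threshold as a fraction of the buffer (default 0.80)
            conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
            // index cache limit in bytes (default 1024 * 1024)
            conf.setInt("mapreduce.task.index.cache.limit.bytes", 1024 * 1024);
            // sorter implementation (default QuickSort)
            conf.set("map.sort.class", "org.apache.hadoop.util.QuickSort");
            System.out.println("io.sort.mb = " + conf.get("mapreduce.task.io.sort.mb"));
        }
    }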

MR Execution Process

1) The client submits the MapReduce jar package to the JobClient (submission command: hadoop jar ...); the JobClient runs on the submitting node.

2) The JobClient communicates with the ResourceManager (RM) via an RPC protocol; the RM returns a job ID and an HDFS path for storing the jar package.

3) The JobClient uses the FileSystem API to write the jar package to HDFS (path = the HDFS address plus the job ID). Ten replicas are written by default (mapreduce.client.submit.file.replication); they are deleted when the job finishes.

4) The JobClient then submits the MR task: what is submitted to the RM is the task description (not the jar package itself, but the job ID, the jar's location, configuration information, and so on).
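Steps 1) through 4) are what a driver program triggers when it calls waitForCompletion(). A minimal sketch, reusing the hypothetical WordCountMapper from earlier:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            // Tells the framework which jar to ship to HDFS (steps 2 and 3).
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // Submits the task description to the RM (step 4) and polls until done.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, this is what the hadoop jar command from step 1) launches.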

5) The RM initializes the task. The task description is placed in the scheduler (a queue-based scheduler by default), and the NodeManager (NM) receives the task from the RM through the heartbeat mechanism. The NM then starts the corresponding subprocess, the ApplicationMaster, which runs the task once it has been picked up.

6) The ApplicationMaster reads the files to be processed on HDFS and begins to compute the input splits; each split corresponds to one MapperTask (see the sketch below).
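For file inputs, the split computation boils down to a one-line rule in FileInputFormat. A small standalone demo that mirrors it (the 128 MB block size and the 300 MB file are illustrative):

    public class SplitSizeDemo {
        // Mirrors FileInputFormat.computeSplitSize() in Hadoop 2.x.
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // HDFS block size
            long minSize = 1L;                   // mapreduce.input.fileinputformat.split.minsize
            long maxSize = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize
            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            // With the defaults, split size equals block size, so a 300 MB file
            // yields 3 splits and therefore 3 MapperTasks.
            System.out.println("split size = " + splitSize + " bytes");
        }
    }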

7) The NM continues to receive task resources (task descriptions) through the heartbeat mechanism.

8) The required jar packages, configuration files, and so on are downloaded to the node.

9) The NM launches a subprocess, YarnChild, to execute the concrete subtask (a MapperTask or a ReducerTask).

10) The final result is written to HDFS.
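To check the result, the job output can be read back through the FileSystem API; a minimal sketch (the part-file path is hypothetical):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadOutput {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Reducer output lands in part files under the job's output directory.
            Path part = new Path("/user/hadoop/output/part-r-00000");
            try (BufferedReader in =
                    new BufferedReader(new InputStreamReader(fs.open(part)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }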
