The fundamentals of MapReduce


MapReduce Roles
Client: initiates job submission.
JobTracker: initializes the job, assigns tasks, communicates with the TaskTrackers, and coordinates the entire job.
TaskTracker: keeps in touch with the JobTracker via heartbeats and runs map or reduce tasks on the data splits assigned to it.
Submitting a Job
• A job must be configured before it is submitted:
• the program code, mainly the user-written MapReduce program;
• the input and output paths;
• other configuration, such as output compression.
• Once configured, the job is submitted through JobClient, as in the sketch below.
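For illustration, a minimal driver sketch using the old mapred API; the mapper and reducer class names and the paths are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");
    // Program code: the user-written map and reduce classes (hypothetical names).
    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // Input/output paths.
    FileInputFormat.setInputPaths(conf, new Path("/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/output"));
    // Other configuration, e.g. output compression.
    conf.setBoolean("mapred.output.compress", true);
    // Submit via JobClient and wait for completion.
    JobClient.runJob(conf);
  }
}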
Job Initialization
• After the client finishes submission, the JobTracker queues the job and schedules it; the default scheduling policy is FIFO.
Assignment of Tasks
Communication and task assignment between TaskTracker and JobTracker happen through the heartbeat mechanism.
The TaskTracker proactively asks the JobTracker whether there is work to do; if it has free capacity, it is assigned a task, which may be either a map task or a reduce task.
Execution of tasks
• After receiving a task, the TaskTracker does the following:
• copies the job code to the local machine;
• copies the task information to the local machine;
• starts a JVM and runs the task.
Status and Task Updates
• While a task runs, it first reports its status to its TaskTracker, which then passes a summary on to the JobTracker.
• Task progress is tracked through counters, as sketched below.
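A minimal sketch of reporting from inside a map task with the old mapred API; the counter group and name are illustrative:

// Inside a Mapper implementation: bump a counter and report status.
// Counters roll up from the task to the TaskTracker and on to the JobTracker.
public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  reporter.incrCounter("MyApp", "RECORDS_SEEN", 1);
  reporter.setStatus("processing offset " + key.get());
  output.collect(value, new IntWritable(1));
}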


Job Completion
The JobTracker marks the job as successful once the last task has completed and been accepted.
• It then removes the intermediate results and performs other cleanup work.



Part II: Error Handling
Task Failure

MapReduce was designed on the assumption that tasks will fail, so a great deal of work went into fault tolerance.
• One scenario: a child task fails.
• Another scenario: a child task's JVM exits abruptly.
• A third scenario: the task hangs.
TaskTracker Failure
After a crash, a TaskTracker stops sending heartbeat messages to the JobTracker.
The JobTracker then removes that TaskTracker from its pool of available workers and reschedules the tasks that were running on it elsewhere.
A TaskTracker can also be blacklisted by the JobTracker even if it has not actually failed.


JobTracker Failure
• The JobTracker is a single point of failure; the new Hadoop version 0.23 resolves this issue.
Part III: Job Scheduling
FIFO
The default scheduler in Hadoop. It selects the jobs to execute first by job priority and then by arrival time.

Fair Scheduler
A method of assigning resources to jobs whose goal is that, over time, every submitted job receives an equal share of the cluster's resources, letting users share the cluster fairly. In practice, when only one job is running it uses the entire cluster; when other jobs are submitted, the system assigns free TaskTracker slots (time slices) to the new jobs, ensuring every job gets a roughly equal amount of CPU time.

Capacity Scheduler

Supports multiple queues. Each queue can be configured with a certain share of resources and uses a FIFO policy internally. To prevent jobs from a single user from monopolizing a queue's resources, the scheduler limits the resources that jobs submitted by the same user may consume. When scheduling, it first selects a suitable queue: for each queue it computes the ratio of the number of running tasks to the compute resources the queue is entitled to, and it picks the queue with the lowest ratio. It then selects a job within that queue, in order of job priority and submission time, while also honoring per-user resource limits and memory limits.
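For reference, a sketch of what a queue definition looks like in capacity-scheduler.xml under the classic (Hadoop 1.x) Capacity Scheduler; the values are illustrative, so check the property names against your release:
<property>
  <name>mapred.capacity-scheduler.queue.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.default.minimum-user-limit-percent</name>
  <value>25</value>
</property>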
Configure Fair Scheduler
1. Modify mapred-site.xml and add the following:
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/opt/hadoop/conf/allocations.xml</value>
</property>
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

2. Create allocations.xml under Hadoop's conf directory. The minimal content is:
<?xml version="1.0"?>
<allocations>
</allocations>
For example, with pool and user entries:
<?xml version="1.0"?>
<allocations>
  <pool name="Sample_pool">
    <minMaps>5</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
  <user name="Sample_user">
    <maxRunningJobs>6</maxRunningJobs>
  </user>
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>
3. Restart the JobTracker.
4. Visit http://jobtracker:50030/scheduler to view the FairScheduler UI.
5. Submit a job to test, as shown below.
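Because mapred.fairscheduler.poolnameproperty is set to pool.name above, a test job can be directed at a specific pool on the command line; the jar and driver names are hypothetical, and the -D generic option assumes the driver runs through ToolRunner:

hadoop jar myjob.jar MyDriver -Dpool.name=Sample_pool /input /output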




Part IV: Shuffle and Sorting
After the map phase ends, the data is reorganized as input to the reduce phase; this process is called the shuffle.
The data is sorted on both the map side and the reduce side.
Map Side

The output of the map task is managed by a collector.
• The process starts from the collect function.
Reduce Side
The shuffle on the reduce side has three stages: copying the map output, sort/merge, and the reduce processing itself.
• The main code is in the reducer's run function.

Shuffle Optimization

• First of all, Hadoop's shuffle is not optimal in every case; for example, when the data only needs to be merged, the sort step is unnecessary.
• We can optimize the shuffle by adjusting parameters; the main ones are listed below, followed by a usage sketch.
Map side:
io.sort.mb
Reduce side:
mapred.job.reduce.input.buffer.percent
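Both knobs can be set per job on the JobConf; the values here are illustrative, not recommendations:

// Map side: size (MB) of the in-memory buffer used to sort map output.
conf.setInt("io.sort.mb", 200);
// Reduce side: fraction of reducer heap that may hold map output during
// the reduce, letting merged data stay in memory instead of spilling to disk.
conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.7f);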

Part V: Concepts Specific to Task Execution
Speculative Execution

• Each of a job's tasks has a running time, and because machines are heterogeneous, some tasks may run much slower than the average.
• MapReduce will launch a duplicate of a slow task on another machine so that the task finishes sooner.
• This behavior is enabled by default, and can be toggled as sketched below.
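In the old mapred API the map and reduce sides are controlled separately on the JobConf:

// Disable speculative execution for map tasks, keep it for reduce tasks.
conf.setMapSpeculativeExecution(false);
conf.setReduceSpeculativeExecution(true);
// Equivalent properties: mapred.map.tasks.speculative.execution and
// mapred.reduce.tasks.speculative.execution.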


JVM Reuse

• Starting a JVM is relatively time-consuming, so MapReduce provides a JVM reuse mechanism.
• The condition is that the tasks belong to the same job.
• The amount of reuse is set via mapred.job.reuse.jvm.num.tasks; if the property is -1, reuse is unlimited. See the sketch below.
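With the old mapred API this can also be set programmatically on the JobConf:

// -1 means a JVM may run an unlimited number of tasks of the same job.
conf.setNumTasksToExecutePerJvm(-1);
// Equivalent to setting mapred.job.reuse.jvm.num.tasks to -1.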




Skipping Bad Records
• Some records in the data do not conform to the expected format and throw an exception when processed; MapReduce can mark such a record as a bad record and skip it when the task is re-executed.
• This feature is off by default; a sketch of enabling it follows.
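A minimal sketch using the SkipBadRecords helper of the old mapred API; the threshold values are illustrative:

import org.apache.hadoop.mapred.SkipBadRecords;

// Start skipping after 2 failed attempts of a task; allow up to 100 map
// records (and 100 reduce key groups) to be skipped around a bad record.
SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
SkipBadRecords.setMapperMaxSkipRecords(conf, 100);
SkipBadRecords.setReducerMaxSkipGroups(conf, 100);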

Task Execution Environment
Hadoop exposes information about the running environment to map and reduce tasks.
• For example, a map task can find out which file it is processing, as sketched below.
• Problem: multiple task attempts may write to the same file at the same time.
• Workaround: each task writes its output to a task-specific temporary directory: ${mapred.output.dir}/temp/${mapred.task.id}
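A sketch of reading the current input file inside an old-API mapper:

private String inputFile;

// configure() is called once before any map() call; map.input.file
// holds the path of the file this map task is processing.
public void configure(JobConf job) {
  inputFile = job.get("map.input.file");
}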





Part VI: Types and Formats of MapReduce
Types
MapReduce uses key-value pairs (key, value) as its input and output types.
• The input and output data types are set by the input and output formats.
Input Format
• Input shards and records
• File input
• Text input
• Binary input
• Multi-file input
• Database input

Input Splits and Records
Hadoop represents an input split with InputSplit.
• A split is not the data itself but a reference to the data.
The InputFormat interface is responsible for generating splits.


File Input
• Implementation class: FileInputFormat
• The base class for input formats that use files as the input source.
• Four methods:
addInputPath()
addInputPaths()
setInputPath()
setInputPaths()
FileInputFormat splits files according to the HDFS block size.
• To avoid splitting:
• subclass FileInputFormat and override isSplitable() to return false, as in the sketch below.
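A minimal sketch of a non-splittable text input format in the old mapred API:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    // Each file becomes exactly one split, regardless of block size.
    return false;
  }
}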

Text Input

• Implementation class: TextInputFormat
TextInputFormat is the default input format.
Related formats include:
KeyValueTextInputFormat
NLineInputFormat
XML input
• The relationship between input splits and HDFS blocks:
a TextInputFormat record may span block boundaries.

Binary Input

• Implementation class: SequenceFileInputFormat
• Processes binary data.
Related formats include:
SequenceFileAsTextInputFormat
SequenceFileAsBinaryInputFormat

Multi-file Input

• Implementation class: MultipleInputs
• Handles multiple input sources, as in the sketch below.
Key method:
addInputPath()
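A sketch of wiring two inputs with different formats and mappers into one job; the paths and mapper classes are hypothetical:

import org.apache.hadoop.mapred.lib.MultipleInputs;

// Each path gets its own InputFormat and its own Mapper.
MultipleInputs.addInputPath(conf, new Path("/data/logs"),
    TextInputFormat.class, LogMapper.class);
MultipleInputs.addInputPath(conf, new Path("/data/archive"),
    SequenceFileInputFormat.class, ArchiveMapper.class);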

Database Input

• Implementation class: DBInputFormat
• Use database input with care, because it can open too many connections.

Output Formats
• Text output
• Binary output
• Multi-file output
• Database output
Text Output
• Implementation class: TextOutputFormat
• The default output format.
• Writes records as "key \t value".
Binary Output

• Base class: SequenceFileOutputFormat
• Implementation classes:
SequenceFileAsTextOutputFormat
MapFileOutputFormat
SequenceFileAsBinaryOutputFormat

Multi-file Output

• Classes: MultipleOutputFormat and MultipleOutputs
• The difference between the two is that MultipleOutputs can produce different types of output; a sketch follows.
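A minimal sketch with MultipleOutputs in the old mapred API; the named output "text" is illustrative:

import org.apache.hadoop.mapred.lib.MultipleOutputs;

// In the driver: declare a named output with its own format and types.
MultipleOutputs.addNamedOutput(conf, "text",
    TextOutputFormat.class, Text.class, IntWritable.class);

// In the reducer: create it in configure(), write to it in reduce(),
// and close it in close():
// mos = new MultipleOutputs(job);
// mos.getCollector("text", reporter).collect(key, value);
// mos.close();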
Database Output
• Implementation class: DBOutputFormat
