One of the two cores of Hadoop: a MapReduce summary


MapReduce is a distributed computing model proposed by Google, originally for the search field. A MapReduce program is inherently parallel, so it can tackle computation over massive data sets.

A MapReduce job is divided into two processing stages: the map phase and the reduce phase. Each stage takes key/value pairs as input and produces key/value pairs as output. Users only need to implement the two functions map() and reduce() to achieve distributed computing (see the word-count sketch after the task-processing steps below).

Execution steps:

Map Task Processing:

1. Read the contents of the input file and parse them into key/value pairs: each line of the input file becomes one key/value pair, and the map function is called once for each pair.

2. Apply your own logic to each input key/value pair and emit new key/value pairs as output.

3. Partition the output key/value pairs. (Partition)

4. Within each partition, the data is sorted by key and grouped: values with the same key are collected together. (Shuffle)

5. The grouped data is locally reduced. (Combiner, optional)

Reduce task Processing:

1. The output of multiple map tasks is copied over the network to different reduce nodes according to its partition.

2. Merge and sort the outputs of the multiple map tasks, then apply your own reduce logic to the input key/value pairs and emit new key/value pairs as output.

3. Save the output of reduce to a file (written to HDFS).
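As a concrete illustration of the map() and reduce() functions described above, here is a minimal word-count sketch. It assumes the org.apache.hadoop.mapreduce API; the class names and the whitespace tokenization are illustrative choices rather than anything prescribed by the article.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for each input line (k1 = byte offset, v1 = line text), emit <word, 1> pairs (k2/v2).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // map() is called once per input key/value pair
            }
        }
    }
}

// Reduce: after the shuffle, all values for one key arrive together; sum them and emit <word, total>.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The shuffle described in step 4 above is what guarantees that all <word, 1> pairs for the same word reach the same reduce() call.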

MapReduce Job Flow:

1. Code Writing

2. Job configuration (input/output paths, number of reduce tasks, etc.; see the driver sketch after this job flow)

3. Job Submission

3.1 The job is submitted through the JobClient, which communicates with the JobTracker to obtain a storage path for the jar and a job ID.

3.2 Check the input and output paths.

3.3 Compute the input split information.

3.4 Copy the resources the job needs (the jar, configuration files, and the computed input splits) to an HDFS directory named after the job ID.

3.5 Inform the JobTracker that the job is ready for execution.

4. Job initialization

When the JobTracker receives the submitted job, it places it in an internal queue, from which the job scheduler (FIFO by default) dispatches and initializes it.

Initialization: an object representing the running job is created; it wraps the tasks and bookkeeping information so that the status and progress of each task can be tracked.

To build the task list, the job scheduler first retrieves the computed input split information from the shared file system, then creates one map task per split and the configured number of reduce tasks. At this point each task is assigned an ID.

5. Task Assignment

The TaskTracker runs a simple loop that periodically sends a "heartbeat" to the JobTracker. The heartbeat tells the JobTracker that the TaskTracker is still alive and indicates whether it is ready to run a new task; if so, the JobTracker assigns it one.

6. Task execution

When the TaskTracker receives the task:

1. All required information is copied locally (including the jar, code, configuration, split information, etc.).

2. The TaskTracker creates a new local working directory for the task and extracts the contents of the jar file into that directory.

3. The TaskTracker creates a new TaskRunner instance to run the task. The TaskRunner launches a new JVM for each task (so that a misbehaving task cannot affect the TaskTracker itself, although the JVM may be reused between different tasks).

7. Updates to progress and status

Each task reports to its TaskTracker at regular intervals; the TaskTracker periodically collects the information for all tasks it runs and reports it to the JobTracker. The JobTracker aggregates the information reported by all TaskTrackers into an overall picture of the job.

8. Job completion

The JobTracker marks the job as "successful" after it is notified that the last task has completed, and the results are written to HDFS.
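The code-writing, configuration, and submission steps above correspond roughly to a driver program like the following sketch. It reuses the illustrative WordCountMapper/WordCountReducer classes from the earlier example and assumes the newer Job.getInstance() factory (older releases used new Job(conf, ...)); the paths come from the command line.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Step 2: job configuration (input/output paths, number of reduce tasks, etc.)
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Step 3: submission -- the framework checks the paths, computes the input splits,
        // copies the resources to HDFS, and tells the JobTracker the job is ready to run.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```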

PS:

JobTracker role: receives jobs submitted by users; responsible for starting and tracking task execution.

TaskTracker role: responsible for executing tasks.

Job failure:

1. JobTracker failure

This is the most serious kind of failure: the JobTracker is a single point of failure, so when it goes down the job is doomed. (Addressed in Hadoop 2.0.)

2. TaskTracker failure

A crashed TaskTracker stops sending heartbeats to the JobTracker. The JobTracker removes that TaskTracker from the pool of available workers and reschedules its tasks elsewhere; the JobTracker may also add the TaskTracker to a blacklist.

3. Task failure

The map or reduce code fails at runtime and throws an exception, which is reported to the TaskTracker, and the task attempt is terminated.

MapReduce startup process:

start-mapred.sh --> hadoop-daemon.sh --> hadoop --> org.apache.hadoop.mapred.JobTracker

JobTracker call order:

main() --> startTracker() --> new JobTracker(): the constructor first creates a scheduler and then an RPC server (interTrackerServer) through which the TaskTrackers communicate. offerService() is then called to provide service to the outside: it starts the RPC server, initializes the JobTracker, and calls the start() method of the TaskScheduler, which calls start() on the EagerTaskInitializationListener, which in turn starts the JobInitManagerThread. Being a thread, it runs the JobInitManager run() method, which takes the first job from the jobInitQueue task queue and hands it to a thread pool --> InitJob's run() method is called, which calls the JobTracker's initJob() method

--> JobInProgress's initTasks()

--> maps = new TaskInProgress[numMapTasks] and reduces = new TaskInProgress[numReduceTasks];

TaskTracker call order:

main() --> new TaskTracker(): the constructor calls the initialize() method, which calls RPC.waitForProxy() to obtain a proxy object for the JobTracker. The TaskTracker then calls its own run() method --> offerService() method --> transmitHeartbeat(), whose return value (HeartbeatResponse) carries the JobTracker's instructions. Inside transmitHeartbeat(), the InterTrackerProtocol's heartbeat() method sends the TaskTracker's state to the JobTracker over RPC; from the returned HeartbeatResponse, getActions() extracts the concrete instructions and the instruction type is checked. Launch-task instructions are passed to addToTaskQueue(), and the TaskLauncher puts the task into its task queue

--> TaskLauncher's run() method --> startNewTask() method

--> localizeJob() downloads the resources --> launchTaskForJob() starts the task

--> launchTask() --> runner.start() starts a thread

--> TaskRunner's run() method calls launchJvmAndWait() to start the Java child process.

MapReduce details

Serialization Concepts

Serialization: Refers to converting a structured object into a byte stream.

Deserialization: the inverse of serialization, i.e. turning a byte stream back into a structured object.

Features of the Hadoop serialization format:

1. Compact: Efficient use of storage space

2. Fast: Little extra overhead for reading and writing data

3. Extensible: can transparently read data written in an older format

4. Interoperable: supports interaction across multiple languages.

The role of serialization in Hadoop:

Serialization plays two important roles in a distributed environment: interprocess communication (the RPC between Hadoop nodes) and permanent storage.
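To make this concrete, here is a hedged sketch of a custom type implementing Hadoop's Writable interface (the field names are invented for illustration): write() serializes the structured object into a byte stream, and readFields() deserializes it back.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A structured object that Hadoop can serialize for RPC or for permanent storage.
public class TrafficRecord implements Writable {
    private long upBytes;
    private long downBytes;

    public TrafficRecord() { }   // Writable types need a no-arg constructor

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeLong(upBytes);
        out.writeLong(downBytes);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        upBytes = in.readLong();      // fields must be read in the order they were written
        downBytes = in.readLong();
    }
}
```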

Partitioner programming

Data that shares some common characteristic is routed to the same reduce task and therefore written to the same output file.
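For example, a custom Partitioner sketch (the prefix rule here is purely illustrative) decides which reduce task, and therefore which output file, each key/value pair goes to; it is enabled in the driver with job.setPartitionerClass(...) together with a matching setNumReduceTasks().

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Keys that share a characteristic (here: a common prefix) go to the same
// reduce task, so their records end up in the same output file.
public class PrefixPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int bucket = key.toString().startsWith("135") ? 0 : 1;  // illustrative rule
        return bucket % numPartitions;
    }
}
```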

Sorting and Grouping

Sorting in the map and reduce phases compares only k2; v2 does not take part in the comparison. If you want v2 to be sorted as well, you need to combine k2 and v2 into a new class that is used as k2, so that it takes part in the comparison. To customize the sort order, have the sorted object implement the WritableComparable interface and put the ordering logic in its compareTo() method; this object is then used as k2, and sorting and grouping are performed on k2.
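A hedged sketch of such a composite key (the two int fields are invented for illustration): the old k2 and v2 are packed into one class that implements WritableComparable, and compareTo() supplies the custom ordering (first field ascending, second field descending here).

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: both fields take part in the sort; used as the new k2.
public class IntPair implements WritableComparable<IntPair> {
    private int first;
    private int second;

    public IntPair() { }

    public void set(int first, int second) {
        this.first = first;
        this.second = second;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    @Override
    public int compareTo(IntPair other) {                   // the custom collation
        if (first != other.first) {
            return Integer.compare(first, other.first);     // first field ascending
        }
        return Integer.compare(other.second, second);       // second field descending
    }
}
```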

Combiners programming

1. Each map task generates a lot of output; the combiner's job is to merge that output on the map side first, reducing the amount of data transferred to the reducers.

2. At its most basic, a combiner merges values for the same key locally, much like a local reduce function. Without a combiner, all intermediate results go straight to the reducers and efficiency is relatively low.

3. With a combiner, the map output is aggregated locally first, which speeds up the job.

PS: the output of the combiner is the input of the reducer, and the combiner must never change the final result. So, in my view, a combiner is only applicable when the reducer's input key/value and output key/value types are exactly the same and applying it does not affect the final result, for example summation or taking a maximum.
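In the word-count sketch above, summation fits this rule, so the reducer class itself can double as the combiner. A hedged fragment for the earlier driver's main() (class names come from the illustrative examples above):

```java
// Map-side pre-aggregation: the same reducer class works as a combiner here,
// because summing partial counts does not change the final result.
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // local "mini reduce" on each map node
job.setReducerClass(WordCountReducer.class);
```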

Shuffle

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort and transfers the map output to the reducers as input is called the shuffle.

1. When the map function starts producing output, it is not simply written to disk. For efficiency, it is written to memory through a buffer and pre-sorted. Each map task has a circular memory buffer that stores the task's output. By default the buffer is 100 MB; once the buffered content reaches a threshold (80% by default), a background thread spills the content to a new overflow file in the configured local disk directory. While the spill is in progress, map output continues to be written to the buffer, but if the buffer fills up during this time the map blocks until the spill to disk completes.

2. Before being written to disk, the data is partitioned and sorted; if there is a combiner, it runs on the sorted data.

3. After the last record is written, all overflow files are merged into a single partitioned and sorted file.
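The buffer size and spill threshold mentioned above are configurable. A hedged sketch follows; note that the property names vary by Hadoop version (older releases use io.sort.mb and io.sort.spill.percent, newer ones mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent).

```java
import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Ring buffer size for map output, in MB (100 by default).
        conf.setInt("mapreduce.task.io.sort.mb", 200);
        // Fraction of the buffer at which the background spill thread starts (0.80 by default).
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
        // Pass this Configuration to Job.getInstance(conf, ...) when building the job.
    }
}
```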

How does a reducer know which TaskTrackers to fetch map output from?

When a map task completes successfully, it notifies its parent TaskTracker of the status update, and the TaskTracker in turn informs the JobTracker. These notifications travel over the heartbeat mechanism, so for a given job the JobTracker knows the mapping between map outputs and TaskTrackers. A thread in the reducer periodically asks the JobTracker for the locations of map output until it has obtained all of them.

http://m.oschina.net/blog/213034
