Hadoop Architecture Design and Operating Principles in Detail


1. The logical process of MapReduce

Suppose we need to process a batch of weather data in the following format:

The records are stored in ASCII, one record per line.

Characters in each line are counted from 0; characters 15 through 18 are the year.

Characters 25 through 29 are the temperature, where character 25 is the sign (+ or -).

0067011990999991950051507+0000+

0043011990999991950051512+0022+

0043011990999991950051518-0011+

0043012650999991949032412+0111+

0043012650999991949032418+0078+

0067011990999991937051507+0001+

0043011990999991937051512-0002+

0043011990999991945051518+0001+

0043012650999991945032412+0002+

0043012650999991945032418+0078+

Now we need to find the maximum temperature for each year.
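As a quick illustration (a sketch, not part of the original article), the fields can be extracted from a record by character position:

String line = "0067011990999991950051507+0000+"; // first sample record
String year = line.substring(15, 19);            // characters 15-18: "1950"
String temp = line.substring(25, 30);            // characters 25-29: "+0000"
int airTemperature = temp.charAt(0) == '+'
    ? Integer.parseInt(temp.substring(1))        // strip the leading '+'
    : Integer.parseInt(temp);                    // negative values parse as-is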

MapReduce consists of two main steps: map and reduce.

Each step takes key-value pairs as input and produces key-value pairs as output:

The format of the key-value pairs in the map phase is determined by the input format. With the default TextInputFormat, each line is treated as one record: the key is the offset of the line from the beginning of the file, and the value is the text of the line.

The format of the key-value pairs output by the map phase must match the format of the key-value pairs taken as input by the reduce phase.
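This correspondence is visible in the old org.apache.hadoop.mapred interfaces, shown here slightly simplified: the mapper's output types (K2, V2) are exactly the reducer's input types.

public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
    // Emits zero or more (K2, V2) pairs for one input (K1, V1) record.
    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
        throws IOException;
}

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
    // Receives all values grouped under one key and emits (K3, V3) pairs.
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output,
        Reporter reporter) throws IOException;
}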

For the example above, the input key-value pairs to the map phase are:

(0,0067011990999991950051507+0000+)

(33,0043011990999991950051512+0022+)

(66,0043011990999991950051518-0011+)

(99,0043012650999991949032412+0111+)

(132,0043012650999991949032418+0078+)

(165,0067011990999991937051507+0001+)

(198,0043011990999991937051512-0002+)

(231,0043011990999991945051518+0001+)

(264,0043012650999991945032412+0002+)

(297,0043012650999991945032418+0078+)

In the map phase, each line is parsed to produce (year, temperature) key-value pairs:

(1950, 0)

(1950, 22)

(1950, -11)

(1949, 111)

(1949, 78)

(1937, 1)

(1937, -2)

(1945, 1)

(1945, 2)

(1945, 78)

Before the reduce phase, map outputs with the same key are grouped into a list, which becomes the input to reduce:

(1950, [0, 22, -11])

(1949, [111, 78])

(1937, [1, -2])

(1945, [1, 2, 78])

In the reduce phase, the maximum temperature in each list is selected, and (year, maximum temperature) key-value pairs are emitted as the output:

(1950, 22)

(1949, 111)

(1937, 1)

(1945, 78)

[Figure: the logical flow of the MapReduce process described above]

[Figure: overview of how a MapReduce job operates]

2. Job configuration

A job is configured through JobConf, which has a number of configurable items:

setInputFormat: sets the input format for the map phase; the default is TextInputFormat, with LongWritable keys and Text values.

setNumMapTasks: sets the number of map tasks; this setting usually has no effect, because the number of map tasks is determined by how many InputSplits the input data can be divided into.

setMapperClass: sets the Mapper class; the default is IdentityMapper.

setMapRunnerClass: sets the MapRunner class. A map task is driven by a MapRunner; the default, MapRunner, reads the records of an InputSplit one by one and calls the Mapper's map function for each (a sketch appears in section 3.4.1 below).

setMapOutputKeyClass and setMapOutputValueClass: set the key-value pair format of the mapper's output.

setOutputKeyClass and setOutputValueClass: set the key-value pair format of the reducer's output.

setPartitionerClass and setNumReduceTasks: set the Partitioner, which by default is HashPartitioner and uses the hash of the key to decide which partition a record enters. Each partition is handled by one reduce task, so the number of partitions equals the number of reduce tasks (see the HashPartitioner sketch after this list).

setReducerClass: sets the Reducer class; the default is IdentityReducer.

setOutputFormat: sets the output format of the job; the default is TextOutputFormat.

FileInputFormat.addInputPath: sets an input path, which may be a single file, a directory, or a glob pattern. It can be called more than once to add multiple paths.

FileOutputFormat.setOutputPath: sets the output path, which must not exist before the job runs.
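For reference, the default HashPartitioner mentioned above is essentially the following (a sketch of the old mapred API class):

public class HashPartitioner<K2, V2> implements Partitioner<K2, V2> {

    public void configure(JobConf job) {}

    // Map a key to a partition; each partition is consumed by one reduce task.
    public int getPartition(K2 key, V2 value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}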

Of course, not all of these need to be set. For the example above, the MapReduce program can be written as follows:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}
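The driver above references MaxTemperatureMapper and MaxTemperatureReducer, which the original article does not show. A minimal sketch of what they might look like in the same old-style mapred API, following the parsing rules from section 1 (the class bodies are illustrative):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);   // characters 15-18: the year
        String temp = line.substring(25, 30);   // characters 25-29: signed temperature
        int airTemperature = temp.charAt(0) == '+'
            ? Integer.parseInt(temp.substring(1))
            : Integer.parseInt(temp);
        output.collect(new Text(year), new IntWritable(airTemperature));
    }
}

public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {              // scan the grouped temperatures
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}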

3. MapReduce data flow

The MapReduce process mainly involves the following four components:

Client: submits the MapReduce job.

JobTracker: coordinates the running of the entire job; it is a Java process whose main class is JobTracker.

TaskTracker: runs the tasks of the job, each processing one input split; it is a Java process whose main class is TaskTracker.

HDFS: the Hadoop Distributed File System, used to share job-related files among the processes above.


3.1. Job submission

JobClient.runJob() creates a new JobClient instance and calls its submitJob() method, which does the following:

Requests a new job ID from the JobTracker.

Checks the output specification of the job.

Computes the input splits for the job.

Copies the resources the job needs, including the job JAR file, the job.xml configuration file, and the computed input splits, to a folder in the JobTracker's file system.

Notifies the JobTracker that the job is ready to run.

After the job is submitted, runJob() polls the job's progress every second and reports it to the console until the job has finished.
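A rough sketch of what this polling loop amounts to in the old mapred API (exception handling omitted; submitJob() returns immediately, unlike runJob()):

JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);      // returns without waiting
while (!job.isComplete()) {                   // poll roughly once per second
    System.out.printf("map %.0f%% reduce %.0f%%%n",
        job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(1000);
}
System.out.println(job.isSuccessful() ? "Job complete" : "Job failed");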

3.2. Job initialization

When the JobTracker receives a submitJob() call, it places the job in a queue; the job scheduler takes jobs from this queue and initializes them.

Initialization begins by creating an object that encapsulates the job's tasks, status, and progress.

Before creating the tasks, the job scheduler first retrieves the input splits computed by the JobClient from the shared file system.

It then creates one map task for each input split.

Each task is assigned an ID.

3.3. Task assignment

TaskTrackers periodically send heartbeats to the JobTracker.

In a heartbeat, a TaskTracker reports that it is ready to run a new task, and the JobTracker then assigns it one.

Before choosing a task for the TaskTracker, the JobTracker first selects a job by priority and then picks a task from that highest-priority job.

Each TaskTracker has a fixed number of slots for running map tasks and reduce tasks.

The default scheduler fills map task slots before reduce task slots.

When selecting a reduce task, the JobTracker does not choose among candidates but simply takes the next one, because reduce tasks have no data-locality considerations.

3.4. Task execution

Once a TaskTracker has been assigned a task, it runs the task as follows.

First, the TaskTracker copies the job's JAR file from the shared file system to its local file system.

It also copies any files the job needs from the distributed cache to the local disk.

Second, it creates a local working directory for the task and unpacks the JAR into it.

Third, it creates a TaskRunner to run the task.

The TaskRunner launches a new child JVM in which to run the task.

The child JVM communicates with the TaskTracker to report the task's progress.

3.4.1. The map process

The MapRunnable reads records from the InputSplit one by one and calls the Mapper's map function on each, collecting the results.
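A simplified sketch of the MapRunner's run() loop (close to, but not identical to, the real implementation):

public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
        Reporter reporter) throws IOException {
    try {
        K1 key = input.createKey();
        V1 value = input.createValue();
        // Feed every record of the InputSplit to the Mapper in turn.
        while (input.next(key, value)) {
            mapper.map(key, value, output, reporter);
        }
    } finally {
        mapper.close();
    }
}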

The map output is not written directly to disk; it is first written to an in-memory buffer.

When the data in the buffer reaches a certain size, a background thread starts writing the data to disk.

Before being written to disk, the in-memory data is divided into partitions by the Partitioner.

Within each partition, the background thread sorts the in-memory data by key.

Each flush from memory to disk produces a new spill file.

When the task finishes, all spill files are merged into a single partitioned, sorted file.

Reducers fetch a map's output file over HTTP; the tasktracker.http.threads property sets the number of HTTP server threads.
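The buffer and server behavior described above can be tuned. The property names below are from Hadoop 0.20/1.x, and the values are only illustrative:

// Map-side spill tuning, set per job on the JobConf:
conf.setInt("io.sort.mb", 200);             // in-memory map output buffer size, in MB
conf.set("io.sort.spill.percent", "0.80");  // buffer fill fraction that triggers a spill
// tasktracker.http.threads is a TaskTracker (cluster-level) setting in
// mapred-site.xml rather than a per-job one; it sizes the pool of HTTP
// threads that serve map output to reducers.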

3.4.2. The reduce process

When a map task finishes, it notifies its TaskTracker, which in turn notifies the JobTracker.

For each job, the JobTracker therefore knows the correspondence between TaskTrackers and map outputs.

In the reducer, a thread periodically asks the JobTracker for the locations of map outputs until it has obtained them all.

A reduce task needs the map output for its partition from every map task.

Because map tasks finish at different times, the reduce task's copy phase starts fetching each map's output as soon as that map task completes.

The reduce task has multiple copy threads and can copy map outputs in parallel.

As map outputs accumulate at the reduce task, a background thread merges them into larger, sorted files.

Once all the map outputs have been copied, the sort phase begins, merging all the map outputs into large, sorted files.

Finally, the reduce phase runs: the Reducer's reduce function is called for each key of the sorted output, and the final results are written to HDFS.

3.5. Job completion

When the JobTracker receives a success report for the last task of the job, it changes the job status to successful.

When the JobClient next polls the JobTracker and finds that the job has finished successfully, it prints a message for the user and returns from the runJob() function.


Original link: http://blog.csdn.net/u011340807/article/details/24630467
