1. The logical process of Map-reduce
Suppose we need to process a batch of weather data in the following format:
Stored as ASCII text, one record per line
Within each line, characters are counted from 0; characters 15 through 18 are the year
Characters 25 through 29 are the temperature, where character 25 is the sign (+ or -)
0067011990999991950051507+0000+
0043011990999991950051512+0022+
0043011990999991950051518-0011+
0043012650999991949032412+0111+
0043012650999991949032418+0078+
0067011990999991937051507+0001+
0043011990999991937051512-0002+
0043011990999991945051518+0001+
0043012650999991945032412+0002+
0043012650999991945032418+0078+
Now we need to find the maximum temperature recorded in each year.
Map-reduce consists of two main steps: map and reduce.
Each step takes key-value pairs as input and output:
The format of the key-value pairs in the map phase is determined by the input format. With the default TextInputFormat, each line is processed as one record, where the key is the offset of the line from the start of the file and the value is the text of the line.
The format of the key-value pairs output by the map phase must match the format of the key-value pairs taken as input by the reduce phase.
For the example above, the input key-value pairs to the map process are as follows:
(0,0067011990999991950051507+0000+)
(33,0043011990999991950051512+0022+)
(66,0043011990999991950051518-0011+)
(99,0043012650999991949032412+0111+)
(132,0043012650999991949032418+0078+)
(165,0067011990999991937051507+0001+)
(198,0043011990999991937051512-0002+)
(231,0043011990999991945051518+0001+)
(264,0043012650999991945032412+0002+)
(297,0043012650999991945032418+0078+)
In the map process, each line's string is parsed to obtain year-temperature key-value pairs (a sketch of such a mapper follows this list):
(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)
(1937, 1)
(1937, -2)
(1945, 1)
(1945, 2)
(1945, 78)
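A mapper implementing this parsing step might look like the following. This is a sketch against the old org.apache.hadoop.mapred API, matching the driver shown later, which registers a class named MaxTemperatureMapper:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    // Characters 15 through 18 hold the year (0-indexed, end index exclusive).
    String year = line.substring(15, 19);
    // Characters 25 through 29 hold the temperature; character 25 is the sign.
    int airTemperature;
    if (line.charAt(25) == '+') {
      // Skip the leading plus sign, which parseInt would not accept here.
      airTemperature = Integer.parseInt(line.substring(26, 30));
    } else {
      airTemperature = Integer.parseInt(line.substring(25, 30));
    }
    output.collect(new Text(year), new IntWritable(airTemperature));
  }
}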
Before the reduce process runs, map outputs with the same key are grouped into one list, which forms the input to reduce:
(1950, [0, 22, -11])
(1949, [111, 78])
(1937, [1, -2])
(1945, [1, 2, 78])
In the reduce process, the maximum temperature in each list is selected, and year-maximum-temperature key-value pairs are emitted as the output (a sketch of such a reducer follows this list):
(1950, 22)
(1949, 111)
(1937, 1)
(1945, 78)
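A matching reducer might look like this, again sketched against the old mapred API; MaxTemperatureReducer is the class the driver shown later registers:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Scan the list of temperatures for this year and keep the maximum.
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}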
Its logical process can be represented as the following figure: [figure omitted]

2. Configuration of a Map-reduce job

A second figure (also omitted) outlines how a Map-reduce job runs. Here we discuss JobConf, which has a number of items to configure:
setInputFormat: sets the input format for the map phase; the default is TextInputFormat, with keys of type LongWritable and values of type Text
setNumMapTasks: sets the number of map tasks; this setting usually has no effect, since the number of map tasks is determined by how many InputSplits the input data can be divided into
setMapperClass: sets the Mapper; the default is IdentityMapper
setMapRunnerClass: sets the MapRunnable that drives the map task; the default is MapRunner, which reads the records of an InputSplit one by one and calls the Mapper's map function on each in turn
setMapOutputKeyClass and setMapOutputValueClass: set the key-value pair format of the mapper's output
setOutputKeyClass and setOutputValueClass: set the key-value pair format of the reducer's output
setPartitionerClass and setNumReduceTasks: set the Partitioner; the default is HashPartitioner, which uses the hash of the key to decide which partition a record enters (a sketch of this default behavior follows this list). Each partition is handled by one reduce task, so the number of partitions equals the number of reduce tasks
setReducerClass: sets the Reducer; the default is IdentityReducer
setOutputFormat: sets the output format of the job; the default is TextOutputFormat
FileInputFormat.addInputPath: sets an input path for the job; it may be a file, a directory, or a glob, and the method can be called multiple times to add multiple paths
FileOutputFormat.setOutputPath: sets the output path of the job, which must not exist before the job runs
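As mentioned for setPartitionerClass, the default HashPartitioner chooses a partition from the hash of the key. Below is a sketch of a partitioner with that same hash-and-modulo behavior; the class name YearPartitioner is hypothetical:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of the default HashPartitioner behavior: the partition (and hence
// the reduce task) is chosen from the hash of the key.
public class YearPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {}

  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask the sign bit so the result is non-negative, then take the modulo.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}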
Of course, not all of these need to be set. For the example above, the Map-reduce program can be written as follows:
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");
    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    JobClient.runJob(conf);
  }
}
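Assuming the three classes above are packaged into a jar, say max-temperature.jar (a hypothetical name), the job could then be launched with:

hadoop jar max-temperature.jar MaxTemperature <input path> <output path>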
3. Map-reduce Data flow
The Map-reduce process mainly involves the following four parts:
The client: submits the Map-reduce job
JobTracker: coordinates the run of the whole job; it is a Java process whose main class is JobTracker
TaskTracker: runs the tasks of the job, processing one input split each; it is a Java process whose main class is TaskTracker
HDFS: the Hadoop Distributed File System, used to share job-related files among the processes
3.1. Task Submission
JobClient.runJob() creates a new JobClient instance and calls its submitJob() function, which:
requests a new job ID from the JobTracker
checks the output specification of the job
computes the input splits of the job
copies the resources the job needs (the job JAR file, the job.xml configuration file, and the computed input splits) to a folder in the JobTracker's file system
notifies the JobTracker that the job is ready to run
After the job is submitted, runJob polls the job's progress every second and reports the progress to the command line until the job finishes.
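The runJob call used in the driver wraps exactly this submit-then-poll loop. As an illustration, the same behavior could be driven by hand with the old JobClient API (a sketch; SubmitAndPoll is a hypothetical helper class):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Sketch: submit a job asynchronously and poll its progress once per
// second, similar to what JobClient.runJob does internally.
public class SubmitAndPoll {
  public static void run(JobConf conf) throws Exception {
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);
    while (!job.isComplete()) {
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(1000);
    }
    System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
  }
}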
3.2. Initialization of the task
When the JobTracker receives the submitJob call, it places the job in a queue; the job scheduler takes jobs from the queue and initializes them.
Initialization first creates an object to encapsulate the job's tasks, status, and progress.
Before creating the tasks, the job scheduler first fetches the input splits computed by the JobClient from the shared file system.
It then creates one map task for each input split.
Each task is assigned an ID.
3.3. Task Assignment
The TaskTracker periodically sends a heartbeat to the JobTracker.
In the heartbeat, the TaskTracker tells the JobTracker that it is ready to run a new task, and the JobTracker assigns it one.
Before the JobTracker selects a task for the TaskTracker, it must first choose a job by priority and then pick a task from the highest-priority job.
A TaskTracker has a fixed number of slots for running map tasks and reduce tasks.
The default scheduler fills map task slots before reduce task slots.
When selecting a reduce task, the JobTracker does not choose among candidates but simply takes the next one, because reduce tasks have no data-locality consideration.
3.4. Task execution
Once a TaskTracker has been assigned a task, it runs it as follows.
First, the TaskTracker copies the job's JAR from the shared file system to the TaskTracker's local file system. It also copies any files the job needs from the distributed cache to the local disk.
Second, it creates a local working directory for each task and unpacks the JAR into that directory.
Third, it creates a TaskRunner to run the task.
The TaskRunner launches a new JVM in which to run the task.
The child JVM communicates with the TaskTracker to report the progress of the run.
3.4.1. The map process
The MapRunnable reads the records of the InputSplit one by one, calls the Mapper's map function on each in turn, and collects the output.
The map output is not written directly to disk; it is first written to an in-memory buffer.
When the data in the buffer reaches a certain size, a background thread begins writing the data to disk.
Before the write, the data in memory is divided into multiple partitions by the Partitioner.
Within each partition, the background thread sorts the data in memory by key.
Each flush from memory to disk produces a new spill file.
When the map task ends, all spill files are merged into a single file that is both partitioned and sorted.
Reducers fetch a map's output file over HTTP; tasktracker.http.threads sets the number of HTTP server threads.
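The partition-sort-spill behavior just described can be pictured with a small plain-Java simulation. This is not Hadoop's actual spill code; it only demonstrates the idea of bucketing map output by partition and keeping each bucket sorted by key:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: map output is bucketed by partition, and each bucket
// is kept sorted by key before being "spilled" to disk.
public class SpillSimulation {
  public static void main(String[] args) {
    int numPartitions = 2;
    String[][] mapOutput = {
        {"1950", "0"}, {"1949", "111"}, {"1950", "22"}, {"1945", "1"}
    };

    // One sorted (key -> values) structure per partition, mimicking the
    // in-memory sort that precedes each spill.
    List<TreeMap<String, List<String>>> partitions = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) {
      partitions.add(new TreeMap<String, List<String>>());
    }

    for (String[] record : mapOutput) {
      String key = record[0];
      String value = record[1];
      // Same rule as the default HashPartitioner sketched earlier.
      int p = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      TreeMap<String, List<String>> bucket = partitions.get(p);
      if (!bucket.containsKey(key)) {
        bucket.put(key, new ArrayList<String>());
      }
      bucket.get(key).add(value);
    }

    // A real spill would now write partition by partition, with the records
    // inside each partition ordered by key.
    for (int p = 0; p < numPartitions; p++) {
      System.out.println("partition " + p + ": " + partitions.get(p));
    }
  }
}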
3.4.2. The reduce process
When a map task ends, it notifies its TaskTracker, and the TaskTracker notifies the JobTracker.
Thus, for a given job, the JobTracker knows the correspondence between TaskTrackers and map outputs.
In the reducer, a thread periodically asks the JobTracker for the locations of map outputs until it has obtained them all.
A reduce task needs the outputs for its partition from all the map tasks.
The copy phase of the reduce task starts fetching each map output as soon as that map task ends, because different map tasks finish at different times.
The reduce task has multiple copy threads and can copy map outputs in parallel.
As map outputs accumulate at the reduce task, a background thread merges them into larger, sorted files.
When all map outputs have been copied to the reduce task, the sort phase begins, merging all map outputs into one large sorted input (see the merge sketch below).
Finally, the reduce phase runs: the reducer's reduce function is called for each key of the sorted input, and the final results are written to HDFS.
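The sort phase mentioned above is essentially a k-way merge of already-sorted runs. Below is a minimal plain-Java sketch of that idea, not Hadoop's implementation:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: a k-way merge of sorted runs, the core idea behind
// merging copied map outputs into one sorted stream for the reducer.
public class KWayMergeSketch {
  public static void main(String[] args) {
    final List<List<String>> runs = Arrays.asList(
        Arrays.asList("1937", "1949", "1950"),
        Arrays.asList("1945", "1950"),
        Arrays.asList("1937", "1945"));

    // Heap entries are {runIndex, positionInRun}, ordered by current key.
    PriorityQueue<int[]> heap = new PriorityQueue<int[]>(runs.size(),
        new Comparator<int[]>() {
          public int compare(int[] a, int[] b) {
            return runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1]));
          }
        });
    for (int r = 0; r < runs.size(); r++) {
      heap.add(new int[] {r, 0});
    }

    List<String> merged = new ArrayList<String>();
    while (!heap.isEmpty()) {
      int[] e = heap.poll();
      merged.add(runs.get(e[0]).get(e[1]));
      if (e[1] + 1 < runs.get(e[0]).size()) {
        heap.add(new int[] {e[0], e[1] + 1});
      }
    }
    System.out.println(merged); // all keys in globally sorted order
  }
}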
3.5. End of Task
When the JobTracker receives the success report for the job's last task, it changes the job's status to successful.
When the JobClient's polling finds that the job has ended successfully, it prints a message to the user and returns from the runJob function.
Original link: http://blog.csdn.net/u011340807/article/details/24630467