1. Logical Process of Map-Reduce
Assume that we need to process a batch of weather data in the following format:
- Stored as ASCII text, one record per line
- Counting characters from 0, characters 15 to 18 of each line are the year
- Characters 25 to 29 are the temperature, where character 25 is the sign (+ or -)
0067011990999991950051507+0000+
0043011990999991950051512+0022+
0043011990999991950051518-0011+
0043012650999991949032412+0111+
0043012650999991949032418+0078+
0067011990999991937051507+0001+
0043011990999991937051512-0002+
0043011990999991945051518+0001+
0043012650999991945032412+0002+
0043012650999991945032418+0078+
Now we need to calculate the highest temperature each year.
Map-Reduce involves two steps: Map and Reduce.
Each step has a key-value pair as the input and output:
- The format of the input key-value pairs of the map stage is determined by the input format. With the default TextInputFormat, each line is processed as one record: the key is the byte offset of the line from the beginning of the file, and the value is the text of the line.
- The format of the key-value pairs output by the map stage must match the format of the key-value pairs input to the reduce stage.
For the above example, the input key-value pairs of the map process are as follows:
(0, 0067011990999991950051507+0000+)
(33, 0043011990999991950051512+0022+)
(66, 0043011990999991950051518-0011+)
(99, 0043012650999991949032412+0111+)
(132, 0043012650999991949032418+0078+)
(165, 0067011990999991937051507+0001+)
(198, 0043011990999991937051512-0002+)
(231, 0043011990999991945051518+0001+)
(264, 0043012650999991945032412+0002+)
(297, 0043012650999991945032418+0078+)
In the map process, each line is parsed to produce a (year, temperature) key-value pair as the output:
(1950, 0) (1950, 22) (1950, -11) (1949, 111) (1949, 78) (1937, 1) (1937, -2) (1945, 1) (1945, 2) (1945, 78)
Before the reduce process, the map outputs with the same key are grouped into a list, which becomes the reduce input:
(1950, [0, 22, -11]) (1949, [111, 78]) (1937, [1, -2]) (1945, [1, 2, 78])
In the reduce process, the maximum temperature in each list is selected, and the (year, maximum temperature) key-value pair is the output:
(1950, 22) (1949, 111) (1937, 1) (1945, 78)
The logical process can be summarized as follows: the map step parses each record into a (year, temperature) pair, the framework groups the pairs by year, and the reduce step selects the maximum temperature for each year.
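To make this concrete, here is a small plain-Java sketch (no Hadoop involved; the class name MaxTemperatureLogic and the three hard-coded records are just for illustration) that mirrors the map, group, and reduce steps above:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MaxTemperatureLogic {
    public static void main(String[] args) {
        // a few sample records in the format described above
        String[] records = {
            "0067011990999991950051507+0000+",
            "0043011990999991950051512+0022+",
            "0043011990999991950051518-0011+"
        };
        // "map" + "shuffle": parse each record into (year, temperature)
        // and group the temperatures by year
        Map<String, List<Integer>> byYear = new HashMap<>();
        for (String line : records) {
            String year = line.substring(15, 19);
            int temp = line.charAt(25) == '+'
                    ? Integer.parseInt(line.substring(26, 30))
                    : Integer.parseInt(line.substring(25, 30));
            byYear.computeIfAbsent(year, k -> new ArrayList<>()).add(temp);
        }
        // "reduce": take the maximum temperature of each year
        for (Map.Entry<String, List<Integer>> e : byYear.entrySet()) {
            System.out.println(e.getKey() + "\t" + Collections.max(e.getValue()));
        }
    }
}

In a real Map-Reduce job these three steps are split between the Mapper, the framework's shuffle, and the Reducer, as shown in the next section.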
2. Writing a Map-Reduce Program
To write a Map-Reduce program, we generally need to implement two functions: the map function of the Mapper and the reduce function of the Reducer.
Generally, the following format is used:
- Map: (K1, V1) -> list(K2, V2)
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
        throws IOException;
}
- Reduce: (K2, list(V2)) -> list(K3, V3)
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
        throws IOException;
}
For the above example, the mapper is implemented as follows:
public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(25) == '+') {
            airTemperature = Integer.parseInt(line.substring(26, 30));
        } else {
            airTemperature = Integer.parseInt(line.substring(25, 30));
        }
        output.collect(new Text(year), new IntWritable(airTemperature));
    }
}
The Reducer implementation is as follows:
public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}
To run the Mapper and Reducer implemented above, you need to create a Map-Reduce job, which consists of the following three parts:
- Input data, that is, the data to be processed
- Map-Reduce program, that is, the Mapper and Reducer implemented above
- JobConf
To configure JobConf, you need a general understanding of how Hadoop runs jobs:
- Hadoop divides a job into tasks for processing. There are two types of tasks: map tasks and reduce tasks.
- Hadoop has two types of nodes to control job running: JobTracker and TaskTracker.
- JobTracker coordinates the running of the entire job and assigns tasks to different TaskTrackers.
- TaskTracker runs the task and returns the result to JobTracker.
- Hadoop divides the input data into fixed-size pieces called input splits.
- Hadoop creates one map task for each input split, and the task processes the records in its split in sequence.
- Hadoop tries to run each map task on the DataNode that holds its input data block (every DataNode also runs a TaskTracker), which improves running efficiency; for this reason, the size of an input split is generally the size of an HDFS block.
- The input of a Reduce task is generally the output of a Map Task, and the output of a Reduce Task is the output of the entire job, which is stored on HDFS.
- In the reduce stage, all records with the same key are processed on the same TaskTracker, while different keys can be processed on different TaskTrackers; this division is called partitioning.
- The partition rule is (K2, V2) -> Integer: a partition id is generated from K2, keys with the same id go to the same partition, and each partition is processed by the same Reducer on the same TaskTracker (a concrete sketch follows the interface below).
public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
}
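For illustration, a minimal partitioner for the weather example might look like the sketch below. The class name YearPartitioner is hypothetical (it is not part of the original example), and its hashing logic is essentially what the default HashPartitioner does:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative partitioner (name and logic are an assumption, not from the
// original article): all records with the same year go to the same partition.
public class YearPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // no per-job configuration needed for this simple example
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // hash the year and map it into [0, numPartitions)
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered with conf.setPartitionerClass(YearPartitioner.class), as mentioned in the JobConf list below.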
These are the basic principles of how a Map-Reduce job runs.
Next we will discuss JobConf, which has many items that can be configured:
- setInputFormat: sets the input format of the map stage. The default is TextInputFormat, whose key is LongWritable and whose value is Text.
- setNumMapTasks: sets the number of map tasks. This setting usually has no effect, because the number of map tasks is determined by the number of input splits.
- setMapperClass: sets the Mapper. The default is IdentityMapper.
- setMapRunnerClass: sets the MapRunner class. Map tasks are run by a MapRunner, and the default implementation reads the records of the input split and calls the Mapper's map function on each record in sequence.
- setMapOutputKeyClass and setMapOutputValueClass: set the format of the key-value pairs output by the Mapper.
- setOutputKeyClass and setOutputValueClass: set the format of the key-value pairs output by the Reducer.
- setPartitionerClass and setNumReduceTasks: set the Partitioner and the number of reduce tasks. The default is HashPartitioner, which determines the partition from the hash value of the key. Each partition is processed by one reduce task, so the number of partitions equals the number of reduce tasks (see the sketch after this list).
- setReducerClass: sets the Reducer. The default is IdentityReducer.
- setOutputFormat: sets the output format of the job. The default is TextOutputFormat.
- FileInputFormat.addInputPath: sets the input path, which can be a file, a directory, or a wildcard pattern. It can be called multiple times to add multiple paths.
- FileOutputFormat.setOutputPath: sets the output path, which must not exist before the job runs.
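As an illustration only, a few of these setters could be combined in the driver like this. This fragment is a sketch, not part of the original weather job: the choice of two reduce tasks and the YearPartitioner sketched earlier are assumptions, and any setter left out keeps its default.

// Hypothetical additions inside the driver's main method (see the full driver below);
// the values chosen here are for illustration only.
JobConf conf = new JobConf(MaxTemperature.class);
conf.setInputFormat(TextInputFormat.class);       // the default, shown explicitly
conf.setMapOutputKeyClass(Text.class);            // key type output by the Mapper
conf.setMapOutputValueClass(IntWritable.class);   // value type output by the Mapper
conf.setPartitionerClass(YearPartitioner.class);  // the illustrative partitioner sketched above
conf.setNumReduceTasks(2);                        // number of partitions == number of reduce tasks
conf.setOutputFormat(TextOutputFormat.class);     // the default, shown explicitly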
Of course, you do not need to set all of them. In the above example, you can write the Map-Reduce program as follows:
public class MaxTemperature {
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
    }
}
3. Map-Reduce Data Flow
The process of Map-Reduce mainly involves the following four parts:
- Client: submits the Map-Reduce job
- JobTracker: coordinates the running of the entire job; it is a Java process whose main class is JobTracker
- TaskTracker: runs the tasks of the job and processes the input splits; it is a Java process whose main class is TaskTracker
- HDFS: the Hadoop Distributed File System, used to share job-related files among the processes above
3.1 Submitting a Job
JobClient.runJob() creates a new JobClient instance and calls its submitJob() method, which does the following:
- Request a new job ID from JobTracker
- Check output configuration of this job
- Calculate the input splits of this job
- Copy the resources required to run the job, including the job jar file, the job.xml configuration file, and the input splits, to a folder in JobTracker's file system
- Notify JobTracker that the Job is ready to run
After the job is submitted, runJob polls the job's progress every second and reports it to the command line until the job is completed.
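For illustration, the blocking JobClient.runJob(conf) call used in the driver above can be replaced by submitting the job and polling its progress yourself. This is a minimal sketch using the old mapred API; the helper class name SubmitAndPoll is made up:

import java.io.IOException;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndPoll {
    public static void run(JobConf conf) throws IOException, InterruptedException {
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);   // returns immediately after submission
        while (!job.isComplete()) {
            // print the map and reduce progress roughly once per second
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(1000);
        }
        System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
    }
}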
3.2 Task Initialization
When JobTracker receives a submitJob call, it puts the job in a queue. The job scheduler takes the job from the queue and initializes it.
First, it creates an object to encapsulate the job's tasks, status, and progress.
Before creating tasks, the job scheduler obtains from the shared file system the input splits calculated by JobClient.
It then creates one map task for each input split.
Each task is assigned an ID.
3.3 Task Allocation
TaskTracker periodically sends a heartbeat to JobTracker.
In the heartbeat, TaskTracker tells JobTracker that it is ready to run a new task, and JobTracker assigns it one.
Before assigning a task to a TaskTracker, JobTracker first selects a job based on priority and then picks a task from the highest-priority job.
TaskTracker has a fixed number of slots for running map tasks and reduce tasks.
The default scheduler assigns map tasks before reduce tasks.
When selecting a reduce task, JobTracker does not choose among candidates but simply takes the next one, because reduce tasks have no data-locality requirement.
3.4 Task Execution
Once TaskTracker is assigned a task, it runs the task as follows.
First, TaskTracker copies the job's jar from the shared file system to its local file system. It also copies the files the job needs from the distributed cache to the local disk.
Second, it creates a local working directory for the task and unpacks the jar into that directory.
Third, it creates a TaskRunner to run the task.
TaskRunner creates a new JVM to run the task.
The created child JVM communicates with TaskTracker to report the running progress.
3.4.1 Map Process
MapRunnable reads the records from the input split, calls the Mapper's map function on each record in sequence, and outputs the results.
Map output is not written directly to disk but to an in-memory buffer.
When the data in the buffer reaches a certain size, a background thread writes the data to the hard disk.
Before being written to disk, the data in memory is divided into multiple partitions by the partitioner.
Within each partition, the background thread sorts the data by key in memory.
Each time data is flushed from memory to disk, a new spill file is generated.
Before the task ends, all spill files are merged into a single partitioned and sorted file.
Reducers fetch the map output files over HTTP; the tracker.http.threads property sets the number of HTTP server threads.
3.4.2 Reduce Process
After a map task completes, it notifies its TaskTracker, and the TaskTracker notifies JobTracker.
For a given job, JobTracker therefore knows the mapping between TaskTrackers and map outputs.
A thread in the Reducer periodically asks JobTracker for the locations of map outputs until it has obtained all of them.
Each reduce task needs the map output for its own partition from every map task.
The reduce task starts copying a map task's output as soon as that map task finishes, because different map tasks complete at different times.
Reduce tasks have multiple copy threads, which can copy map output in parallel.
As map outputs are copied to the reduce task, a background thread merges them into larger sorted files.
After all map outputs have been copied, the sort phase begins and merges all map outputs into one large sorted file.
Finally comes the reduce phase: the Reducer's reduce function is called for each key in sorted order, and the final result is written to HDFS.
3.5 Job Completion
After JobTracker receives the success report for the last task, it changes the job status to successful.
When JobClient polls JobTracker and finds that the job has completed successfully, it prints a message to the user and returns from the runJob function.