Additional MapReduce Functions
Figure 4.6 shows the MapReduce data flow with a combiner step inserted.
Combiner: the pipeline shown above omits a step that can optimize the bandwidth used by a MapReduce job. This step, called the combiner, runs after the mapper and before the reducer. The combiner is optional; if it suits your job, a combiner instance runs on every node that runs map tasks. The combiner receives the output of the Mapper instances on a given node as its input, and its output, rather than the mappers' output, is what gets sent to the reducers. The combiner is a "mini reduce" process that operates only on data generated by a single machine.
Word-frequency counting is a basic example of how useful a combiner can be. The word count program above emits a (word, 1) key-value pair for every word it sees, so if "cat" appears three times in the same document, the pair ("cat", 1) is emitted three times, and all of those pairs are sent to the reducers. With a combiner, they can be condensed into a single pair, ("cat", 3), before being sent. Each node now sends only one value per word to the reducers, which greatly reduces the bandwidth needed by the shuffle phase and speeds up job execution. Best of all, we get this without writing any additional code: if your reduce function is commutative and associative, it can also serve as the combiner. You only need to add the following line to the driver to enable the combiner in the word count program:
conf.setCombinerClass(Reduce.class);
The combiner must be an implementation of the Reducer interface. If your reducer is not commutative and associative and therefore cannot double as the combiner, you can still write a separate class to act as your job's combiner.
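To put the pieces together, here is a minimal, self-contained sketch of such a word count job written against the classic org.apache.hadoop.mapred API used in this chapter. The class names and the use of command-line arguments for the input and output paths are illustrative, not taken from the original program. Because summing counts is commutative and associative, the same Reduce class is registered as both the combiner and the reducer.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // Emits (word, 1) for every word in an input line.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, ONE);
          }
        }
      }

      // Sums the counts for a word. Summing is commutative and associative,
      // so this class is safe to run as a combiner on each map node as well
      // as in the reduce phase.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // the "mini reduce" on each map node
        conf.setReducerClass(Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }

Removing the setCombinerClass line would leave the job's results unchanged; only the amount of intermediate data shuffled to the reducers would grow.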
Fault Tolerance
One of the main reasons for using Hadoop to run your jobs is its high degree of fault tolerance. Even when a job runs on a large cluster whose nodes or network have a high failure rate, Hadoop can still drive the job to successful completion.
The primary way Hadoop achieves fault tolerance is by re-executing tasks. Each individual task node (TaskTracker) is in constant communication with the system's master node (the JobTracker). If a TaskTracker fails to communicate with the JobTracker for a period of time (by default, 1 minute), the JobTracker assumes that the TaskTracker has failed. The JobTracker knows which map and reduce tasks were assigned to each TaskTracker.
If the job is still in the map phase, other TaskTrackers are asked to re-execute all map tasks previously run by the failed TaskTracker. If the job is in the reduce phase, other TaskTrackers re-execute the reduce tasks that were in progress on the failed TaskTracker.
Reduce tasks write their output to HDFS as they complete. Thus, if a TaskTracker has already finished two of the three reduce tasks assigned to it, only the third needs to be re-executed. Map tasks are more complicated: even if a node has completed ten map tasks, the reducers may not yet have copied all of the output of those tasks. If the node crashes, its mapper output becomes inaccessible, so any completed map tasks must be re-executed to make their results available to the remaining reduce machines. The Hadoop platform handles all of this automatically.
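Not covered in the text above, but related to task re-execution: the classic JobConf API also lets a job cap how many times an individual task attempt is retried after a failure before the job as a whole is declared failed. A hedged sketch, with illustrative values rather than recommendations:

    import org.apache.hadoop.mapred.JobConf;

    public class RetryLimits {
      // Caps on how many attempts a single map or reduce task gets before
      // the whole job is marked failed. The values shown are illustrative.
      public static void configure(JobConf conf) {
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);
      }
    }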
This style of fault tolerance underscores the need for program execution to be free of side effects. If mappers and reducers had their own identities and communicated with one another or with the outside world, then restarting a task would require the other nodes to communicate with the new map or reduce task instances, and the restarted tasks might need to re-establish their intermediate state. That process is complex and error-prone. MapReduce simplifies the problem dramatically by eliminating task identities and communication between tasks: an individual task sees only its own input and output, which makes failure and restart clean and reliable.
Speculative execution: one problem with the Hadoop system is that by distributing tasks across many nodes, a few slow nodes can limit the rate at which the rest of the program runs. For example, if one node has a slow disk controller, it may read its input at only 10% of the speed of the other nodes. So while 99 map tasks are already complete, the system is still waiting for the final, slow map task to finish.
Because tasks are forced to run in isolation from one another, an individual task does not know where its input comes from; it simply trusts the Hadoop platform to deliver the appropriate input. The same input can therefore be processed multiple times in parallel, to exploit differences in machine capability. As most of the tasks in a job are finishing, the Hadoop platform schedules redundant copies of the remaining tasks across several idle nodes. This process is known as speculative execution. When a task completes, it notifies the JobTracker; whichever copy of a task finishes first becomes the definitive copy. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their output, and the reducers then receive their input from whichever mapper completed first.
Speculative execution is enabled by default. You can disable it for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
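In the driver, that might look as follows. This is a minimal sketch; the helper class name is illustrative, and the typed setters mentioned in the comment are the JobConf equivalents of the two properties.

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationSettings {
      // Turn off backup (speculative) copies of map and reduce tasks for a job.
      public static void disableSpeculation(JobConf conf) {
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // Equivalent typed setters on JobConf:
        //   conf.setMapSpeculativeExecution(false);
        //   conf.setReduceSpeculativeExecution(false);
      }
    }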