1 Hadoop Pipeline Improvement Ideas
In the Hadoop implementation, the output of the map side is first written to the local disk; the JobTracker is notified when the map task completes, and the reduce side, after receiving the JobTracker's notification, issues an HTTP request to pull back the output of the corresponding map side (the copy phase). A reduce task can therefore begin only after the map tasks have completed, and the execution of the map tasks and the reduce tasks is strictly separated. A minimal sketch of this pull model follows.
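The sketch below illustrates the traditional pull model in plain Java. It is illustrative only: the URL layout and the `map`/`reduce` parameter names are assumptions for the sketch, not Hadoop's actual shuffle servlet interface.

```java
// A minimal sketch of the pull-based copy phase (assumed URL layout, not
// Hadoop's real servlet interface): the reducer fetches a completed map
// task's output for its own partition over HTTP.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PullShuffleSketch {
    // Fetch one map task's output for the given reduce partition.
    static byte[] fetchMapOutput(String mapHost, int port,
                                 String mapTaskId, int reducePartition) throws Exception {
        URL url = new URL("http://" + mapHost + ":" + port
                + "/mapOutput?map=" + mapTaskId + "&reduce=" + reducePartition);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes(); // copy phase: pull the map output back
        } finally {
            conn.disconnect();
        }
    }
}
```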
Our improvement idea is to make the map tasks and reduce tasks work in a pipelined way: as soon as a map task starts producing output, it sends it directly to the appropriate reduce task. This requires the JobTracker to assign both the map tasks and the reduce tasks as soon as the user submits the job, and to send the location of each map task to the reduce tasks. After each reduce task starts, it contacts each map task and opens a socket. As each map output record is generated, the mapper determines its partition (reduce task) and sends it directly over the corresponding socket.
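A minimal mapper-side sketch of this push model, assuming the JobTracker has already told the mapper where each reduce task runs (the class and field names here are illustrative, not HOP's actual code):

```java
// A minimal sketch of the push idea: the mapper keeps one open socket per
// reduce task and sends each output record straight down the socket for
// the partition chosen by its key, instead of spilling it to local disk.
import java.io.DataOutputStream;
import java.net.Socket;

public class PipelinedMapperSketch {
    private final DataOutputStream[] reducePipes; // one open socket per reduce task

    PipelinedMapperSketch(String[] reduceHosts, int[] reducePorts) throws Exception {
        reducePipes = new DataOutputStream[reduceHosts.length];
        for (int i = 0; i < reduceHosts.length; i++) {
            Socket s = new Socket(reduceHosts[i], reducePorts[i]);
            reducePipes[i] = new DataOutputStream(s.getOutputStream());
        }
    }

    // Called for every record the map function emits.
    void emit(String key, String value) throws Exception {
        int partition = (key.hashCode() & Integer.MAX_VALUE) % reducePipes.length;
        DataOutputStream pipe = reducePipes[partition];
        pipe.writeUTF(key);   // send the record directly to its reduce task
        pipe.writeUTF(value);
    }
}
```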
Each reduce task stores the pipelined data it receives from the map tasks in an in-memory buffer, writing sorted buffer contents to disk as needed. Once the reduce task learns that every map task has completed, it performs a final merge of the sorted contents and then invokes the user-defined reduce function, writing the output to HDFS.
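A reducer-side sketch of this buffering scheme (the structure, threshold value, and method names are assumptions for illustration, not HOP's implementation; the HDFS write and the on-disk merge are elided):

```java
// A minimal reducer-side sketch: records arriving from the map sockets
// accumulate in an in-memory sorted buffer; when the buffer passes a
// threshold it is spilled to disk as a sorted run, and once all maps are
// complete the runs are merged and the user-defined reduce is applied.
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PipelinedReducerSketch {
    private final TreeMap<String, List<String>> buffer = new TreeMap<>(); // sorted by key
    private int bufferedRecords = 0;
    private static final int SPILL_THRESHOLD = 100_000; // assumed tuning knob

    // Called whenever a record arrives on one of the map sockets.
    void receive(String key, String value) {
        buffer.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        if (++bufferedRecords >= SPILL_THRESHOLD) {
            spillSortedBufferToDisk(); // TreeMap already keeps keys in order
        }
    }

    // Once every map task is known to be complete: merge spills, then reduce.
    void finish() {
        mergeSpillsWithBuffer(); // k-way merge of the sorted runs; omitted
        for (var entry : buffer.entrySet()) {
            reduce(entry.getKey(), entry.getValue()); // user-defined reduce
        }
    }

    void spillSortedBufferToDisk() { /* write a sorted run; omitted */ bufferedRecords = 0; buffer.clear(); }
    void mergeSpillsWithBuffer()  { /* omitted */ }
    void reduce(String key, List<String> values) { /* write output to HDFS; omitted */ }
}
```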
2 Problems and Solutions
The improvement described above faces several practical problems; we analyze each below and propose a solution.
(1) The Hadoop system may not have enough free task slots to schedule all of a job's tasks at once.
Because task slots are limited, some reduce tasks may not yet have been scheduled, so part of the map output cannot be sent directly. The fix is to write that part of the map output to disk; once the affected reduce task is assigned a task slot, it copies the data from the map task just as in the stock Hadoop system. A sketch of this fallback follows.
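A hedged sketch of the fallback (the names and structure are assumptions for illustration):

```java
// If a partition's reduce task has not yet been scheduled to a slot, its
// output is spilled to local disk exactly as in stock Hadoop and pulled
// later; otherwise the record is pushed down the open socket.
import java.io.DataOutputStream;

public class FallbackEmitSketch {
    DataOutputStream[] pipes; // a null entry means that reducer is not yet scheduled

    void emit(int partition, String key, String value) throws Exception {
        DataOutputStream pipe = pipes[partition];
        if (pipe != null) {
            pipe.writeUTF(key);   // pipelined path
            pipe.writeUTF(value);
        } else {
            // Traditional path: the reducer will copy this spill over HTTP
            // once it gets a task slot.
            writeToLocalSpill(partition, key, value);
        }
    }

    void writeToLocalSpill(int partition, String key, String value) { /* omitted */ }
}
```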
(2) Opening a socket between every map task and every reduce task requires a large number of TCP connections.
So many TCP connections would consume too much network bandwidth and easily cause congestion. To reduce the number of concurrent TCP connections, each reducer can be configured to pipeline data from only a limited number of mappers, pulling the rest of the data back from the remaining mappers in the traditional Hadoop way, as in the sketch below.
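A minimal sketch of this bounded-connection compromise (the limit and all names are assumptions for illustration):

```java
// Each reducer pipelines from at most MAX_PIPELINED_MAPPERS mappers and
// falls back to the traditional HTTP pull for the rest, capping the
// number of concurrent TCP connections it holds open.
import java.util.List;

public class BoundedPipelineSketch {
    static final int MAX_PIPELINED_MAPPERS = 20; // assumed configuration value

    // Split the map task list: the first slice is pipelined, the rest pulled.
    static void planConnections(List<String> mapTaskIds) {
        int cut = Math.min(MAX_PIPELINED_MAPPERS, mapTaskIds.size());
        List<String> pipelined = mapTaskIds.subList(0, cut);
        List<String> pulled = mapTaskIds.subList(cut, mapTaskIds.size());
        pipelined.forEach(id -> openSocketTo(id));    // bounded TCP fan-in
        pulled.forEach(id -> scheduleHttpPull(id));   // stock Hadoop copy phase
    }

    static void openSocketTo(String mapTaskId)     { /* omitted */ }
    static void scheduleHttpPull(String mapTaskId) { /* omitted */ }
}
```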
(3) The map function is called and its output written to the pipeline socket by the same thread.
This can lead to a situation where network I/O blocks because a reducer is overloaded, leaving the mapper unable to do useful work. The fix is to run the map function in a separate thread that stores its output in an in-memory buffer, while another thread periodically sends the buffer contents down the pipeline to the reducers; a sketch follows.
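A minimal sketch of this decoupling (an assumption about the structure, not HOP's code), using a standard producer-consumer pattern:

```java
// The map function runs in one thread and only appends to a bounded
// in-memory queue; a separate sender thread drains the queue into the
// pipeline sockets, so a slow reducer blocks the sender, not the map work.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TwoThreadMapperSketch {
    private final BlockingQueue<String> outputBuffer = new ArrayBlockingQueue<>(10_000);

    void start() {
        // Thread 1: runs the user map function, which calls emit().
        Thread mapThread = new Thread(this::runMapFunction);
        // Thread 2: performs all network I/O on behalf of the mapper.
        Thread sendThread = new Thread(() -> {
            try {
                while (true) {
                    String record = outputBuffer.take(); // blocks when buffer is empty
                    sendToReducer(record);               // network I/O stays here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        mapThread.start();
        sendThread.start();
    }

    void emit(String record) throws InterruptedException {
        outputBuffer.put(record); // the map thread never touches the socket
    }

    void runMapFunction() { /* user map logic; omitted */ }
    void sendToReducer(String record) { /* socket write; omitted */ }
}
```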
(4) Sending map output eagerly prevents the use of a combiner and shifts some sorting work from the mapper to the reducer.
If map output is pipelined to the corresponding reduce task as soon as it is generated, there is no opportunity to apply the combiner function, which increases network traffic. At the same time, moving the map-side sort to the reduce task lengthens response time and adds significant overhead, because the number of map tasks is usually much larger than the number of reduce tasks. The fix is to modify the in-memory buffer design: instead of sending the buffer contents directly to the reducer, wait until the buffer grows to a threshold size, apply the combiner function, sort the contents by partition and key, and spill them to disk. A second thread monitors the spill files and sends them to the reducers in a pipelined manner. If the reducers can keep up with the mappers and the network is not a bottleneck, each spill file is sent to the reducers as soon as it is generated; otherwise the spill files gradually accumulate, and the mapper periodically applies the combiner function to merge multiple spill files into a single larger file. A sketch of this adaptive scheme follows.
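A hedged sketch of the adaptive spill-and-combine scheme (the structure, threshold, and names are assumptions for illustration, not HOP's code):

```java
// Records accumulate until the buffer threshold, then the combiner is
// applied and the run is sorted by (partition, key) and spilled; a monitor
// thread ships finished spills to the reducers, and if spills pile up
// faster than the network drains them, several are merged into one.
import java.util.ArrayDeque;
import java.util.Deque;

public class AdaptiveSpillSketch {
    private final Deque<String> spillFiles = new ArrayDeque<>(); // finished spill paths
    private static final int MERGE_TRIGGER = 4; // assumed backlog threshold

    // Called when the in-memory buffer reaches its threshold size.
    synchronized void onBufferFull() {
        applyCombiner();               // shrink the data before it leaves the node
        String spill = sortAndSpill(); // sorted by partition, then by key
        spillFiles.addLast(spill);
    }

    // Monitor thread: ship spills while reducers keep up, merge otherwise.
    synchronized void monitorTick() {
        if (spillFiles.size() >= MERGE_TRIGGER) {
            // Reducers are falling behind: re-apply the combiner across spills.
            String merged = mergeSpillsWithCombiner(spillFiles);
            spillFiles.clear();
            spillFiles.addLast(merged);
        }
        if (!spillFiles.isEmpty() && networkAvailable()) {
            sendSpillToReducers(spillFiles.pollFirst()); // pipelined send
        }
    }

    void applyCombiner() { /* omitted */ }
    String sortAndSpill() { /* omitted */ return "spill-0"; }
    String mergeSpillsWithCombiner(Deque<String> spills) { /* omitted */ return "merged-0"; }
    boolean networkAvailable() { return true; }
    void sendSpillToReducers(String spillFile) { /* omitted */ }
}
```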
3 Implementation of the Improved System
Tyson Condie and colleagues at UC Berkeley implemented the Hadoop Online Prototype (HOP) [13] based on the MapReduce Online paper. In addition to pipelining from mapper to reducer within a job, HOP uses "snapshot" technology to pipeline execution between jobs, from one job's reducers to the next job's mappers. HOP also supports continuous queries, which enables MapReduce programs to be used for real-time applications such as event monitoring and stream processing. At the same time, HOP retains Hadoop's fault-tolerance features and can run unmodified user-defined MapReduce programs.
The data flow implemented by HOP, compared with that of the stock Hadoop system, is shown in the following illustration:
In Hadoop-0.19.2, the org.apache.hadoop.mapred package implements Hadoop's MapReduce model. HOP adds an org.apache.hadoop.mapred.bufmanager package to manage the input and output of the map and reduce tasks; its main classes are shown in the following table:
The HOP system can successfully run MapReduce jobs in pseudo-distributed mode, but when the WordCount application is executed after deployment on a cluster, the job stalls indefinitely once the map phase finishes and the reduce phase reaches 25%, displaying the error shown in the following illustration:
We modified Hadoop-0.19.2 with reference to the HOP code and compiled it with Ant; the result behaved the same as HOP, with MapReduce jobs likewise stalling when executed on the cluster. Analyzing the cause, if the problem is not in the HOP implementation itself, it may lie in the configuration of the experimental cluster or in the network, but the specific reason has not been found and resolved.
The test of this Hadoop optimization was performed using the HOP system, which pipelines the execution of the map and reduce phases. According to the MapReduce Online paper, the authors ran a performance experiment on a 60-node Amazon EC2 cluster, sorting 5.5 GB of data extracted from Wikipedia, as shown in the following figure:
Their experimental results show that HOP outperforms Hadoop, greatly reducing job completion time and achieving higher system utilization.
However, because HOP failed on our cluster, we verified its optimization effect by running the WordCount program in pseudo-distributed mode and comparing it with the original program on the Hadoop system: the execution times were 314 seconds (HOP) and 266 seconds (Hadoop). The progress of the map and reduce phases is shown in Figure 1 and Figure 2 below. The comparison shows that the HOP system does pipeline the execution of the map and reduce phases, but its execution time is longer, which differs from the comparison chart above. This may be due to the HOP implementation, the cluster size and configuration, the data being processed, and other factors. Nevertheless, HOP's improvement idea is well worth learning from.
"Recommended reading": 1. Using MapReduce and load balancing in the cloud
2.Hadoop Technology Center