A large number of efficient MapReduce programs can be written with limited effort: beyond preparing the input data, the programmer needs only to implement the mapper and reducer interfaces, and optionally a combiner and a partitioner. All other aspects of execution are handled transparently by the execution framework, on clusters ranging from a single node to a few thousand nodes, over datasets ranging from gigabytes to petabytes. However, this means that any algorithm must be expressed in terms of these rigidly defined components, which must fit together in very specific ways, and many algorithms are not easy to translate into this programming model. This chapter aims to provide examples of MapReduce algorithm design. These examples illustrate what can be thought of as MapReduce design patterns: arrangements of components and specific techniques for solving recurring problems across a variety of domains. Two of these design patterns appear in the scalable inverted indexing algorithm of Chapter 4, and the concepts presented here are also used in Chapter 5 (graph processing) and Chapter 6 (expectation-maximization algorithms).
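As a concrete illustration of the mapper and reducer interfaces, the following is a minimal word-count sketch written against Hadoop's Java API (the job-submission boilerplate is omitted, and the class names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Mapper: emits (word, 1) for every token in an input line.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);   // intermediate pair (token, 1)
            }
          }
        }
      }

      // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          context.write(word, new IntWritable(sum));
        }
      }
    }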
Synchronization is perhaps the trickiest aspect of designing MapReduce algorithms (or, for that matter, parallel and distributed algorithms in general). Other than embarrassingly parallel problems, processes running on separate nodes in a cluster must, at some point in time, come together, for example to distribute partial results to the nodes that need them. Within a single MapReduce job, there is only one opportunity for cluster-wide synchronization: during the shuffle and sort stage, when intermediate key-value pairs are copied from the mappers to the reducers and grouped by key. Beyond that, mappers and reducers run in isolation, with no mechanism for direct communication. Furthermore, programmers have little control over many aspects of execution, such as:
Where a mapper or reducer runs (i.e., on which node in the cluster).
When a mapper or reducer begins or finishes.
Which input key-value pairs are processed by a specific mapper.
Which intermediate key-value pairs are processed by a specific reducer.
Nevertheless, a number of techniques can be used to control execution and manage the flow of data in a MapReduce program. In summary, they are:
1. The ability to construct complex data structures as keys and values in order to store and communicate partial results (see the sketch following this list).
2. The ability to execute user-specified initialization code before a map or reduce task starts, and user-specified termination code after it ends.
3. The ability to preserve state in mappers and reducers across multiple input or intermediate keys and values.
4. The ability to control the sort order of intermediate keys, and therefore the order in which a reducer encounters particular keys.
5. The ability to control the partitioning of the key space, and therefore the set of keys that a particular reducer encounters.
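As a sketch of the first technique, the following hypothetical SumCountWritable shows how a complex value, here a (sum, count) pair that can later yield a mean, can be defined as a custom Hadoop Writable and passed as a partial result between mappers, combiners, and reducers:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A complex value type: a (sum, count) pair used to communicate
    // partial results, e.g., when computing a mean in a combiner-safe way.
    public class SumCountWritable implements Writable {
      private long sum;
      private long count;

      public SumCountWritable() {}                  // required no-arg constructor

      public SumCountWritable(long sum, long count) {
        this.sum = sum;
        this.count = count;
      }

      public void add(SumCountWritable other) {     // merge two partial results
        sum += other.sum;
        count += other.count;
      }

      public double mean() {
        return count == 0 ? 0.0 : (double) sum / count;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeLong(sum);
        out.writeLong(count);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        sum = in.readLong();
        count = in.readLong();
      }
    }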
It is important to realize that many algorithms cannot be easily expressed as a single MapReduce job. One must often decompose a complex algorithm into a sequence of jobs, where the output of one job becomes the input to the next. Many algorithms are iterative in nature, requiring repeated execution until some convergence criterion is reached; this is the case for the graph algorithms in Chapter 5 and the expectation-maximization algorithm in Chapter 6. In most cases, the convergence check itself is not easy to implement within MapReduce. A common solution is an external (non-MapReduce) program that serves as a driver to coordinate the MapReduce iterations.
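A minimal sketch of such a driver follows, using Hadoop's Java API; buildIterationJob and the Convergence/CONVERGED counter are hypothetical placeholders for whatever per-iteration job configuration and convergence signal an actual algorithm would use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    // Driver program (not itself a MapReduce job) that chains iterations:
    // the output of iteration i becomes the input of iteration i + 1,
    // and a user-defined counter is used to test for convergence.
    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        boolean converged = false;

        for (int i = 0; i < 30 && !converged; i++) {
          Path output = new Path(args[1] + "/iteration-" + i);

          Job job = Job.getInstance(conf, "iteration " + i);
          // Hypothetical helper that sets the mapper, reducer, and
          // input/output paths and formats for one iteration.
          buildIterationJob(job, input, output);

          if (!job.waitForCompletion(true)) {
            System.exit(1);                    // abort on job failure
          }

          // Hypothetical counter incremented by reducers when the
          // convergence criterion is met.
          long done = job.getCounters()
              .findCounter("Convergence", "CONVERGED").getValue();
          converged = (done > 0);

          input = output;                      // feed output into next pass
        }
      }

      private static void buildIterationJob(Job job, Path in, Path out) {
        // Per-iteration job configuration would go here.
      }
    }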
This chapter explains how various techniques for controlling code execution and data flow in MapReduce can be applied to algorithm design. The focus is on scalability, ensuring that an algorithm has no inherent bottlenecks as it is applied to ever-larger datasets, and on efficiency, ensuring that an algorithm does not needlessly consume resources and thereby squander the benefits of parallelization. The golden rule, of course, is linear scalability: an algorithm should take at most twice as long to process twice as much data, and, similarly, doubling the number of nodes should cut the execution time in half.
This chapter is organized as follows:
Section 3.1 introduces the important concept of local aggregation in MapReduce and strategies for designing efficient algorithms that minimize the amount of data that must be transferred across the network. The proper use of combiners and the in-mapper combining design pattern are discussed in detail.
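To preview the idea, here is a sketch of an in-mapper combining version of the word-count mapper shown earlier, which accumulates counts in a per-task associative array and emits them only when the task finishes, greatly reducing the number of intermediate key-value pairs:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // In-mapper combining: counts are aggregated locally in a map-side
    // associative array and emitted once, in cleanup().
    public class InMapperCombiningMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private Map<String, Integer> counts;

      @Override
      protected void setup(Context context) {
        counts = new HashMap<>();                 // per-task state
      }

      @Override
      protected void map(LongWritable offset, Text line, Context context) {
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            counts.merge(token, 1, Integer::sum); // aggregate locally
          }
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        Text word = new Text();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
          word.set(e.getKey());
          context.write(word, new IntWritable(e.getValue()));
        }
      }
    }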
Section 3.2 introduces two common design patterns, known as pairs and stripes, illustrated through the problem of building a word co-occurrence matrix over a large text collection. These two approaches are useful whenever joint events must be tracked across a large number of observations.
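As a preview of the pairs approach, the following hypothetical mapper emits a count of 1 for each co-occurring pair of words on a line. A real implementation would use a custom pair WritableComparable and a configurable co-occurrence window; here the two words are simply packed into one tab-separated Text key:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Pairs approach: each co-occurring pair of words becomes its own
    // intermediate key; the reducer simply sums counts per pair.
    public class CooccurrencePairsMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text pair = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        String[] terms = line.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          // Treat the whole line as the co-occurrence window.
          for (int j = 0; j < terms.length; j++) {
            if (i != j && !terms[i].isEmpty() && !terms[j].isEmpty()) {
              pair.set(terms[i] + "\t" + terms[j]);
              context.write(pair, ONE);
            }
          }
        }
      }
    }

The stripes variant would instead emit, for each word, a single associative array (e.g., a MapWritable) mapping each neighboring word to its count.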
Section 3.3 shows how co-occurrence counts can be converted into relative frequencies using a pattern called order inversion. The sequencing of computations in the reducer is recast as a sorting problem: pieces of intermediate data are sorted into exactly the order needed to carry out a series of computations. A reducer often needs to compute an aggregate statistic over a set of elements before the individual elements can be processed. At first glance this seems impossible: how can the aggregate over a set of elements be computed before the elements themselves arrive? As it turns out, clever sorting of special key-value pairs makes exactly this possible.
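Here is a sketch of the reducer side of this pattern, under the assumption that the mapper emits both individual pair counts keyed by "w<TAB>u" and a special marginal count keyed by "w<TAB>*" (the asterisk byte sorts before any letter in Text's byte ordering, so the marginal arrives first):

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;

    // Order inversion: the reducer sees the marginal for w before any of
    // w's individual pairs, so it can divide as the pairs stream by.
    public class RelativeFrequencyReducer
        extends Reducer<Text, IntWritable, Text, DoubleWritable> {
      private double marginal = 0.0;            // state across reduce calls

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (IntWritable v : values) sum += v.get();

        if (key.toString().endsWith("\t*")) {
          marginal = sum;                        // total count for word w
        } else {
          // Individual pair (w, u): divide by the previously seen marginal.
          context.write(key, new DoubleWritable(sum / marginal));
        }
      }
    }

    // All keys sharing the same left word w must reach the same reducer,
    // so partitioning is done on w alone rather than on the full key.
    class WordPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        String leftWord = key.toString().split("\t")[0];
        return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }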
Section 3.4 provides a general solution to secondary sorting, the problem of sorting the values associated with a key in the reduce stage. We call this technique value-to-key conversion.
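The following sketch shows the supporting pieces for one hypothetical instance of this technique, sorting sensor readings by timestamp within each sensor: the timestamp is moved from the value into a composite Text key of the form "sensorId<TAB>timestamp", while a custom partitioner and grouping comparator consider only the sensor id, so each reducer call receives one sensor's readings already ordered by timestamp. These classes would be registered via job.setPartitionerClass(...) and job.setGroupingComparatorClass(...):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SecondarySort {
      // Route composite keys by sensor id only, ignoring the timestamp.
      public static class SensorPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
          String sensorId = key.toString().split("\t")[0];
          return (sensorId.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // Group composite keys by sensor id only, so all timestamps for one
      // sensor arrive in a single reduce call, in sorted order.
      public static class SensorGroupingComparator extends WritableComparator {
        public SensorGroupingComparator() {
          super(Text.class, true);               // create key instances
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
          String idA = a.toString().split("\t")[0];
          String idB = b.toString().split("\t")[0];
          return idA.compareTo(idB);
        }
      }
    }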
Section 3.5 covers how to perform joins on relational datasets and presents three different approaches: reduce-side joins, map-side joins, and memory-backed joins.
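As a sketch of the third approach, a memory-backed join loads the smaller dataset into a hash table during mapper initialization and probes it while streaming over the larger dataset. The file name users.txt and the tab-separated record layouts are hypothetical; in practice the side file would be shipped to each task via the distributed cache:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Memory-backed join: the small side is held in memory; the large
    // side streams through map() and probes the table by join key.
    public class MemoryBackedJoinMapper
        extends Mapper<LongWritable, Text, Text, Text> {
      private final Map<String, String> users = new HashMap<>();
      private final Text outKey = new Text();
      private final Text outValue = new Text();

      @Override
      protected void setup(Context context) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("users.txt"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t", 2); // userId <TAB> userName
            if (fields.length == 2) {
              users.put(fields[0], fields[1]);
            }
          }
        }
      }

      @Override
      protected void map(LongWritable offset, Text record, Context context)
          throws IOException, InterruptedException {
        String[] fields = record.toString().split("\t", 2); // userId <TAB> event
        if (fields.length < 2) return;
        String name = users.get(fields[0]);
        if (name != null) {                      // inner join: drop misses
          outKey.set(fields[0]);
          outValue.set(name + "\t" + fields[1]);
          context.write(outKey, outValue);
        }
      }
    }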