MapReduce join operations are useful in scenarios such as the following:
Aggregating demographic information about users (e.g., differences in habits between teenagers and middle-aged people).
Emailing users who have not used the site for a certain amount of time to remind them to come back. (Each user defines their own threshold for that amount of time.)
Analyzing users' browsing habits, so the system can point out site features they have not yet tried, forming a feedback loop.
All of these scenarios require joining multiple datasets.
The two most commonly used join types are inner joins and outer joins. As the following illustration shows, an inner join compares all the tuples of the two relations against the join condition and produces a result set containing only the tuples that satisfy it. In contrast, an outer join does not require tuples from both relations to match: when the join condition is not satisfied, an outer join keeps the unmatched tuples from one side in the result set.
To implement both inner and outer joins, MapReduce provides three join strategies, listed below. Some run in the map phase and some in the reduce phase, and all are tailored to MapReduce's sort-merge architecture:
Repartition join -- a reduce-side join. Usage scenario: joining two or more large datasets.
Replicated join -- a map-side join. Usage scenario: one of the datasets to be joined is small enough to fit entirely in memory.
Semi-join -- another map-side join. Usage scenario: one of the datasets to be joined is large, but it can be filtered down to a size that fits in memory.
After covering these join strategies, we introduce one more aid: a decision tree that lets you choose the optimal strategy for your situation.
4.1.1 Repartition join
A repartition join is a reduce-side join. It uses MapReduce's sort-merge mechanism to group the data, requires only a single MapReduce job, and supports N-way joins, where N is the number of datasets being joined.
The map phase reads the data from the multiple datasets, determines the join key for each record, and emits that key as the map output key. The map output value holds the fields that will be merged in the reduce phase.
In the reduce phase, a reducer receives all the map output values for a given join key and separates the data into partitions, one per source dataset. The reducer then performs a Cartesian product across the partitions and emits the complete result set.
This MapReduce process is shown in Figure 4.2.
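To make the data flow concrete, here is a minimal sketch of a plain repartition join written against the standard Hadoop API. The file names, tags, and tab-separated field layout are assumptions for illustration. Note that it buffers every value for a key in memory, which is exactly the weakness the optimized framework below addresses.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class RepartitionJoin {
      public static class JoinMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        private String tag;

        @Override
        protected void setup(Context ctx) {
          // Tag each record with its source dataset so the reducer
          // can partition the values later.
          String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
          tag = file.startsWith("users") ? "U" : "L";
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          // The join key (the user name) is assumed to be the first
          // tab-separated field.
          String[] fields = value.toString().split("\t", 2);
          ctx.write(new Text(fields[0]), new Text(tag + "\t" + fields[1]));
        }
      }

      public static class JoinReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          // Partition the tagged values by source dataset...
          List<String> users = new ArrayList<String>();
          List<String> logs = new ArrayList<String>();
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if ("U".equals(parts[0])) users.add(parts[1]);
            else logs.add(parts[1]);
          }
          // ...then emit the Cartesian product of the partitions.
          for (String u : users)
            for (String l : logs)
              ctx.write(key, new Text(u + "\t" + l));
        }
      }
    }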
Note: filtering and projection
In a MapReduce repartition join, it is best to cut down the amount of data transferred from the map phase to the reduce phase, because sorting and shipping data across the network between these two phases is expensive. If the reduce-side work cannot be avoided, a best practice is to filter and project the data as aggressively as possible in the map phase. Filtering means discarding the unneeded parts of the map's input data. Projection, a concept from relational algebra, means cutting down the fields sent to the reducers. For example, when analyzing user data, if you only care about a user's age, the map task should project (that is, emit) only the age field and ignore the user's other fields.
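As a sketch of what filtering and projection look like inside a map function (continuing the hypothetical tab-separated user file, with the age assumed to be the second field):

    // Inside a mapper: skip malformed records (filtering) and emit only
    // the user name and age, dropping all other fields (projection).
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length >= 2) {                              // filtering
        ctx.write(new Text(fields[0]), new Text(fields[1])); // projection: age only
      }
    }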
Technique 19: Optimizing the repartition join
Hadoop in Action gives an example of using Hadoop's contrib package (org.apache.hadoop.contrib.utils.join) to implement a repartition join. The contrib package wraps up all the processing details, so the implementation is quite simple.
However, the contrib package's repartition join is space-inefficient: it must read all of the output values to be joined into memory before performing the multiway join. It is more efficient to load only the smaller dataset into memory and then stream the larger dataset past it while joining.
Problem
You want to perform a repartition join in MapReduce, but you do not want to cache all of the data during the reduce phase.
Solution
This technique uses an optimized repartition join framework that caches just one of the datasets being joined, reducing the amount of data the reducer has to hold in memory.
Discussion
Appendix D describes the implementation of the optimized repartition join framework. The implementation is modeled on the org.apache.hadoop.contrib.utils.join contrib package, but it caches only the smaller of the two datasets in order to reduce memory consumption. Figure 4.3 is the flowchart of the optimized repartition join.
Using the join framework requires implementing two abstract classes, OptimizedDataJoinMapperBase and OptimizedDataJoinReducerBase.
For example, suppose you need to join user details with a user activity log. The first step is to determine which of the two datasets is smaller. For a typical site, the user details will be relatively small and the user activity log relatively large.
In the following example, each user record holds a user name, age, and state.
Each activity log record holds a user name, an action, and a source IP. The log file is generally much larger than the user file.
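For illustration, the two hypothetical tab-separated input files might look like this:

    # users.txt: user name, age, state
    anne    22    NY
    joe     39    CO
    alison  35    NY

    # logs.txt: user name, action, source IP
    joe     login        10.0.0.1
    jim     new_tweet    10.0.0.2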
First, you must implement the abstract class OptimizedDataJoinMapperBase, which is invoked on the map side. This class creates the map output key and output value, and it also tells the framework whether the file currently being processed is the smaller of the two.
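A sketch of such a mapper implementation for the user/log join. The abstract method names and signatures (isInputSmaller, generateTaggedMapOutput, generateGroupKey) are assumed from the discussion in Appendix D, and TextTaggedOutputValue is a hypothetical OutputValue wrapper around a Text record:

    import org.apache.hadoop.io.Text;

    // Sketch only: method signatures and helper types are assumptions,
    // not verbatim from the framework.
    public class SampleMap extends OptimizedDataJoinMapperBase {

      @Override
      protected boolean isInputSmaller(String inputFile) {
        // The user-details file is the smaller dataset, so cache that side.
        return inputFile.contains("users");
      }

      @Override
      protected OutputValue generateTaggedMapOutput(Object value) {
        // Use the whole record as the value to be joined.
        return new TextTaggedOutputValue(new Text(value.toString()));
      }

      @Override
      protected String generateGroupKey(Object key, OutputValue output) {
        // The join key is the user name, the first tab-separated field.
        return output.getData().toString().split("\t")[0];
      }
    }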
Next, you must implement the abstract class OptimizedDataJoinReducerBase, which is invoked on the reduce side. It is handed the map output keys and values from the two datasets and returns the records that the reduce side should emit.
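A matching reducer sketch. The combine-style callback below is an assumption about how the framework hands the implementation one value from each dataset once the smaller side has been cached; returning null drops unmatched keys, which yields inner-join semantics:

    // Sketch only: the combine() signature is assumed.
    public class SampleReduce extends OptimizedDataJoinReducerBase {

      @Override
      protected OutputValue combine(String key,
                                    OutputValue smaller,
                                    OutputValue larger) {
        if (smaller == null || larger == null) {
          return null;  // inner join: skip keys missing from either dataset
        }
        // Concatenate the user details with the log record.
        Text joined = new Text(smaller.getData() + "\t" + larger.getData());
        return new TextTaggedOutputValue(joined);
      }
    }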
Finally, the job's driver code must specify the InputFormat class and set up the secondary sort.
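A hypothetical driver, assuming the framework supplies the composite key class plus the partitioner and grouping comparator that implement the secondary sort (all such class names here are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.*;

    public class JoinDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(JoinDriver.class);
        job.setJobName("optimized repartition join");
        job.setInputFormat(KeyValueTextInputFormat.class);
        job.setMapperClass(SampleMap.class);
        job.setReducerClass(SampleReduce.class);
        // Secondary sort: partition and group on the join key alone,
        // but sort on (join key, dataset flag) so each reducer sees the
        // smaller dataset's values first.
        job.setPartitionerClass(CompositeKeyPartitioner.class);
        job.setOutputValueGroupingComparator(CompositeKeyComparator.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]), new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        JobClient.runJob(job);
      }
    }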
With the join implemented, you can now run it:
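A hypothetical invocation against the sample files above (the jar name and paths are placeholders), followed by the joined output one would expect:

    $ hadoop jar join.jar JoinDriver users.txt logs.txt output
    $ hadoop fs -cat output/part*
    joe    39    CO    login    10.0.0.1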
If you compare the join output against the source files, you can see that, because an inner join was implemented, the output contains no records for users such as anne and alison, who do not appear in the log file.
Summary:
This join implementation improves on the Hadoop contrib package by caching only the smaller of the two datasets. However, it still incurs a high network cost when shuffling data from the map phase to the reduce phase.
In addition, while the Hadoop contrib package supports N-way joins, this implementation supports only two-way joins.
If you want to further reduce the memory footprint of the reduce-side join, a simple mechanism is to be aggressive about projection in the map function. Projection cuts down the fields included in the map phase's output. For example, when analyzing user data, if you only care about a user's age, the map task should project (emit) only the age field and ignore the user's other fields. Less data is then shipped between map and reduce, and the reducer consumes less memory during the join.
Like the original contrib package, this repartition join implementation supports filtering and projection. Filtering is supported by allowing the genMapOutputValue method to return null, and projection is supported by controlling which fields genMapOutputValue places in the output value.
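Continuing the sketch above (the genMapOutputValue signature is likewise assumed), filtering and projection through this hook might look like:

    // Sketch: keep only users from NY (filtering) and emit just the
    // user name and age (projection); signatures are assumptions.
    @Override
    protected OutputValue genMapOutputValue(Object value) {
      String[] f = value.toString().split("\t");   // name, age, state
      if (f.length < 3 || !"NY".equals(f[2])) {
        return null;                // filtering: drop this record entirely
      }
      // projection: carry only the fields needed downstream
      return new TextTaggedOutputValue(new Text(f[0] + "\t" + f[1]));
    }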
If you want to avoid shipping all of the data to the reduce phase and paying the cost of the sort, you need to consider the other two join strategies, the replicated join and the semi-join.
Appendix D Optimized MapReduce join frameworks
In this appendix we discuss the two join frameworks used in chapter 4. The first is the repartition join framework, which has a smaller memory footprint than the Hadoop join implementation in the org.apache.hadoop.contrib.utils.join package. The second is the replicated join framework, which caches the smaller of the datasets.
D.1 Optimized repartition join framework
The Hadoop contrib join package needs to read all of the values for each key into memory. How can you reduce the reducer's memory overhead during the join? In the optimization presented here, only the smaller dataset needs to be cached; the records of the larger dataset are streamed past the cache as they are joined. The method relies on a secondary sort of the map output, so that each reducer receives the values of the smaller dataset before the values of the larger one. Figure D.1 is the flowchart of this process.
Figure D.2 is the class diagram of the implementation. The diagram has two parts: the general-purpose framework, and the sample classes that implement it.
The join framework
We write the join code in a style similar to the Hadoop contrib join package. The goal is a generic repartition mechanism that can handle arbitrary datasets. For brevity, we highlight only the main parts.
The first class is OptimizedDataJoinMapperBase. Its job is to identify the smaller dataset and to generate the map output key and output value. The configure method is invoked when the mapper is created. One of its roles is to tag each dataset so the reducer can tell which source dataset a record came from; its other role is to determine whether the current input is the smaller dataset.
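A sketch of what configure might do; generateInputTag and isInputSmaller are assumed hook names, while "map.input.file" is the standard old-API property holding the current input file:

    // Sketch of the base class's configure() (names assumed).
    public void configure(JobConf job) {
      String inputFile = job.get("map.input.file");
      this.inputTag = generateInputTag(inputFile);  // tags the source dataset
      this.smaller = isInputSmaller(inputFile);     // is this the cached side?
    }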
The map method first invokes the custom generateTaggedMapOutput method to generate an OutputValue object. This object contains the values to be used in the join (and possibly the final output value), plus a Boolean that indicates whether the value comes from the smaller or the larger dataset. The map method then invokes the custom generateGroupKey method to obtain a key that can be used in the join, and that key becomes the map output key.
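The framework's map method might therefore look roughly like this (a sketch; the composite-key setters and the null-means-skip convention are assumptions):

    // Sketch of the base class's map() flow. outputKey and smaller are
    // fields initialized in configure().
    public void map(Object key, Object value,
                    OutputCollector output, Reporter reporter)
        throws IOException {
      OutputValue aRecord = generateTaggedMapOutput(value);
      if (aRecord == null) return;           // filtering hook
      String groupKey = generateGroupKey(key, aRecord);
      if (groupKey == null) return;          // no usable join key
      // Composite key = (join key, smaller/larger flag), feeding the
      // secondary sort.
      outputKey.setKey(groupKey);
      outputKey.setOrder(smaller ? 0 : 1);
      output.collect(outputKey, aRecord);
    }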
Figure D.3 illustrates the composite key and composite value of the map output. The secondary sort partitions on the join key but sorts on the entire composite key. The composite key includes an integer flag identifying the source dataset (larger or smaller), which guarantees that the reducer receives the values of the smaller dataset before those of the larger one.
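The plumbing behind that guarantee might look like the following sketch: a partitioner (class and accessor names hypothetical) that ignores the dataset flag, so that all records sharing a join key reach the same reducer while the flag still participates in the sort order:

    // Sketch of the secondary-sort partitioner (old API, names assumed).
    public class CompositeKeyPartitioner
        implements Partitioner<CompositeKey, OutputValue> {

      public void configure(JobConf job) {}

      public int getPartition(CompositeKey key, OutputValue value,
                              int numPartitions) {
        // Partition on the join key only; the smaller/larger flag is
        // left to the sort, not the partitioning.
        return (key.getKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }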
Next, let's look more closely at the reduce side. Because the values of the smaller dataset are guaranteed to arrive before the values of the larger dataset, the reducer can cache all of the smaller dataset's values; as soon as it starts receiving the larger dataset's values, it joins each one against the cache.
The joinAndCollect method combines a value from each of the two datasets and emits the result.
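Putting those two steps together, the reduce side might be sketched as follows (the isSmaller accessor, the cloning, and the joinAndCollect signature are assumptions; joinAndCollect ultimately calls the user's combine implementation):

    // Sketch of the base class's reduce() flow.
    public void reduce(Object key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      List<OutputValue> smallCache = new ArrayList<OutputValue>();
      while (values.hasNext()) {
        OutputValue v = (OutputValue) values.next();
        if (v.isSmaller()) {
          smallCache.add(v.clone());      // buffer only the smaller dataset
        } else {
          // Stream each larger-dataset value past the cached values.
          joinAndCollect(key, smallCache, v, output, reporter);
        }
      }
    }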
These are the main elements of this framework.