3.5 Relational Joins
A popular application of Hadoop is data warehousing. In an enterprise setting, a data warehouse serves as a central repository for vast amounts of data, holding nearly everything from sales transactions to product inventories. Traditionally the data have been relational, but as volumes grow, data warehouses are increasingly used to store semi-structured data (for example, query logs) as well as unstructured data. Data warehouses form the basis of business intelligence applications that provide decision support. It is widely believed that insights gained by mining historical and current data, and by forecasting future trends, can yield a competitive advantage in the marketplace.
Traditionally, data warehouses have been implemented on top of relational databases, particularly those optimized for online analytical processing (OLAP). A number of vendors offer parallel databases, but users often find that they cannot cost-effectively scale to the volumes of data that need to be processed today; parallel databases tend to be very expensive, typically priced per terabyte of user data. Over the past few years, Hadoop has become a popular platform for data warehousing. Hammerbacher [68] recounted how Facebook first built its business intelligence applications on Oracle databases and later abandoned that approach in favor of Hive, a Hadoop-based system developed in-house that is now an open-source project. Pig [114] is a platform for analyzing massive datasets built on Hadoop that can handle structured as well as semi-structured data. It was originally developed by Yahoo, but is now an open-source project as well.
Given the successful application of Hadoop to data warehousing and to complex analytical queries, it makes sense to examine how MapReduce manipulates relational data. This section focuses on how to implement relational joins in MapReduce. We emphasize that Hadoop is not a database, even though it has been applied to processing relational data. The debate over the relative merits of parallel databases and MapReduce for OLAP applications is ongoing. DeWitt and Stonebraker, two well-known figures in the database community, argued in a widely read blog post that MapReduce is a major step backwards.11 With colleagues, they presented a series of benchmarks purporting to show that parallel databases outperform Hadoop [120, 144]. However, see Dean and Ghemawat's counterarguments [47] and the recent attempts at hybrid architectures [1].
We will refrain from adding to this debate and focus instead on the algorithms. From an application point of view, it is quite likely that an analyst working with a data warehouse will never need to write MapReduce programs directly (indeed, Hadoop-based tools such as Hive and Pig provide higher-level languages for manipulating large amounts of data). Nevertheless, it is instructive to understand the underlying algorithms.
This section describes three different strategies for performing relational joins on two datasets (relations), generically named S and T. Suppose that relation S looks like the following:
(k1, s1, S1)
(k2, s2, S2)
(k3, s3, S3)
...
where k is the key we would like to join on, sn is a unique id for the tuple, and Sn after sn denotes the other attributes in the tuple (unimportant for the purposes of the join). Similarly, suppose that relation T looks like the following:
(k1, t1, T1)
(k3, t2, T2)
(k8, t3, T3)
...
where k is the join key, tn is a unique id for the tuple, and Tn after tn denotes the other attributes in the tuple.
To make this task more concrete, consider a plausible real-world scenario: S might represent a collection of user profiles, with k as the primary key (i.e., the user id). The tuples might contain demographic information such as age, gender, and income. The other dataset, T, might represent logs of the users' online activity. Each tuple might correspond to a page view of a particular URL and might contain additional information such as the time spent on the page and the advertising revenue generated. In these tuples, k can be interpreted as a foreign key that associates each page view with a user. Joining the two datasets would allow an analyst, for example, to break down users' online activity in terms of their demographic characteristics.
3.5.1 Reduce-Side Join
The first approach to relational joins is what is known as a reduce-side join. The idea is quite simple: we map over both datasets and emit the join key as the intermediate key and the tuple itself as the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples will be grouped by the join key, which is exactly what we need to perform the join. This approach is known in the database community as a parallel sort-merge join. In more detail, there are three cases to consider.
The first and simplest is a one-to-one join, where at most one tuple from S and one tuple from T share the same join key (although it may happen that a key appears in S with no matching tuple in T, or vice versa). In this case, the algorithm sketched above works without modification. The reducer will be presented with keys and lists of values along the lines of the following:
k23 → [(s64, S64), (t84, T84)]
k37 → [(s68, S68)]
k59 → [(t97, T97), (s81, S81)]
k61 → [(t99, T99)]
...
Since we emit the join key as the intermediate key, we can remove it from the value to save a little space.12 If there are two values associated with a key, then we know that one must come from S and the other from T. However, recall that the basic MapReduce programming model provides no guarantees about the ordering of values, so the first value might be from either S or T. We can proceed to join the two tuples and perform any additional computation (for example, filtering on other attributes or computing aggregates). If there is only one value associated with a key, no tuple in the other dataset shares the join key, so the reducer does nothing.
12 Not very important if the intermediate data are compressed.
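To make the basic reduce-side join concrete, here is a minimal, framework-agnostic Python sketch of the one-to-one case. The tuple layout (join key, tuple id, other attributes) and the 'S'/'T' tags are illustrative assumptions rather than Hadoop APIs; the grouping of values by key is what the MapReduce framework would do for us.

```python
def map_s(record):
    """Map over relation S: emit the join key; keep the tagged tuple as the value."""
    join_key, tuple_id, attrs = record
    yield join_key, ('S', tuple_id, attrs)


def map_t(record):
    """Map over relation T: same idea, tagged so the reducer can tell the sides apart."""
    join_key, tuple_id, attrs = record
    yield join_key, ('T', tuple_id, attrs)


def reduce_one_to_one(join_key, values):
    """One-to-one case: at most one tuple from each relation shares the join key."""
    s_tuple = next((v for v in values if v[0] == 'S'), None)
    t_tuple = next((v for v in values if v[0] == 'T'), None)
    if s_tuple is not None and t_tuple is not None:
        yield join_key, (s_tuple[1:], t_tuple[1:])  # the joined record
    # If only one value is present, the other relation has no tuple with this key: emit nothing.


# Example, with the shuffle already done by hand:
# list(reduce_one_to_one('k23', [('S', 's64', {'age': 35}), ('T', 't84', {'url': '...'})]))
```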
Let us now consider the one-to-many join. Assume that the tuples in S have unique join keys (i.e., k is the primary key in S), so that S is the "one" and T is the "many". The algorithm described above still works, but when processing each key in the reducer we have no idea when the value corresponding to the tuple from S will be encountered, since values are arbitrarily ordered. The easiest solution is to buffer all values in memory, pick out the tuple from S, and then cross it with every tuple from T to perform the join. However, as we have seen several times already, this creates a scalability bottleneck, since we may not have enough memory to hold all the tuples with the same join key.
This is a problem that calls for a secondary sort, and the solution is the value-to-key conversion design pattern described earlier.
In the mapper, instead of simply emitting the join key as the intermediate key, we create a composite key consisting of the join key and the tuple id (from either S or T). Two additional changes are required. First, we must define the sort order of the keys so that they are sorted first by the join key, and then by the tuple ids from S before the tuple ids from T. Second, we must define the partitioner to pay attention only to the join key, so that all composite keys with the same join key arrive at the same reducer.
After applying the value-to-key conversion design pattern, the reducer will be presented with keys and values along the lines of the following:
(k82, s105) → [(S105)]
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
...
Since both the join key and the tuple id are contained in the intermediate key, we can remove them from the value to save a little space.13 Whenever the reducer encounters a new join key, it is guaranteed that the first associated value is the tuple from S. The reducer can hold this tuple in memory and then cross it with the tuples from T that follow (until a new join key is encountered). Because the MapReduce execution framework performs the sorting, there is no need to buffer tuples (other than the single one from S), and thus we have eliminated the scalability bottleneck.

Finally, let us consider the many-to-many join. Assuming that S is the smaller dataset, the above algorithm still works. Consider what happens at the reducer:
13 Again, not very important if the intermediate data are compressed.
(k82, s105) → [(S105)]
(k82, s124) → [(S124)]
...
(k82, t98) → [(T98)]
(k82, t101) → [(T101)]
(k82, t137) → [(T137)]
...
All the tuples from S with the same join key are encountered first, and the reducer can buffer them in memory. As the reducer then processes each tuple from T, it is crossed with all the buffered tuples from S. Of course, we are assuming that the tuples from S (with a common join key) fit into memory, which is a limitation of this algorithm (and the reason we want to control the sort order so that the smaller dataset arrives first).
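The following sketch, again in framework-agnostic Python, suggests how the value-to-key conversion variant might look. The composite key (join key, relation tag, tuple id), the two-way tag, and the function names are assumptions for illustration; in Hadoop the sort order and the partitioner would be supplied through the job configuration rather than as plain functions.

```python
def map_record(relation_tag, record):
    """relation_tag is 'S' or 'T'; since 'S' < 'T', S tuples sort before T tuples per join key."""
    join_key, tuple_id, attrs = record
    yield (join_key, relation_tag, tuple_id), attrs


def partition(composite_key, num_reducers):
    """Partition on the join key only, so all composite keys sharing it meet in one reducer."""
    join_key, _, _ = composite_key
    return hash(join_key) % num_reducers


def reduce_stream(sorted_pairs):
    """sorted_pairs: ((join_key, tag, tuple_id), attrs) pairs, already sorted by composite key."""
    current_key, buffered_s = None, []
    for (join_key, tag, tuple_id), attrs in sorted_pairs:
        if join_key != current_key:
            current_key, buffered_s = join_key, []   # new join key: reset the buffer
        if tag == 'S':
            buffered_s.append((tuple_id, attrs))     # only tuples from S are ever buffered
        else:
            for s_id, s_attrs in buffered_s:         # cross each T tuple with the buffered S tuples
                yield join_key, ((s_id, s_attrs), (tuple_id, attrs))
```

For the one-to-many case the buffer holds at most one tuple; for many-to-many it holds all the S tuples sharing the current join key, which is why S should be the smaller relation.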
The basic idea behind the reduce-side join is to repartition the two datasets by the join key. The approach is not particularly efficient, since it requires shuffling both datasets across the network. This brings us to the map-side join.
3.5.2 Map-Side Join
Suppose the two datasets are both sorted by the join key. The join can then be performed by scanning through the two datasets simultaneously; this is known as a merge join in the database community. We can parallelize this by partitioning and sorting both datasets in the same way. For example, suppose S and T are each divided into ten files, partitioned in the same manner by the join key, and further suppose that within each file the tuples are sorted by the join key. In this case, we simply need to merge join the first file of S with the first file of T, the second file of S with the second file of T, and so on. This can be done in parallel, in the map phase of a MapReduce job; hence the name map-side join. In practice, we map over one of the datasets (the larger one) and inside the mapper read the corresponding part of the other dataset to perform the merge join.14 No reducer is required, unless the programmer wishes to repartition the output or perform further processing.
14 Note that this generally implies a non-local read.
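Here is a small Python sketch of the merge-join logic a mapper would run. It assumes two iterators of (join key, tuple) pairs taken from correspondingly partitioned files, each already sorted by join key, and, for brevity, at most one tuple per key on each side; the handling of files and partitions is left out.

```python
def merge_join(s_iter, t_iter):
    """s_iter, t_iter: iterators over (join_key, tuple) pairs, sorted by join key."""
    s = next(s_iter, None)
    t = next(t_iter, None)
    while s is not None and t is not None:
        if s[0] < t[0]:
            s = next(s_iter, None)        # S is behind: advance S
        elif s[0] > t[0]:
            t = next(t_iter, None)        # T is behind: advance T
        else:
            yield s[0], (s[1], t[1])      # matching join keys: emit the joined pair
            # One tuple per key is assumed here; many-to-many data would require
            # buffering and crossing all tuples that share this key.
            s = next(s_iter, None)
            t = next(t_iter, None)
```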
A map-side join is far more efficient than a reduce-side join because it does not need to shuffle the datasets over the network. But is it realistic to expect that the strict conditions required for a map-side join will be met in practice? In many cases, yes. The reason is that relational joins occur within the broader context of a workflow, which may include many steps. Therefore, the datasets to be joined are likely the output of previous processing (whether MapReduce jobs or other code). If the workflow is known in advance and relatively static (both reasonable assumptions for a mature workflow), we can engineer the earlier processing to make efficient map-side joins possible (in MapReduce, by using a custom partitioner and controlling the sort order of the key-value pairs).
For ad hoc data analysis, reduce-side joins are more general, even though they are less efficient. Consider datasets with multiple keys that one might want to join on; no matter how the data are organized, a map-side join will require repartitioning for some of those joins. Alternatively, it is always possible to repartition a dataset using an identity mapper and reducer, but of course this incurs the cost of shuffling the data over the network.
There is a final constraint to keep in mind when using map-side joins with the Hadoop implementation of MapReduce. We assumed above that the datasets to be joined are produced by previous MapReduce jobs, so this constraint applies to the keys that the reducers of those jobs may emit. Hadoop permits reducers to emit keys that differ from the input key whose values they are processing (that is, input and output keys need not be the same, or even of the same type).15 However, if the output key of a reducer differs from its input key, then the output dataset will not necessarily be partitioned in a manner consistent with the specified partitioner (because the partitioner applies to the input keys, not the output keys). Since map-side joins depend on consistent partitioning and sorting by the join key, a reducer used to generate data that will later participate in a map-side join must not emit any key other than the one it is currently processing.
3.5.3 Memory-Backed Join
Besides the two approaches described above, which leverage the MapReduce framework to bring together tuples that share a common join key, there is a family of approaches based on random access probes that we call memory-backed joins. The simplest version applies when one of the two datasets fits completely in memory on each node. In this situation, we load the smaller dataset into memory in every mapper, populating an associative array keyed by the join key to facilitate random access to its tuples. The mapper initialization API hook (see Section 3.1.1) can be used for this purpose. The mappers are then applied to the other (larger) dataset, and for each input key-value pair, the mapper probes the in-memory dataset to see if there is a tuple with the same join key. If there is, the join is performed. This is known in the database community as a simple hash join [51].
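A minimal Python sketch of this in-mapper hash join follows. The class shape is only illustrative; in Hadoop the constructor logic would live in the mapper's initialization hook, and the smaller relation would be read from a side file or the distributed cache.

```python
class HashJoinMapper:
    def __init__(self, small_relation):
        """Initialization: build an in-memory index of S keyed by the join key."""
        self.s_index = {}
        for join_key, tuple_id, attrs in small_relation:
            self.s_index.setdefault(join_key, []).append((tuple_id, attrs))

    def map(self, t_record):
        """Applied to every tuple of the larger relation T: probe the index and join on a hit."""
        join_key, tuple_id, attrs = t_record
        for s_tuple in self.s_index.get(join_key, []):
            yield join_key, (s_tuple, (tuple_id, attrs))
```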
What if neither dataset fits in memory? The simplest solution is to divide the smaller dataset, say S, into n partitions, such that S = S1 ∪ S2 ∪ ... ∪ Sn. We can choose n so that each partition is small enough to fit in memory, and then run n memory-backed joins. This, of course, requires streaming through the other dataset n times.
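A sketch of this n-pass variant, reusing the HashJoinMapper from the previous sketch; s_partitions and read_t are hypothetical helpers standing in for however the partitions of S and the stream over T are obtained (in Hadoop this would typically amount to n separate jobs).

```python
def n_pass_hash_join(s_partitions, read_t):
    """Join S and T when S does not fit in memory: S = S1 ∪ S2 ∪ ... ∪ Sn."""
    for s_part in s_partitions:          # each Si is small enough to fit in memory
        mapper = HashJoinMapper(s_part)  # build the in-memory index for this partition
        for t_record in read_t():        # stream through all of T once per partition
            yield from mapper.map(t_record)
```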
15 For comparison, recall from Section 2.2 that in Google's implementation, a reducer's output key must be of the same type as its input key.
There is an alternative approach to memory-backed joins for cases where neither dataset fits into memory: use a distributed key-value store to hold one dataset in memory across multiple machines, and map over the other. The mappers then query the distributed key-value store in parallel and perform the join if the join key matches.16 The open-source caching system memcached can be used for exactly this purpose, so we call this approach a memcached join. For more information, see the associated technical report [95].
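As a rough illustration only, the following sketch assumes the pymemcache client library, a memcached server at localhost:11211, JSON serialization, and at most one S tuple per join key; none of these choices come from the text above.

```python
import json
from pymemcache.client.base import Client

client = Client(('localhost', 11211))  # assumed memcached endpoint


def load_s(s_records):
    """One-time loading step: store each S tuple in memcached under its join key."""
    for join_key, tuple_id, attrs in s_records:
        client.set(join_key, json.dumps([tuple_id, attrs]))


def map_t(t_record):
    """Run inside each mapper: probe the distributed store for a matching S tuple."""
    join_key, tuple_id, attrs = t_record
    cached = client.get(join_key)
    if cached is not None:
        s_tuple = json.loads(cached)
        yield join_key, (s_tuple, [tuple_id, attrs])
```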