From: http://blog.csdn.net/leoleocmm/article/details/8602081
1. Overview
In traditional databases (such as MySQL), join operations are common and time-consuming, and the same is true in Hadoop. Because of Hadoop's distinctive design, however, joins call for some special techniques.
This article first introduces the common join implementations on Hadoop, and then presents several optimizations for different kinds of input datasets.
2. Introduction to common join methods
Assume that the data to be joined comes from file1 and file2.
2.1 Reduce side join
Reduce side join is the simplest join method. Its main idea is as follows:
In the map stage, the map function reads both files, file1 and file2. To distinguish the key/value pairs of the two sources, a tag is attached to each record, for example tag = 0 for records from file1 and tag = 2 for records from file2. In other words, the main task of the map stage is to tag the data coming from the different files.
In the reduce stage, the reduce function receives, for each key, the value list gathered from both file1 and file2, and then joins the file1 data with the file2 data for that key (a Cartesian product). In other words, the actual join is carried out in the reduce stage.
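The sketch below illustrates this approach with the new-style MapReduce API. The tag values follow the text (0 for file1, 2 for file2); the one-record-per-line, comma-separated field layout and all class names are assumptions made purely for illustration.

```java
// Reduce-side join sketch: the mapper tags each record with its source file,
// the reducer builds the Cartesian product of the two tagged groups per key.
// Input layout (one "key,value" pair per line) is an assumption.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

  public static class TagMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Decide the tag from the input file name (file1 -> 0, file2 -> 2).
      String file = ((FileSplit) context.getInputSplit()).getPath().getName();
      String tag = file.startsWith("file1") ? "0" : "2";
      String[] parts = line.toString().split(",", 2);
      context.write(new Text(parts[0]), new Text(tag + "," + parts[1]));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> fromFile1 = new ArrayList<>();
      List<String> fromFile2 = new ArrayList<>();
      // Separate the value list back into its two sources using the tag.
      for (Text v : values) {
        String[] parts = v.toString().split(",", 2);
        (parts[0].equals("0") ? fromFile1 : fromFile2).add(parts[1]);
      }
      // Cartesian product of the two groups for this key.
      for (String a : fromFile1)
        for (String b : fromFile2)
          context.write(key, new Text(a + "," + b));
    }
  }
}
```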
2.2 Map side join
Reduce side join exists because the map stage cannot obtain all of the fields needed for the join; that is, records with the same key may be processed by different map tasks. It is quite inefficient, because the shuffle stage has to transmit a large amount of data.
Map side join is an optimization for the following scenario: of the two tables to be joined, one is very large and the other is small enough to fit entirely in memory. In that case we can replicate the small table so that every map task holds a copy in memory (for example, in a hash table) and then scan only the large table: for each key/value record of the large table, look the key up in the hash table, and if a record with the same key exists, join and output them.
To support this file replication, Hadoop provides the DistributedCache class, which is used as follows:
(1) The user calls the static method DistributedCache.addCacheFile() to specify the file to be replicated. Its parameter is the file's URI (for a file on HDFS it looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port number you configured). The JobTracker obtains this URI list before the job starts and copies the corresponding files to the local disk of each TaskTracker.
(2) The user calls DistributedCache.getLocalCacheFiles() to obtain the local file paths and then reads the files with the standard file I/O API.
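A minimal sketch of these two steps, using the classic org.apache.hadoop.filecache.DistributedCache API (newer Hadoop releases expose the same feature through Job.addCacheFile()). The paths, the "key,value"-per-line layout, and the class names are illustrative assumptions.

```java
// Map-side join sketch using DistributedCache: the small file is shipped to
// every task, loaded into a hash table in setup(), then the large table is
// streamed through map().
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoin {

  // (1) At job-submission time, register the small file to be replicated.
  public static void configure(Job job) throws Exception {
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode:9000/home/XXX/file1"), job.getConfiguration());
  }

  public static class JoinMapper extends Mapper<Object, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<>();

    // (2) In each task, read the local copy into memory once.
    @Override
    protected void setup(Context context) throws IOException {
      Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
      try (BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split(",", 2);
          smallTable.put(parts[0], parts[1]);
        }
      }
    }

    // Scan the large table and join against the in-memory hash table.
    @Override
    protected void map(Object offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 2);
      String match = smallTable.get(parts[0]);
      if (match != null) {
        context.write(new Text(parts[0]), new Text(parts[1] + "," + match));
      }
    }
  }
}
```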
2.3 Semi-join
Semi-join, also called a half join, is a technique borrowed from distributed databases. The motivation: in a reduce side join, the volume of data shipped across machines is very large and becomes the bottleneck of the join; if the data that will not participate in the join can be filtered out on the map side, a great deal of network I/O is saved.
The implementation is very simple: take the small table, say file1, extract the keys that participate in the join, and save them to a file, file3. file3 is usually small enough to fit in memory. In the map stage, use DistributedCache to ship file3 to every TaskTracker, and filter out the records of file2 whose keys are not in file3. The reduce stage then works exactly as in a reduce side join.
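Under the same assumptions as the previous sketch, the map-side filtering step might look like the following; file3 is assumed here to hold one join key per line, and the loading of the cached file mirrors the DistributedCache pattern shown above.

```java
// Semi-join filtering sketch: file3 (the small key set extracted from file1)
// is loaded into a HashSet in setup(); records of file2 whose key is absent
// are dropped before the shuffle.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SemiJoinFilterMapper extends Mapper<Object, Text, Text, Text> {
  private final Set<String> joinKeys = new HashSet<>();

  @Override
  protected void setup(Context context) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    try (BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()))) {
      String key;
      while ((key = in.readLine()) != null) {
        joinKeys.add(key.trim()); // file3: one join key per line (assumption)
      }
    }
  }

  @Override
  protected void map(Object offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] parts = line.toString().split(",", 2);
    if (joinKeys.contains(parts[0])) { // emit only keys that can join
      context.write(new Text(parts[0]), new Text(parts[1]));
    }
  }
}
```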
For more information about semi-join, see: http://wenku.baidu.com/view/ae7442db7f1922791688e877.html
2.4 Reduce side join + BloomFilter
In some cases, even the key set extracted from the small table for the semi-join is too large to fit in memory. A BloomFilter can then be used to save space.
The most common use of a BloomFilter is to test whether an element belongs to a set; its two most important methods are add() and contains(). Its key property is that it produces no false negatives: if contains() returns false, the element is definitely not in the set. It does, however, produce false positives at some rate: if contains() returns true, the element is only probably in the set.
So the keys of the small table can be stored in a BloomFilter and used to filter the large table in the map stage. Some records whose keys are not in the small table will fail to be filtered out (records whose keys are in the small table are never filtered out), but this does not affect correctness; it merely adds a small amount of network I/O.
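Hadoop ships a Bloom filter implementation in the org.apache.hadoop.util.bloom package; the sketch below shows the two operations discussed above (named add() and membershipTest() in that package). The sizing parameters are illustrative assumptions and would have to be tuned to the expected number of keys and the acceptable false-positive rate.

```java
// BloomFilter sketch using Hadoop's built-in org.apache.hadoop.util.bloom
// package: build the filter from the small table's keys, then use it to
// filter the large table in map().
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterSketch {
  public static void main(String[] args) {
    // vectorSize bits and nbHash hash functions; these values are assumptions,
    // not recommendations.
    BloomFilter filter = new BloomFilter(10_000_000, 6, Hash.MURMUR_HASH);

    // Add every join key of the small table.
    filter.add(new Key("someKey".getBytes()));

    // membershipTest(): false means "definitely absent" (safe to filter out);
    // true means "probably present", so a few non-joining records slip through.
    boolean maybePresent = filter.membershipTest(new Key("someKey".getBytes()));
    System.out.println(maybePresent);
  }
}
```

Since this BloomFilter class is Writable, the filter built from the small table can itself be serialized to a file and shipped to the map tasks with DistributedCache, exactly like file3 in the semi-join case.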
For more information about BloomFilter, see: http://blog.csdn.net/jiaomeng/article/details/1495500
3. Secondary sorting
In Hadoop, keys are sorted by default, but what if the values need sorting too? That is, for each key, the value list received by the reduce function should be ordered by value. This need is common in join operations, for example when you want the small table's values for a key to arrive ahead of the large table's.
There are two approaches to secondary sorting: buffer-and-in-memory sort, and value-to-key conversion.
With buffer-and-in-memory sort, the idea is to buffer, inside the reduce() function, all the values for a key and sort them there. Its biggest drawback is that it can run out of memory when a key has many values.
With value-to-key conversion, the idea is to splice the key and part of the value into a composite key (by implementing the WritableComparable interface, or by calling setSortComparatorClass()), so that the results reaching reduce are sorted first by key and then by value. Note that you must supply your own Partitioner so that the data is partitioned by the original key alone. Hadoop explicitly supports secondary sorting: the Job class provides a setGroupingComparatorClass() method for setting the comparator that groups keys at the reducer; see: http://www.cnblogs.com/xuxm2007/archive/2011/09/03/2165805.html
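A condensed sketch of value-to-key conversion for the join scenario above: the composite key carries the join key plus the source tag, the Partitioner and the grouping comparator look only at the join key, and the full compareTo() ordering puts the small table's records (tag 0) first. All names are illustrative.

```java
// Value-to-key conversion sketch for secondary sort.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class CompositeKey implements WritableComparable<CompositeKey> {
  final Text joinKey = new Text();
  final IntWritable tag = new IntWritable();

  public void write(DataOutput out) throws IOException {
    joinKey.write(out);
    tag.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    joinKey.readFields(in);
    tag.readFields(in);
  }

  // Full sort order: join key first, then tag, so tag-0 records come first.
  public int compareTo(CompositeKey other) {
    int cmp = joinKey.compareTo(other.joinKey);
    return cmp != 0 ? cmp : tag.compareTo(other.tag);
  }

  // Partition on the join key only, so all tags of one key meet in one reducer.
  public static class KeyPartitioner extends Partitioner<CompositeKey, Text> {
    public int getPartition(CompositeKey key, Text value, int numPartitions) {
      return (key.joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the join key only, so one reduce() call spans both tags.
  public static class GroupingComparator extends WritableComparator {
    protected GroupingComparator() { super(CompositeKey.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return ((CompositeKey) a).joinKey.compareTo(((CompositeKey) b).joinKey);
    }
  }
}
```

These pieces would be wired up in the driver with job.setPartitionerClass(CompositeKey.KeyPartitioner.class) and job.setGroupingComparatorClass(CompositeKey.GroupingComparator.class).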
4. Postscript
I have been job hunting recently. Since my resume says I am familiar with Hadoop, almost every interviewer asks about Hadoop, and implementing a join in Hadoop has become a standard question; a few companies also ask about the principles of DistributedCache and how to use it for join operations. I put this article together to be better prepared for those interviewers.
5. References
(1) Book: Data-Intensive Text Processing with MapReduce, pages 60-67, Jimmy Lin and Chris Dyer, University of Maryland, College Park
(2) Book: Hadoop in Action, pages 107-131
(3) MapReduce secondary sort (SecondarySort): http://www.cnblogs.com/xuxm2007/archive/2011/09/03/2165805.html
(4) Semi-join introduction: http://wenku.baidu.com/view/ae7442db7f1922791688e877.html
(5) BloomFilter introduction: http://blog.csdn.net/jiaomeng/article/details/1495500
(6) Original article: Introduction to two-table join solutions in MapReduce: http://dongxicheng.org/mapreduce/hadoop-join-two-tables/