SQL join (intermediate): analysis of MapReduce join methods in Hive

Source: Internet
Author: User

1. Overview

This article describes how to implement a join of two tables on the MapReduce framework.

2. Introduction to common join methods

Assume that the data to be joined comes from two files, File1 and File2.

2.1 Reduce side Join

Reduce side join is one of the simplest join methods. Its main idea is as follows:

In the map phase, the map function reads both File1 and File2. To distinguish key/value pairs from the two sources, it attaches a tag to each record, for example tag=0 for records from File1 and tag=2 for records from File2. In other words, the main task of the map phase is to tag the data according to its source file.

In the reduce phase, the reduce function receives, for each key, the list of values from both File1 and File2, and then joins the File1 records with the File2 records for that key (a Cartesian product). In other words, the actual join is performed in the reduce phase.
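The two phases described above can be sketched in plain Python. This is only a single-process simulation of the idea, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, reduce_side_join) and the sample records are hypothetical, and the tag values 0 and 2 simply follow the example in the text.

```python
from collections import defaultdict
from itertools import product

def map_phase(records, tag):
    """Tag every (key, value) pair with the id of its source file."""
    for key, value in records:
        yield key, (tag, value)

def shuffle(tagged_pairs):
    """Group tagged values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, tagged_value in tagged_pairs:
        groups[key].append(tagged_value)
    return groups

def reduce_phase(key, tagged_values):
    """Split the value list by tag and emit the Cartesian product of both sides."""
    left = [value for tag, value in tagged_values if tag == 0]   # from File1
    right = [value for tag, value in tagged_values if tag == 2]  # from File2
    for left_value, right_value in product(left, right):
        yield key, left_value, right_value

def reduce_side_join(file1, file2):
    """Run map, shuffle and reduce in sequence for two in-memory 'files'."""
    tagged = list(map_phase(file1, 0)) + list(map_phase(file2, 2))
    result = []
    for key, values in sorted(shuffle(tagged).items()):
        result.extend(reduce_phase(key, values))
    return result

# Hypothetical sample data: (key, value) pairs.
file1 = [(1, "alice"), (2, "bob"), (3, "carol")]
file2 = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
joined = reduce_side_join(file1, file2)
```

Keys 2 and 4 appear in only one file, so one side of their Cartesian product is empty and they produce no output. Note that every record, matching or not, still passes through the shuffle step.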

2.2 Map Side Join

Reduce side join exists because the map phase cannot see all the fields needed for the join: records with the same key may be processed by different map tasks. Reduce side join is inefficient because a large amount of data is transferred during the shuffle phase.

Map side join is an optimization for the following scenario: of the two tables to be joined, one is very large and the other is small enough to be held entirely in memory. The small table is replicated so that every map task holds a copy in memory (for example, in a hash table), and only the large table is scanned. For each key/value record in the large table, the map task looks up the key in the hash table and, if a matching record exists, outputs the joined result.
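A minimal sketch of this hash-table lookup, again as a plain-Python simulation rather than Hadoop code. The names build_hash_table and map_side_join, the sample data, and the split of the large table into two lists are all assumptions made for illustration.

```python
def build_hash_table(small_table):
    """Load the small table into memory, keyed by join key."""
    table = {}
    for key, value in small_table:
        table.setdefault(key, []).append(value)
    return table

def map_side_join(large_split, small_lookup):
    """Scan one split of the large table and join each record by hash lookup."""
    for key, value in large_split:
        for small_value in small_lookup.get(key, []):
            yield key, small_value, value

# Hypothetical data: the small table fits in memory, the large one arrives in splits.
small = [(1, "alice"), (3, "carol")]
large_splits = [
    [(1, "book"), (2, "cup")],
    [(3, "lamp"), (1, "pen")],
]

# Each simulated map task gets its own reference to the in-memory copy.
lookup = build_hash_table(small)
joined = [row for split in large_splits for row in map_side_join(split, lookup)]
```

No shuffle or reduce step is needed here: each split is joined independently, which is why the approach only works when the small table fits in the memory of every map task.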

To support file replication, Hadoop provides the DistributedCache class, which is used as follows:

(1) The user calls the static method DistributedCache.addCacheFile() to specify the file to be copied. Its parameter is the URI of the file (for a file on HDFS, for example: hdfs://namenode:9000/home/xxx/file, where 9000 is the NameNode port number you configured). Before the job starts, the JobTracker obtains this list of URIs and copies the corresponding files to the local disk of each TaskTracker.

(2) The user calls DistributedCache.getLocalCacheFiles() to obtain the local paths of the cached files and reads them with the standard file read/write API.

2.3 Semijoin

Semijoin (semi-join) is a technique borrowed from distributed databases. Its motivation is that in a reduce side join the amount of data transferred across machines is very large and becomes the bottleneck of the join. If records that cannot participate in the join are filtered out on the map side, network IO can be reduced substantially.

The implementation is simple. Choose the small table, say File1, extract the keys that participate in the join, and save them to a file, File3. File3 is generally small enough to fit in memory. In the map phase, DistributedCache copies File3 to each TaskTracker, and records in File2 whose keys do not appear in File3 are filtered out. The remaining work, in the reduce phase, is the same as in reduce side join.
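The map-side filtering step can be sketched as follows. This is a plain-Python illustration under stated assumptions: extract_join_keys and semi_join_map are hypothetical names, File3 is modelled as an in-memory set rather than a cached file, and the tag 2 follows the tagging example used for reduce side join.

```python
def extract_join_keys(file1):
    """Build File3: the set of join keys that occur in the small table."""
    return {key for key, _value in file1}

def semi_join_map(file2_split, file3_keys):
    """Map phase over File2: drop records whose key is not in File3, tag the rest."""
    for key, value in file2_split:
        if key in file3_keys:
            yield key, (2, value)  # only these records are sent through the shuffle

# Hypothetical sample data.
file1 = [(1, "alice"), (3, "carol")]
file2 = [(1, "book"), (2, "cup"), (3, "lamp"), (4, "desk")]

file3 = extract_join_keys(file1)
survivors = list(semi_join_map(file2, file3))
```

In this example half of File2 is discarded before the shuffle. The surviving records would then be joined in the reduce phase exactly as in a reduce side join.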


2.4 Reduce side join + BloomFilter

In some cases even the key set that semijoin extracts from the small table does not fit in memory. A BloomFilter can then be used to save space.

The most common use of a BloomFilter is to test whether an element is in a set. Its two most important methods are add() and contains(). Its key property is that it has no false negatives: if contains() returns false, the element is definitely not in the set. It does, however, have a certain rate of false positives: if contains() returns true, the element is only probably in the set.

So you can store the keys of the small table in a BloomFilter and use it to filter the large table in the map phase. Some large-table records whose keys are not in the small table may slip through the filter, but no record whose key is in the small table will ever be filtered out. The join result is therefore still correct, at the cost of a small amount of extra network IO.
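To make the add() and contains() behaviour concrete, here is a small self-contained Bloom filter sketch in Python. It is not Hadoop's implementation: the class, its parameters (num_bits, num_hashes), the SHA-256 based hashing scheme, and the sample data are all assumptions chosen for illustration.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: num_hashes hash positions per item over num_bits slots."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive several hash positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def contains(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[position] for position in self._positions(item))

# Store the small table's keys, then filter the large table in the "map phase".
small_keys = [1, 3, 5]
bloom = BloomFilter()
for key in small_keys:
    bloom.add(key)

large = [(1, "book"), (2, "cup"), (3, "lamp"), (4, "desk")]
kept = [(key, value) for key, value in large if bloom.contains(key)]
```

Records with keys 1 and 3 are guaranteed to be in kept. Keys 2 and 4 are most likely dropped, but a false positive could let them through, in which case the reduce phase would simply find no partner for them.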

For more information about Bloomfilter, refer to: http://blog.csdn.net/jiaomeng/article/details/1495500

3. A join optimization method: secondary sort

By default, Hadoop sorts by key. What if you also want to sort by value, that is, have the value list that the reduce function receives for a given key sorted by value? This requirement is common in join operations, for example when, for each key, the values from the small table should come first.

There are two ways to implement secondary sort: buffer and in-memory sort, and value-to-key conversion.

With buffer and in-memory sort, the reduce() function saves all the values for a key and then sorts them. The biggest drawback of this approach is that it may run out of memory.

With value-to-key conversion, the key and part of the value are combined into a composite key (by implementing the WritableComparable interface or by calling setSortComparatorClass()). The input to reduce is then sorted first by key and then by value. Note that the user must implement a custom Partitioner so that the data is still partitioned by the original key only. Hadoop supports secondary sort explicitly: the Job class has a setGroupingComparatorClass() method that sets the comparator used to group keys for a single reduce() call.
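The value-to-key conversion can be illustrated with a plain-Python simulation. It is a sketch under stated assumptions, not Hadoop code: to_composite, partition and run_reducer are hypothetical names that stand in for the composite key, the custom Partitioner and the sort plus grouping comparators, and the tags 0 (small table) and 2 (large table) follow the earlier tagging example.

```python
from itertools import groupby

def to_composite(key, tag, value):
    """Value-to-key conversion: move the tag from the value into the key."""
    return (key, tag), value

def partition(composite_key, num_reducers):
    """Partition on the natural key only, so all tags of one key meet in one reducer."""
    natural_key, _tag = composite_key
    return hash(natural_key) % num_reducers

def run_reducer(pairs):
    """Sort by the full composite key, then group by the natural key only."""
    ordered = sorted(pairs, key=lambda pair: pair[0])
    for natural_key, group in groupby(ordered, key=lambda pair: pair[0][0]):
        yield natural_key, [(composite[1], value) for composite, value in group]

# Hypothetical map output: (key, tag, value), tag 0 = small table, tag 2 = large table.
records = [(1, 2, "book"), (3, 2, "lamp"), (1, 0, "alice"), (1, 2, "pen"), (3, 0, "carol")]

num_reducers = 2
buckets = {reducer: [] for reducer in range(num_reducers)}
for key, tag, value in records:
    composite_key, out_value = to_composite(key, tag, value)
    buckets[partition(composite_key, num_reducers)].append((composite_key, out_value))

output = {}
for reducer in range(num_reducers):
    for natural_key, values in run_reducer(buckets[reducer]):
        output[natural_key] = values
```

For every key the small-table value (tag 0) now arrives first, so a reducer could hold just that value and stream the large-table values past it instead of buffering the whole list.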

4. Summary

To summarize, the core question for a join is whether the small table fits in memory. If it does, use a map side join; if it does not, use a reduce side join.

A reduce side join incurs a large amount of network and disk IO and performs poorly. There are several ways to optimize it:

     Method one: semi-join. Filter on the map side, keeping only the keys and columns the join needs.

     Method two: BloomFilter. Filter out records whose keys are definitely not in the driving table.

     Method three: secondary sort. It makes it easy to tell the two tables apart and to perform a merge join.
