Three common table join algorithms (join series, part one): nested loops join, sort-merge join, and hash join

Source: Internet
Author: User
Tags joins
Reference: http://mysun.iteye.com/blog/1748473 (part one of the series on implementing joins with Map-Reduce)


Before discussing how to implement joins with Map-Reduce, this opening article of the series looks at the join algorithms most widely used in databases today: nested loops join, sort-merge join, and hash join.
1. Nested Loops Join
for each record r in R do
    for each record s in S do
        if (r and s satisfy the join condition) then
            merge r and s, then output the result
        end if
    end for
end for
This join algorithm is simple to implement and supports any type of join condition. However, performance degrades severely as the two sets grow, because for a set R with m records and a set S with n records the time complexity is O(m*n).
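The nested loops above could be sketched in Java roughly as follows. This is only an illustration: the class name NestedLoopsJoinSketch, the join method, and the choice of int arrays as records are assumptions made for the example, not part of the original article. The join condition is passed in as a predicate to show that any condition, not just equality, can be used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

// Illustrative sketch only; all names here are hypothetical.
final class NestedLoopsJoinSketch {

    // Compares every record of r with every record of s and merges the pairs
    // for which the join condition holds.
    static List<int[]> join(List<int[]> r, List<int[]> s,
                            BiPredicate<int[], int[]> condition) {
        List<int[]> out = new ArrayList<>();
        for (int[] rowR : r) {          // outer loop: m records
            for (int[] rowS : s) {      // inner loop: n records, so m*n tests in total
                if (condition.test(rowR, rowS)) {
                    // Merge the two records by concatenating their columns.
                    int[] merged = Arrays.copyOf(rowR, rowR.length + rowS.length);
                    System.arraycopy(rowS, 0, merged, rowR.length, rowS.length);
                    out.add(merged);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> r = Arrays.asList(new int[]{1, 10}, new int[]{2, 20}, new int[]{3, 30});
        List<int[]> s = Arrays.asList(new int[]{2, 200}, new int[]{3, 300});
        // A non-equality condition: join when the first column of r is smaller than that of s.
        for (int[] row : join(r, s, (a, b) -> a[0] < b[0])) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Passing the condition as a BiPredicate keeps the loop structure identical for equality, range, or any other comparison, which is the flexibility the text describes; the cost is that every pair is always tested.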
2. Sort-Merge Join
Sort-merge join is a fairly commonly used algorithm. For two sets P and Q that need to be joined, it first sorts both sets on the attribute used in the join. After sorting is complete, the following algorithm merges the two sets:
p ∈ P; q ∈ Q; gq ∈ Q
while P and Q still have records do
    while p.a < gq.b do
        make p point to the next record in set P
    end while
    while p.a > gq.b do
        make gq point to the next record in set Q
    end while
    while p.a == gq.b do
        q = gq    // found two records that can be joined
        while p.a == q.b do
            join records p and q, then output the result
            make q point to the next record in set Q
        end while
        make p point to the next record in set P
    end while
    gq = q    // remember where the next joinable records may begin
end while
Suppose the records in the data set P and Q are as follows:
Set P:
A    B
ABC  1
ABC  2
ABC  9
ABC  8
ABC  0
ABC  3
ABC  5
ABC  7
ABC  4
ABC  6
Set Q:
C    B
def  5
def  6
def  9
def  4
def  6
def  3
def  8
def  8
def  4
def  5
We join the two data sets on column B. To do so, we first sort both sets on column B, which gives the following results:
Set P:
A    B
ABC  0
ABC  1
ABC  2
ABC  3
ABC  4
ABC  5
ABC  6
ABC  7
ABC  8
ABC  9
Set Q:
C    B
def  3
def  4
def  4
def  5
def  5
def  6
def  6
def  8
def  8
def  9
According to the algorithm, when the first two joinable records are found, the pointer into set P is on the record whose column B value is 3, and the pointer into set Q is on the first record of Q, whose column B value is also 3.

Once joinable records are found, the two records that p and q point to are joined and the result is output. Then q moves to the next record. At this point the column B values of p and q are no longer equal, so the algorithm moves p to the next record. On the next pass of the outer loop, p and gq both point to records whose column B value is 4, so the first two while loops are skipped and execution goes straight into the third while loop. There the inner loop joins the one record in set P whose column B value is 4 with the two records in set Q whose column B value is 4. When that loop ends, q points to the first record in Q whose column B value is 5, and p then advances to the record in P whose column B value is 5.

The algorithm continues in this way until all records with equal column B values in the two sets have been joined, and then it ends.
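The walkthrough above can be reproduced with a small Java sketch. It is a simplified variant rather than a line-by-line translation of the pseudocode: it uses array indexes instead of record pointers, keeps the group-start index (the role of gq) in place while scanning a group, and works on the column B values only. The class name SortMergeJoinSketch and the join method are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; all names here are hypothetical.
final class SortMergeJoinSketch {

    // Equi-joins two arrays of join-key values.
    // Returns {indexInSortedP, indexInSortedQ} for every matching pair.
    static List<int[]> join(int[] pKeys, int[] qKeys) {
        int[] p = pKeys.clone();
        int[] q = qKeys.clone();
        Arrays.sort(p);                     // sort both inputs on the join attribute
        Arrays.sort(q);

        List<int[]> out = new ArrayList<>();
        int i = 0;                          // current record in P
        int g = 0;                          // start of the current group in Q (gq)
        while (i < p.length && g < q.length) {
            if (p[i] < q[g]) {
                i++;                        // p is behind: advance p
            } else if (p[i] > q[g]) {
                g++;                        // gq is behind: advance gq
            } else {
                // Keys match: join p with every record of the group starting at g.
                for (int j = g; j < q.length && q[j] == p[i]; j++) {
                    out.add(new int[]{i, j});
                }
                i++;                        // g stays put so a duplicate key in P rejoins the group
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pB = {1, 2, 9, 8, 0, 3, 5, 7, 4, 6};   // column B of set P
        int[] qB = {5, 6, 9, 4, 6, 3, 8, 8, 4, 5};   // column B of set Q
        for (int[] pair : join(pB, qB)) {
            System.out.println("P[" + pair[0] + "] joins Q[" + pair[1] + "]");
        }
    }
}
```

With the example data, the first pair produced is the record with B value 3 in each sorted set, which matches the starting position described in the walkthrough.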
3. Hash Join
A hash join requires one of the two data sets being joined to be loaded into an in-memory hash table. This method therefore suits cases where one of the two data sets is small enough to fit entirely in memory. The pseudocode is as follows, where the two data sets are P and Q, and P is the smaller one that can be loaded into an in-memory hash table:
for each record p in set P do
    put p into the in-memory hash table H
end for
for each record q in set Q do
    if H contains a record p that matches q on the join condition then
        join p and q, then output the result
    end if
end for
Like the sort-merge join as presented above, this algorithm can only be used for equi-joins (joins on equality). It is typically faster than a sort-merge join because neither input has to be sorted, but it consumes a lot of memory.
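A minimal Java sketch of the two phases described above (build the hash table from P, then probe it with each record of Q) might look like this. The class name HashJoinSketch and the join method are hypothetical, and the records are reduced to their column B values, reusing the unsorted example data from the sort-merge section.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical.
final class HashJoinSketch {

    // Equi-joins two arrays of join-key values.
    // Returns {indexInP, indexInQ} for every matching pair.
    static List<int[]> join(int[] pKeys, int[] qKeys) {
        // Build phase: load the smaller set P into an in-memory hash table,
        // mapping each key to the positions in P that carry it.
        Map<Integer, List<Integer>> table = new HashMap<>();
        for (int i = 0; i < pKeys.length; i++) {
            table.computeIfAbsent(pKeys[i], k -> new ArrayList<>()).add(i);
        }

        // Probe phase: look up each record of Q by its key.
        List<int[]> out = new ArrayList<>();
        for (int j = 0; j < qKeys.length; j++) {
            List<Integer> matches = table.get(qKeys[j]);
            if (matches != null) {
                for (int i : matches) {
                    out.add(new int[]{i, j});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pB = {1, 2, 9, 8, 0, 3, 5, 7, 4, 6};   // column B of set P (build side)
        int[] qB = {5, 6, 9, 4, 6, 3, 8, 8, 4, 5};   // column B of set Q (probe side)
        for (int[] pair : join(pB, qB)) {
            System.out.println("P[" + pair[0] + "] joins Q[" + pair[1] + "]");
        }
    }
}
```

Because the lookup relies on hashing the key, only equality conditions can be expressed this way, and the whole of P must stay in memory for the duration of the probe phase.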
