Two-table join instances in MapReduce


1. Overview

In traditional databases (such as MySQL), join operations are both common and time-consuming, and the same is true in Hadoop. Because of Hadoop's particular design, some special techniques apply when performing joins.

2. Introduction to common join methods

Assume that the data to be joined comes from file1 and file2.

2.1 reduce side join

Reduce side join is the simplest join method. Its main idea is as follows:

In the map stage, the map function reads both files, file1 and file2. To distinguish key/value pairs from the two sources, a tag is attached to each record, for example tag = 0 for records from file1 and tag = 2 for records from file2. In other words, the main task of the map stage is to tag the data from the different files.

In the reduce stage, the reduce function receives, for each key, the list of values coming from both file1 and file2, and then joins (takes the Cartesian product of) the file1 and file2 data sharing that key. That is, the actual join is performed in the reduce stage. The complete instance at the end of this article implements this approach.

Ref: Hadoop join — reduce side join

2.2 map side join

Reduce side join exists because all the join fields for a key cannot be obtained in the map stage: fields belonging to the same key may be processed by different map tasks. Reduce side join is inefficient because the shuffle stage requires a large amount of data transfer.

Map side join is an optimization for the following scenario: of the two tables to be joined, one is very large and the other is small enough to fit in memory. The small table can then be replicated so that every map task holds a copy in memory (for example, in a hash table), and only the large table is scanned: for each key/value record of the large table, look up the key in the hash table; if a match exists, join the two records and output the result.

To support this file replication, Hadoop provides the DistributedCache class, which is used as follows:

(1) The user calls the static method DistributedCache.addCacheFile() to specify the file to be replicated. Its parameter is the file's URI (for a file on HDFS it looks like hdfs://namenode:9000/home/XXX/file, where 9000 is the NameNode port you configured). Before the job starts, the JobTracker fetches this URI list and copies the corresponding files to the local disk of every TaskTracker.

(2) The user calls DistributedCache.getLocalCacheFiles() to obtain the local paths of the cached files and reads them with the standard file read/write API.
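Putting the two steps together, here is a minimal sketch of a map side join, assuming the small table is Address.txt from the instance below and both tables use whitespace-separated fields; the class name MapSideJoinMapper and the cache path are illustrative, not part of the original article:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<Object, Text, Text, Text> {

    private final HashMap<String, String> addressMap = new HashMap<String, String>();

    // In the driver, before submitting the job:
    //   DistributedCache.addCacheFile(
    //           new URI("hdfs://namenode:9000/user/root/Address.txt"), conf);

    // Each map task loads the cached small table into memory once.
    @Override
    protected void setup(Context context) throws IOException {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles == null || cacheFiles.length == 0) {
            return;
        }
        BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\\s+"); // e.g. "1      Beijing"
                if (fields.length == 2) {
                    addressMap.put(fields[0], fields[1]);
                }
            }
        } finally {
            reader.close();
        }
    }

    // Only the large table is scanned; each record is joined in memory.
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+"); // e.g. "AAAAA   1"
        if (fields.length == 2 && addressMap.containsKey(fields[1])) {
            context.write(new Text(fields[0]), new Text(addressMap.get(fields[1])));
        }
    }
}

Because the join completes entirely in the map stage, such a job can run with zero reduce tasks (job.setNumReduceTasks(0)), which eliminates the shuffle altogether.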

Ref: Hadoop join — map side join

2.3 semi join

Semi join, also known as semi-join, is a technique borrowed from distributed databases. The motivation: in a reduce side join, the volume of data shipped across machines is very large and becomes the bottleneck of the join. If the data that will not participate in the join can be filtered out on the map side, a great deal of network I/O is saved.

The implementation is simple: take the small table, say file1, extract the join keys from it, and save them to a file, file3. file3 is usually small enough to fit in memory. In the map stage, use DistributedCache to copy file3 to every TaskTracker, and filter out the file2 records whose key does not appear in file3. The reduce stage then works exactly as in a reduce side join.
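Below is a hedged sketch of the map-stage filter only (the reduce stage is unchanged from section 2.1). It assumes file3 holds one join key per line and has already been shipped with DistributedCache as described above; the class name SemiJoinMapper and the "flag,value" encoding are illustrative assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SemiJoinMapper extends Mapper<Object, Text, Text, Text> {

    private final HashSet<String> joinKeys = new HashSet<String>();

    // file3 (one join key per line) was shipped with
    // DistributedCache.addCacheFile() in the driver.
    @Override
    protected void setup(Context context) throws IOException {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        if (cacheFiles == null || cacheFiles.length == 0) {
            return;
        }
        BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                joinKeys.add(line.trim());
            }
        } finally {
            reader.close();
        }
    }

    // file2 records whose join key is absent from file3 are dropped here,
    // so they never enter the shuffle; survivors are tagged as in 2.1.
    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+"); // e.g. "AAAAA   1"
        if (fields.length == 2 && joinKeys.contains(fields[1])) {
            context.write(new Text(fields[1]), new Text("2," + fields[0]));
        }
    }
}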

2.4 reduce side join + bloomfilter

In some cases, the key set extracted from the small table for the semi join still does not fit in memory. A BloomFilter can then be used to save space.

The most common use of a BloomFilter is to test whether an element belongs to a set; its two most important methods are add() and contains(). Its key property is that false negatives are impossible: if contains() returns false, the element is definitely not in the set. False positives, however, can occur: if contains() returns true, the element is only probably in the set.
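Hadoop ships a Bloom filter implementation (org.apache.hadoop.util.bloom.BloomFilter) whose membership check is named membershipTest() rather than contains(). A minimal demonstration of the semantics just described, with the bit-vector size and hash count picked arbitrarily for the example:

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterDemo {
    public static void main(String[] args) {
        // vector size and number of hash functions are tuning knobs;
        // these values are arbitrary for the demo
        BloomFilter filter = new BloomFilter(10000, 6, Hash.MURMUR_HASH);

        filter.add(new Key("1".getBytes()));
        filter.add(new Key("2".getBytes()));

        // true: an added key always tests positive (no false negatives)
        System.out.println(filter.membershipTest(new Key("1".getBytes())));

        // usually false, but may be true: a key that was never added can
        // still test positive (false positives are possible)
        System.out.println(filter.membershipTest(new Key("9".getBytes())));
    }
}

Since this BloomFilter implements Writable, a filter built from the small table's keys can be serialized to a file and distributed to map tasks with DistributedCache, playing the role that file3 plays in the semi join.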

Therefore, the keys of the small table can be stored in a BloomFilter. When filtering the large table in the map stage, some records whose keys are not in the small table may slip through (records whose keys are in the small table are never filtered out). This does not matter; it merely adds a small amount of network I/O.



Instance:

Address.txt

1      Beijing
2      Guangzhou
3      Shenzhen
4      Xian


Factory.txt

AAAAA                    1
BBBBB                    3
CCCCC                    2
DDDDD                    1
FFFFFFF                  2
EEEEEEE                  3
GGGGGGG                  1
TextPair.java:

package com.baidu.util;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class TextPair implements WritableComparable<TextPair> {

    private String flag = "";
    private String key = "";
    private String value = "";
    private String content = "";

    public TextPair() {
    }

    public TextPair(String flag, String key, String value, String content) {
        this.flag = flag;
        this.key = key;
        this.value = value;
        this.content = content;
    }

    public String getFlag() { return flag; }
    public void setFlag(String flag) { this.flag = flag; }

    public String getKey() { return key; }
    public void setKey(String key) { this.key = key; }

    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }

    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }

    @Override
    public String toString() {
        return " " + key + " " + value;
    }

    @Override
    public int compareTo(TextPair o) {
        // TextPair is only used as a map output value, so ordering is unused
        return 0;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.flag = in.readUTF();
        this.key = in.readUTF();
        this.value = in.readUTF();
        this.content = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.flag);
        out.writeUTF(this.key);
        out.writeUTF(this.value);
        out.writeUTF(this.content);
    }
}
JoinMapper.java:

package com.baidu.join;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import com.baidu.util.TextPair;

public class JoinMapper {

    public static int time = 0;

    /*
     * In map, first determine whether the input line belongs to the left
     * table (address) or the right table (factory), split the two columns,
     * store the join column in the key, and tag the value with the table it
     * came from, then output.
     */
    public static class Map extends Mapper<Object, Text, Text, TextPair> {

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString(); // one record per file line
            StringTokenizer itr = new StringTokenizer(line);
            int i = 0;
            String[] strs = new String[2];
            while (itr.hasMoreTokens()) {
                strs[i] = itr.nextToken();
                i++;
            }
            TextPair pair = new TextPair();
            if (line.length() > 1) {
                if (strs[0].charAt(0) >= '0' && strs[0].charAt(0) <= '9') {
                    // address record, e.g. "1  Beijing"
                    pair.setFlag("1");
                    pair.setKey(strs[0]);
                    pair.setValue(strs[1]);
                } else {
                    // factory record, e.g. "AAAAA  1"
                    pair.setFlag("2");
                    pair.setKey(strs[1]);
                    pair.setValue(strs[0] + "," + strs[1]);
                }
            }
            // emit records from both tables keyed by the join column
            context.write(new Text(pair.getKey()), pair);
        }
    }

    /*
     * Reduce parses the map output, stores the values separately for the
     * left and right tables according to the flag, computes the Cartesian
     * product, and writes the result.
     */
    public static class Reduce extends Reducer<Text, TextPair, Text, Text> {

        // these collections are static, so they persist across keys within a
        // reduce task; cityMap keeps each factory from being emitted twice
        private static HashMap<String, String> addMap = new HashMap<String, String>(1000);
        private static List<String> facList = new ArrayList<String>(1000);
        private static HashMap<String, Boolean> cityMap = new HashMap<String, Boolean>(1000);

        public void reduce(Text key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException {
            Iterator<TextPair> ite = values.iterator();
            while (ite.hasNext()) {
                TextPair pair = ite.next();
                if ("1".equals(pair.getFlag())) {
                    addMap.put(pair.getKey(), pair.getValue());
                } else {
                    facList.add(pair.getValue());
                }
            }
            // Cartesian product of the two sides for this join key
            for (int j = 0; j < facList.size(); j++) {
                String[] facStrs = facList.get(j).split(",");
                if (addMap.containsKey(facStrs[1])
                        && !cityMap.containsKey(facStrs[0])) {
                    cityMap.put(facStrs[0], true);
                    context.write(new Text(facStrs[0]), new Text(addMap.get(facStrs[1])));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // This line is critical when running against a remote cluster:
        // conf.set("mapred.job.tracker", "192.168.1.2:9001");
        String[] ioArgs = new String[] { "/user/root/TXT/table", "/out/Table1" };
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Multiple table join <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "multiple table join");
        job.setJarByClass(JoinMapper.class);
        // set the map and reduce processing classes
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TextPair.class);
        // set the output types (the reducer emits Text/Text)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // set the input and output directories
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


Result:

AAAAA	Beijing
DDDDD	Beijing
GGGGGGG	Beijing
CCCCC	Guangzhou
FFFFFFF	Guangzhou
BBBBB	Shenzhen
EEEEEEE	Shenzhen


This article is from the "Dream think Xi" blog; please keep this source: http://qiangmzsx.blog.51cto.com/2052549/1559340
