Hadoop Learning: Single-Table Join


I am studying Hadoop and working through Hadoop Practice, compiled by Lu Jiaheng, which contains a single-table join program. Here I sort out my ideas on this textbook example.

Given a child-parent table, the program must output the grandchild-grandparent table.

Sample input:

Child parent

Tom Lucy

Tom Jack

Jone Lucy

Jone Jack

Lucy Mary

Lucy Ben

Jack Alice

Jack Jesee

Terry Alice

Terry Jesee

Philip Terry

Philip Alma

Mark Terry

Mark Alma

Sample output:

Grandchild grandparent

Tom Alice

Tom Jesee

Jone Alice

Jone Jesee

Tom Mary

Tom Ben

Jone Mary

Jone Ben

Philip Alice

Philip Jesee

Mark Alice

Mark Jesee


In fact, once you see the trick, choosing the key for this problem is easy.

Solution: Single-table join

From the sample input file, we can see the chain child -- parent(child) -- parent. By following this connection, we find the grandchild -- grandparent pairs.

For example:

Child parent

Tom Lucy

Tom Jack

Lucy Mary

Lucy Ben

Jack Alice

Jack Jesee


In this way, we can easily find the following relationship:

Grandchild grandparent

Tom Mary

Tom Ben

Tom Alice

Tom Jesee


We can connect like this:

Table 1:              Table 2:

Child  Parent         Child  Parent

Tom    Lucy           Lucy   Mary

                      Lucy   Ben

Tom    Jack           Jack   Alice

                      Jack   Jesee



We can join table 1 and table 2 on table 1's parent column equal to table 2's child column, then remove the second column of table 1 and the first column of table 2. What remains is the result.

Here we can see that table 1 and table 2 are actually the same table, so this is a single-table join (a self-join).
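Before bringing MapReduce into it, the self-join itself can be sketched in plain Java. This is only an illustration of the join condition; the class and method names here are made up, not from the book's code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelfJoinSketch {
    // Join a child->parent table with itself:
    // left.parent == right.child, then keep (left.child, right.parent).
    public static List<String[]> grandPairs(List<String[]> pairs) {
        List<String[]> result = new ArrayList<>();
        for (String[] left : pairs) {            // left copy: (child, parent)
            for (String[] right : pairs) {       // right copy: (child, parent)
                if (left[1].equals(right[0])) {  // join condition
                    result.add(new String[] { left[0], right[1] });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] { "Tom", "Lucy" },
                new String[] { "Tom", "Jack" },
                new String[] { "Lucy", "Mary" },
                new String[] { "Lucy", "Ben" },
                new String[] { "Jack", "Alice" },
                new String[] { "Jack", "Jesee" });
        for (String[] gp : grandPairs(pairs)) {
            System.out.println(gp[0] + " " + gp[1]);
        }
    }
}
```

Running this on the small sample prints the four Tom pairs (Mary, Ben, Alice, Jesee), matching the relationship table above. MapReduce is needed only because the real table may be too large for a nested loop on one machine.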


You can treat this one table as both the left table and the right table.

Map stage:

Split each line of input into child and parent. To distinguish the left and right tables, add a tag to the output value. For the left table, use parent as the key and the left-table tag + child as the value; for the right table, use child as the key and the right-table tag + parent as the value.


The left and right tables are tagged in the map stage, and the shuffle stage brings the matching left and right rows together on the same key, which performs the join.
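The tagging rule alone can be written as a small helper. This is a sketch with made-up names ("1" marks the left table, "2" the right, as in the program's tags):

```java
import java.util.ArrayList;
import java.util.List;

public class MapTagSketch {
    // For one input line "child parent", produce the two tagged
    // key/value pairs the mapper would emit.
    public static List<String[]> emit(String line) {
        String[] f = line.trim().split("\\s+");
        List<String[]> out = new ArrayList<>();
        if (!f[0].equals("child")) {                     // skip the header line
            out.add(new String[] { f[1], "1:" + f[0] }); // left: key = parent
            out.add(new String[] { f[0], "2:" + f[1] }); // right: key = child
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : emit("Tom Lucy")) {
            System.out.println(kv[0] + " -> " + kv[1]);
        }
        // prints:
        // Lucy -> 1:Tom
        // Tom -> 2:Lucy
    }
}
```

After shuffle, both "1:Tom" (Lucy as parent) and any "2:..." values (Lucy as child) arrive under the key Lucy.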

Reduce stage:

For example: <Lucy, <lefttag:Tom, righttag:Mary, righttag:Ben>>

In the reduce stage, the value list received for each key therefore contains both grandchildren (lefttag:Tom) and grandparents (righttag:Mary, righttag:Ben). Each value is parsed: values carrying the lefttag are saved into the grandchild[] array and values carrying the righttag into the grandparent[] array; then the Cartesian product of grandchild[] and grandparent[] is written out.


The following is the program code:

package cn.edu.ytu.botao.singletablejoin;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Single-table join
 *
 * child  parent
 * Tom    Lucy
 * Tom    Jack
 * Lucy   Mary
 * Lucy   Ben
 *
 * Left table: reverse output <key parent, value child>
 *   Lucy Tom
 *   Jack Tom
 *
 * Right table: forward output <key child, value parent>
 *   Lucy Mary
 *   Lucy Ben
 *
 * After the join:
 *   <Tom, <Mary, Ben>>
 *
 * @author botao
 */
public class STJoin {

    private static int time = 0;

    public static class STJMapper extends Mapper<Object, Text, Text, Text> {
        // tags marking the left and right tables
        private Text leftTag = new Text("1");   // left table
        private Text rightTag = new Text("2");  // right table

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String childName;
            String parentName;
            // read one line of input
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            String[] values = new String[2];
            int i = 0;
            while (tokenizer.hasMoreElements()) {
                values[i++] = (String) tokenizer.nextElement();
            }
            // skip the header line "child parent"
            if (values[0].compareTo("child") != 0) {
                childName = values[0];
                parentName = values[1];
                // left table: reverse output, keyed by parent
                context.write(new Text(parentName),
                        new Text(leftTag.toString() + ":" + childName));
                // right table: forward output, keyed by child
                context.write(new Text(childName),
                        new Text(rightTag.toString() + ":" + parentName));
            }
        }
    }

    public static class STJoinReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // record and store grandchild information
            int grandChildNum = 0;
            String[] grandChild = new String[20];
            // record and store grandparent information
            int grandParentNum = 0;
            String[] grandParent = new String[20];

            // write the header once
            if (time == 0) {
                context.write(new Text("grandchild"), new Text("grandparent"));
                time++;
            }

            /*
             * Store the left-table values into grandChild[]
             * and the right-table values into grandParent[].
             */
            for (Text text : values) {
                String value = text.toString();
                String[] temp = value.split(":");
                // left table
                if (temp[0].compareTo("1") == 0) {
                    grandChild[grandChildNum++] = temp[1];
                }
                // right table
                if (temp[0].compareTo("2") == 0) {
                    grandParent[grandParentNum++] = temp[1];
                }
            }

            // Cartesian product of grandChild[] and grandParent[]
            if (0 != grandChildNum && 0 != grandParentNum) {
                for (int i = 0; i < grandChildNum; i++) {
                    for (int j = 0; j < grandParentNum; j++) {
                        context.write(new Text(grandChild[i]),
                                new Text(grandParent[j]));
                    }
                }
            }
        }
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: stjoin <in> <out>");
            System.exit(2);
        }
        // if the "out" folder already exists, delete it
        Path path = new Path("out");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(path)) {
            fs.delete(path);
        }
        Job job = new Job(conf, "stjoin");
        job.setJarByClass(STJoin.class);
        job.setMapperClass(STJMapper.class);
        job.setReducerClass(STJoinReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}


This article is from the "botao" blog; please keep this source: http://botao900422.blog.51cto.com/4747129/1549912
