I am studying Hadoop, working through the Hadoop practice book compiled by Lu Jiaheng, which includes a single-table join program. Below I sort out my ideas on this textbook example.
The task: given a child-parent table, output the grandchild-grandparent table.
Sample input:
child parent
Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack jesee
Terry Alice
Terry jesee
Philip Terry
Philip Alma
Mark Terry
Mark Alma
Sample output:
grandchild grandparent
Tom Alice
Tom jesee
Jone Alice
Jone jesee
Tom Mary
Tom Ben
Jone Mary
Jone Ben
Philip Alice
Philip jesee
Mark Alice
Mark jesee
The crux of this problem is choosing the right key.
Solution: single-table join.
From the sample input we can see the chain child -- parent (acting as child) -- parent. Following this chain gives us the grandchild -- grandparent relationship.
For example:
Child parent
Tom Lucy
Tom Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack jesee
In this way, we can easily find the following relationship:
Grandchild grandparent
Tom Mary
Tom Ben
Tom Alice
Tom jesee
We can connect like this:
Table 1:          Table 2:
child  parent     child  parent
Tom    Lucy       Lucy   Mary
Tom    Jack       Lucy   Ben
                  Jack   Alice
                  Jack   jesee
We join table 1 and table 2 on table 1's parent column equal to table 2's child column, then drop the second column of table 1 and the first column of table 2; what remains is the result.
Notice that table 1 and table 2 are actually the same table, so this is a single-table (self) join.
We simply treat one copy of the table as the left table and the other as the right table.
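To make the join idea concrete before moving to MapReduce, here is a minimal sketch in plain Java (no Hadoop) that performs the same self-join in memory. The class and method names (`SelfJoinSketch`, `grandRelations`) are my own illustration, not part of the textbook program: an index from child to parents plays the role of the right table, and scanning the (child, parent) rows plays the left table.

```java
import java.util.*;

public class SelfJoinSketch {
    // For each (child, parent) row, look up the parent's own parents:
    // those are the child's grandparents.
    public static List<String[]> grandRelations(List<String[]> rows) {
        Map<String, List<String>> parentsOf = new HashMap<>();
        for (String[] cp : rows)
            parentsOf.computeIfAbsent(cp[0], k -> new ArrayList<>()).add(cp[1]);
        List<String[]> out = new ArrayList<>();
        for (String[] cp : rows)                                   // cp = (child, parent)
            for (String gp : parentsOf.getOrDefault(cp[1], Collections.emptyList()))
                out.add(new String[]{cp[0], gp});                  // parent's parent
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"Tom", "Lucy"}, new String[]{"Tom", "Jack"},
            new String[]{"Lucy", "Mary"}, new String[]{"Lucy", "Ben"},
            new String[]{"Jack", "Alice"}, new String[]{"Jack", "jesee"});
        for (String[] r : grandRelations(rows))
            System.out.println(r[0] + " " + r[1]);
        // prints Tom Mary, Tom Ben, Tom Alice, Tom jesee (one per line)
    }
}
```

The MapReduce version below distributes exactly this lookup: the shuffle phase, rather than a HashMap, is what brings matching rows together.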
Map stage:
Split each input line into child and parent. To distinguish the left and right tables, add a tag to the output value. For the left table, emit parent as the key and the left-table tag + child as the value; for the right table, emit child as the key and the right-table tag + parent as the value.
The map stage thus tags every record as left or right; the shuffle stage then brings left and right records with the same key together, performing the join.
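The emission rule can be sketched in plain Java (no Hadoop); the names `MapEmitSketch` and `mapLine` are illustrative, but the tags "1" and "2" match the ones used in the full program below:

```java
import java.util.*;

public class MapEmitSketch {
    // For one input line, emit the two tagged key-value pairs the mapper produces.
    public static List<String[]> mapLine(String line) {
        List<String[]> out = new ArrayList<>();
        String[] f = line.trim().split("\\s+");
        if (!f[0].equals("child")) {                      // skip the header row
            String child = f[0], parent = f[1];
            out.add(new String[]{parent, "1:" + child});  // left table: reversed
            out.add(new String[]{child, "2:" + parent});  // right table: as-is
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : mapLine("Tom Lucy"))
            System.out.println(kv[0] + " -> " + kv[1]);
        // Lucy -> 1:Tom
        // Tom -> 2:Lucy
    }
}
```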
Reduce stage:
For each key, the value list received by the reducer contains both kinds of tagged records, for example <Lucy, <leftTag:Tom, rightTag:Mary, rightTag:Ben>>.
Here the left-tagged values (Tom) are grandchildren and the right-tagged values (Mary, Ben) are grandparents. The reducer parses each value: entries carrying the left tag are saved into the grandchild[] array, entries carrying the right tag into the grandparent[] array. Finally, it takes the Cartesian product of grandchild[] and grandparent[] and emits the resulting pairs.
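The parsing-plus-Cartesian-product step can be sketched for a single key in plain Java (no Hadoop); `ReduceSketch` and `reduceKey` are illustrative names, and `ArrayList` is used here instead of the fixed-size arrays in the full program:

```java
import java.util.*;

public class ReduceSketch {
    // For one key, split tagged values into grandchildren ("1:") and
    // grandparents ("2:"), then emit their Cartesian product.
    public static List<String[]> reduceKey(List<String> values) {
        List<String> grandchildren = new ArrayList<>();
        List<String> grandparents = new ArrayList<>();
        for (String v : values) {
            String[] parts = v.split(":");
            if (parts[0].equals("1")) grandchildren.add(parts[1]);  // left tag
            else grandparents.add(parts[1]);                        // right tag
        }
        List<String[]> out = new ArrayList<>();
        for (String gc : grandchildren)
            for (String gp : grandparents)
                out.add(new String[]{gc, gp});
        return out;
    }

    public static void main(String[] args) {
        // The example key "Lucy" from above: values ["1:Tom", "2:Mary", "2:Ben"]
        for (String[] pair : reduceKey(Arrays.asList("1:Tom", "2:Mary", "2:Ben")))
            System.out.println(pair[0] + " " + pair[1]);
        // Tom Mary
        // Tom Ben
    }
}
```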
The following is the program code:
package cn.edu.ytu.botao.singletablejoin;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Single-table join
 *
 * child  parent
 * Tom    Lucy
 * Tom    Jack
 * Lucy   Mary
 * Lucy   Ben
 *
 * Left table: output reversed <key = parent, value = child>
 *   Lucy Tom
 *   Jack Tom
 *
 * Right table: output as-is <key = child, value = parent>
 *   Lucy Mary
 *   Lucy Ben
 *
 * After the join: <Tom, <Mary, Ben>>
 *
 * @author botao
 */
public class STJoin {
    private static int time = 0;

    public static class STJMapper extends Mapper<Object, Text, Text, Text> {
        // Tags marking which table a record came from
        private Text leftTag = new Text("1");   // left table
        private Text rightTag = new Text("2");  // right table

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String childName;
            String parentName;
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            String[] values = new String[2];
            int i = 0;
            while (tokenizer.hasMoreElements()) {
                values[i++] = (String) tokenizer.nextElement();
            }
            // Skip the header line "child parent"
            if (values[0].compareTo("child") != 0) {
                childName = values[0];
                parentName = values[1];
                // Left table: output reversed, keyed on parent
                context.write(new Text(parentName),
                        new Text(leftTag.toString() + ":" + childName));
                // Right table: output as-is, keyed on child
                context.write(new Text(childName),
                        new Text(rightTag.toString() + ":" + parentName));
            }
        }
    }

    public static class STJoinReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Record grandchild information
            int grandChildNum = 0;
            String[] grandChild = new String[20];
            // Record grandparent information
            int grandParentNum = 0;
            String[] grandParent = new String[20];

            if (time == 0) {
                // Emit the header row once
                context.write(new Text("grandchild"), new Text("grandparent"));
                time++;
            }

            /*
             * Left-tagged ("1") values go into grandChild[],
             * right-tagged ("2") values go into grandParent[]
             */
            for (Text text : values) {
                String value = text.toString();
                String[] temp = value.split(":");
                if (temp[0].compareTo("1") == 0) {
                    grandChild[grandChildNum++] = temp[1];
                }
                if (temp[0].compareTo("2") == 0) {
                    grandParent[grandParentNum++] = temp[1];
                }
            }

            // Cartesian product of grandChild[] and grandParent[]
            if (0 != grandChildNum && 0 != grandParentNum) {
                for (int i = 0; i < grandChildNum; i++) {
                    for (int j = 0; j < grandParentNum; j++) {
                        context.write(new Text(grandChild[i]),
                                new Text(grandParent[j]));
                    }
                }
            }
        }
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: stjoin <in> <out>");
            System.exit(2);
        }
        // If the "out" folder already exists, delete it first
        Path path = new Path("out");
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(path)) {
            fs.delete(path);
        }
        Job job = new Job(conf, "stjoin");
        job.setJarByClass(STJoin.class);
        job.setMapperClass(STJMapper.class);
        job.setReducerClass(STJoinReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
This article is from the "botao" blog; please keep this source: http://botao900422.blog.51cto.com/4747129/1549912
Hadoop learning: single-table join