MapReduce Table Joins: The Map-Side Join in Hadoop


One: Background

MapReduce offers several table join strategies: the map-side join, the reduce-side join, and the semi-join. Here we look at the map-side join. In a map-side join, the data is merged before it reaches the map function, which is much more efficient than a reduce-side join, because a reduce-side join has to shuffle all of the data across the network and consumes considerable resources.


Two: Technical implementation

Basic ideas:

(1): Two files are to be joined. One is stored in HDFS and read normally; the other is added to the cache of every map task with DistributedCache.addCacheFile().

(2): Read the cached file in the map function and perform the join there.

(3): Output the joined results; since the join is already complete in the map, no reduce phase is needed.

(4): DistributedCache.addCacheFile() must be called before the job is submitted.


What is DistributedCache?

DistributedCache is a file distribution tool designed to make application development easier. It automatically distributes read-only external files to every node and caches them locally, so that tasks can load them at runtime.


Steps for using DistributedCache:

(1): Upload the files to HDFS (text files, compressed files, jar packages, etc.).

(2): Call the relevant API to register the file information.

(3): In the task, access the local copies through the ordinary file read/write APIs.

Common APIs:

DistributedCache.addCacheFile();

DistributedCache.addCacheArchive();
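
As a minimal sketch of how these calls fit together (the HDFS paths /cache/tb_b and /cache/lookup.zip below are made-up placeholders; DistributedCache is the classic org.apache.hadoop.filecache API used throughout this post):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // (1) Both files must already exist in HDFS.
        // (2) Register them before the job is submitted:
        DistributedCache.addCacheFile(new URI("/cache/tb_b"), conf);          // a plain file
        DistributedCache.addCacheArchive(new URI("/cache/lookup.zip"), conf); // an archive, unpacked on each node
        // (3) Inside a task (e.g. in Mapper.setup()), the local copies are then located with:
        //     Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    }
}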


Let's drill into the map-side join with an example.

Table I: tb_a data is as follows

name	sex	age	depno
Zhang	male	1
Li	female	2
Wang	female	3
Zhou	male	2

Table II: tb_b data is as follows

depno	depname
1	Sales
2	Dev
3	Mgt

The requirement is to join the two tables above.


Note: In a map-side join we generally load the smaller table into memory, because memory is a precious resource. This also points to the limitation of the approach: if both tables hold very large amounts of data, a map-side join is not appropriate.


The code is as follows:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyMapJoin {
    // Input path of the table read by the mappers
    private static String INPUT_PATH1 = "";
    // Path of the table loaded into memory
    private static String INPUT_PATH2 = "";
    // Output path
    private static String OUT_PATH = "";

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // Get the command-line arguments; abort when they are invalid
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 3) {
                System.err.println("Usage: MyMapJoin <in1> <in2> <out>");
                System.exit(1);
            }
            INPUT_PATH1 = otherArgs[0];
            INPUT_PATH2 = otherArgs[1];
            OUT_PATH = otherArgs[2];
            // Delete the output directory if it already exists
            FileSystem fileSystem = FileSystem.get(new URI(OUT_PATH), conf);
            if (fileSystem.exists(new Path(OUT_PATH))) {
                fileSystem.delete(new Path(OUT_PATH), true);
            }
            // Add the in-memory table to the distributed cache (before the job is submitted)
            DistributedCache.addCacheFile(new Path(INPUT_PATH2).toUri(), conf);
            Job job = new Job(conf, MyMapJoin.class.getName());
            // Package as a jar to run; this line is the key
            job.setJarByClass(MyMapJoin.class);
            // 1.1 Set the input directory and the input format class
            FileInputFormat.setInputPaths(job, INPUT_PATH1);
            job.setInputFormatClass(TextInputFormat.class);
            // 1.2 Set the custom mapper class and the map output key/value types
            job.setMapperClass(MapJoinMapper.class);
            job.setMapOutputKeyClass(NullWritable.class);
            job.setMapOutputValueClass(Emp_Dep.class);
            // 1.3 Set the partitioner and the number of reduce tasks (0: map-only job)
            job.setPartitionerClass(HashPartitioner.class);
            job.setNumReduceTasks(0);
            FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
            // Submit the job and exit
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static class MapJoinMapper extends Mapper<LongWritable, Text, NullWritable, Emp_Dep> {
        private Map<Integer, String> joinData = new HashMap<Integer, String>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Preprocessing: the file to be joined was placed in the local cache
            Path[] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            // Only one file is cached here, so take the first one
            BufferedReader reader = new BufferedReader(new FileReader(paths[0].toString()));
            String str = null;
            try {
                // Read line by line, split each row of the cached table,
                // and keep the useful fields in a map: depno -> depname
                while ((str = reader.readLine()) != null) {
                    String[] splits = str.split("\t");
                    joinData.put(Integer.parseInt(splits[0]), splits[1]);
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                reader.close();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the row of the table loaded from HDFS
            String[] values = value.toString().split("\t");
            Emp_Dep emp_dep = new Emp_Dep();
            emp_dep.setName(values[0]);
            emp_dep.setSex(values[1]);
            emp_dep.setAge(Integer.parseInt(values[2]));
            // depno is the join key: look up depname in the in-memory table
            int depno = Integer.parseInt(values[3]);
            emp_dep.setDepno(depno);
            emp_dep.setDepname(joinData.get(depno));
            // Write out the joined record
            context.write(NullWritable.get(), emp_dep);
        }
    }
}
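
The Emp_Dep value class is not shown in the original post. Below is a minimal sketch of what it presumably looks like, assuming a plain bean that implements Hadoop's Writable, exposes the setters the mapper calls, and renders each output line through toString() (which is what the default TextOutputFormat uses); the author's actual definition may differ:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Emp_Dep implements Writable {
    private String name = "";
    private String sex = "";
    private int age;
    private int depno;
    private String depname = "";

    // Serialization: write the fields in a fixed order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeUTF(sex);
        out.writeInt(age);
        out.writeInt(depno);
        out.writeUTF(depname);
    }

    // Deserialization: read the fields back in the same order
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        sex = in.readUTF();
        age = in.readInt();
        depno = in.readInt();
        depname = in.readUTF();
    }

    // TextOutputFormat calls toString() to render each output record
    @Override
    public String toString() {
        return name + "\t" + sex + "\t" + age + "\t" + depno + "\t" + depname;
    }

    public void setName(String name) { this.name = name; }
    public void setSex(String sex) { this.sex = sex; }
    public void setAge(int age) { this.age = age; }
    public void setDepno(int depno) { this.depno = depno; }
    // Guard against a null depname when the join key has no match
    public void setDepname(String depname) { this.depname = (depname == null) ? "" : depname; }
}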
Results of the program run: (output screenshot from the original post omitted)

Note: I do not know why this program has to be packaged as a jar in order to run; running it directly from Eclipse against the server fails. (If any readers know the reason, please advise.)
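
For reference, submitting the packaged job from the command line typically looks like the following; the jar name and paths are placeholders:

hadoop jar mymapjoin.jar MyMapJoin /input/tb_a /input/tb_b /output

Here /input/tb_a is the table streamed through the mappers and /input/tb_b is the small table pushed into the distributed cache.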
