Hadoop MapReduce Advanced: Using the Distributed Cache for a Replicated Join

Source: Internet
Author: User
Tags: joins, hadoop, mapreduce
Concept:

A reduce-side join is flexible, but it can be extremely inefficient. Because the join does not begin until the reduce() phase, all of the data must first be shuffled across the network, and in most cases the bulk of that shuffled data is discarded during the join. We would therefore like to complete the join in the map phase.

Main technical difficulty: the problem with joining in the map phase is that a mapper may need to join against records it cannot obtain. If we can guarantee that every record a mapper needs is available to that mapper, the technique works. For example, if we know that the two data sources are divided into the same number of partitions, and that each partition is sorted on a key suitable as the join key, then each mapper can locally obtain all of the data it needs for the join. In fact, Hadoop's org.apache.hadoop.mapred.join package contains helper classes to implement such map-side joins, but the conditions it requires are rarely met in practice, and using these classes adds extra overhead, so we will not discuss the package further.

When is a map-side join applicable?

Scenario 1: the two data sources are divided into the same number of partitions, and each partition is sorted on a key suitable as the join key.
Scenario 2: when joining large datasets, usually only one of the sources is truly huge, while the other is smaller by orders of magnitude. For example, a phone company may have only a few thousand customer records, but billions of individual call records. When the small source fits in a mapper's memory, we can get a significant performance improvement by simply copying the small source to every mapper machine, so that each mapper performs the join during the map phase. This technique is called a replicated join.
Solution: Hadoop provides a mechanism called the distributed cache (DistributedCache) to distribute data to every node in the cluster. It is commonly used to distribute files containing "background" data that all mappers need. For example, if you use Hadoop to classify documents, you might have a list of keywords; you would use the distributed cache to ensure that every mapper has access to these keywords ("background data").

Procedure:

1. In the driver, register the file to be distributed to each node:

   DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);

2. In each mapper (typically once, in configure()), call DistributedCache.getLocalCacheFiles() to obtain the local paths of the cached files, then read them and perform the appropriate action:

   Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
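To make the procedure concrete, here is a minimal, Hadoop-independent sketch of the core replicated-join logic. In a real job, loadCustomers() would run once per mapper (in configure()/setup()) against the file returned by DistributedCache.getLocalCacheFiles(), and join() would run inside map() for each order record. The "custId,name" / "custId,amount" record layouts are assumptions made for illustration.

```java
import java.util.*;

public class ReplicatedJoinSketch {
    // Build an in-memory lookup table from the small "customers" source,
    // where each line is "custId,name". In a real mapper this happens once,
    // reading the local copy provided by the distributed cache.
    static Map<String, String> loadCustomers(List<String> lines) {
        Map<String, String> customers = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            customers.put(parts[0], parts[1]);
        }
        return customers;
    }

    // Join one "custId,amount" order record against the cached table.
    // Returns null when there is no matching customer (an inner join
    // simply drops such records).
    static String join(Map<String, String> customers, String order) {
        String[] parts = order.split(",", 2);
        String name = customers.get(parts[0]);
        return name == null ? null : parts[0] + "," + name + "," + parts[1];
    }

    public static void main(String[] args) {
        Map<String, String> customers =
            loadCustomers(Arrays.asList("1,Alice", "2,Bob"));
        for (String order : Arrays.asList("1,250", "3,99", "2,80")) {
            String joined = join(customers, order);
            if (joined != null) System.out.println(joined);
        }
    }
}
```

Because the whole small table sits in memory, each order record is joined with a single hash lookup and nothing is shuffled across the network.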
Remaining issue: a limitation of this approach is that one of the joined tables must be small enough to fit in memory. Even with asymmetrically sized inputs, the smaller one may still be too large. There are two ways to deal with this:

1. We can often make the data fit by rearranging the processing steps. For example, if we need the order data for all customers in area code 415, joining the Orders and Customers tables before filtering out other area codes is correct but inefficient: both tables may be too large to fit in memory. Instead, we can preprocess the data so that the Customers or Orders table becomes small enough.

2. Sometimes no amount of preprocessing makes the data small enough. In that case we should filter out the customers who do not belong to area code 415 during the map phase. See "Hadoop in Action", Section 5.2.3, on the semijoin.
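The preprocessing idea in step 1 can be sketched as a simple filter pass that shrinks the Customers table before it is replicated. The "custId,phone" record layout and the 415 area code are assumptions carried over from the running example; in a real preprocessing job this filter would be the body of a map-only job whose output is then added to the distributed cache.

```java
import java.util.*;

public class AreaCodeFilterSketch {
    // Keep only customer records whose phone number starts with the
    // wanted area code, shrinking the table before the replicated join.
    // Each line is assumed to look like "custId,areaCode-number".
    static List<String> filterByAreaCode(List<String> customers, String area) {
        List<String> kept = new ArrayList<>();
        for (String line : customers) {
            String phone = line.split(",", 2)[1];
            if (phone.startsWith(area + "-")) kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
            "1,415-555-0001", "2,212-555-0002", "3,415-555-0003");
        // Only the two 415 records survive; the reduced table may now
        // fit in memory and can be distributed to every mapper.
        System.out.println(filterByAreaCode(customers, "415"));
    }
}
```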
