Hadoop MapReduce Advanced: Using the Distributed Cache for a Replicated Join

Source: Internet
Author: User
Tags: joins, hadoop, mapreduce
Concept:

A reduce-side join is flexible, but it can be extremely inefficient. Because the join does not begin until the reduce() phase, all of the data must first be shuffled across the network, and in most cases the bulk of that shuffled data is discarded during the join. We would therefore like to complete the join in the map phase.

Main technical difficulty: the problem with joining in the map phase is that a mapper may need to join against records it cannot obtain. If we can guarantee that every record a mapper needs is available to that mapper, the technique works. For example, if we know that the two data sources are divided into the same number of partitions, and that each partition is sorted on a key suitable as the join key, then each mapper can locally obtain all of the data it needs for the join. In fact, Hadoop's org.apache.hadoop.mapred.join package contains helper classes to implement such map-side joins, but the conditions it requires are rarely met in practice, and using these classes adds extra overhead, so we will not discuss the package further.

When is a map-side join applicable?

Scenario 1: the two data sources are divided into the same number of partitions, and each partition is sorted on a key suitable as the join key.
Scenario 2: when joining large datasets, usually only one of the sources is truly huge, while the other is smaller by orders of magnitude. For example, a phone company may have only a few thousand customer records, but billions of individual call records. When the small source fits in a mapper's memory, we can get a significant performance improvement by simply copying the small source to every mapper machine, so that each mapper performs the join during the map phase. This technique is called a replicated join.
Solution: Hadoop provides a mechanism called the distributed cache (DistributedCache) to distribute data to every node in the cluster. It is commonly used to distribute files containing "background" data that all mappers need. For example, if you use Hadoop to classify documents, you might have a list of keywords; you would use the distributed cache to ensure that every mapper has access to these keywords ("background data").

Procedure:

1. In the driver, register the file to be distributed to each node:

   DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);

2. In each mapper (typically once, in configure()), call DistributedCache.getLocalCacheFiles() to obtain the local paths of the cached files, then read them and perform the appropriate action:

   Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
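To make the procedure concrete, here is a minimal, Hadoop-independent sketch of the core replicated-join logic. In a real job, loadCustomers() would run once per mapper (in configure()/setup()) against the file returned by DistributedCache.getLocalCacheFiles(), and join() would run inside map() for each order record. The "custId,name" / "custId,amount" record layouts are assumptions made for illustration.

```java
import java.util.*;

public class ReplicatedJoinSketch {
    // Build an in-memory lookup table from the small "customers" source,
    // where each line is "custId,name". In a real mapper this happens once,
    // reading the local copy provided by the distributed cache.
    static Map<String, String> loadCustomers(List<String> lines) {
        Map<String, String> customers = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split(",", 2);
            customers.put(parts[0], parts[1]);
        }
        return customers;
    }

    // Join one "custId,amount" order record against the cached table.
    // Returns null when there is no matching customer (an inner join
    // simply drops such records).
    static String join(Map<String, String> customers, String order) {
        String[] parts = order.split(",", 2);
        String name = customers.get(parts[0]);
        return name == null ? null : parts[0] + "," + name + "," + parts[1];
    }

    public static void main(String[] args) {
        Map<String, String> customers =
            loadCustomers(Arrays.asList("1,Alice", "2,Bob"));
        for (String order : Arrays.asList("1,250", "3,99", "2,80")) {
            String joined = join(customers, order);
            if (joined != null) System.out.println(joined);
        }
    }
}
```

Because the whole small table sits in memory, each order record is joined with a single hash lookup and nothing is shuffled across the network.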
Remaining issue: a limitation of this approach is that one of the joined tables must be small enough to fit in memory. Even with asymmetrically sized inputs, the smaller one may still be too large. There are two ways to deal with this:

1. We can often make the data fit by rearranging the processing steps. For example, if we need the order data for all customers in area code 415, joining the Orders and Customers tables before filtering out other area codes is correct but inefficient: both tables may be too large to fit in memory. Instead, we can preprocess the data so that the Customers or Orders table becomes small enough.

2. Sometimes no amount of preprocessing makes the data small enough. In that case we should filter out the customers who do not belong to area code 415 during the map phase. See "Hadoop in Action", Section 5.2.3, on the semijoin.
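The preprocessing idea in step 1 can be sketched as a simple filter pass that shrinks the Customers table before it is replicated. The "custId,phone" record layout and the 415 area code are assumptions carried over from the running example; in a real preprocessing job this filter would be the body of a map-only job whose output is then added to the distributed cache.

```java
import java.util.*;

public class AreaCodeFilterSketch {
    // Keep only customer records whose phone number starts with the
    // wanted area code, shrinking the table before the replicated join.
    // Each line is assumed to look like "custId,areaCode-number".
    static List<String> filterByAreaCode(List<String> customers, String area) {
        List<String> kept = new ArrayList<>();
        for (String line : customers) {
            String phone = line.split(",", 2)[1];
            if (phone.startsWith(area + "-")) kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
            "1,415-555-0001", "2,212-555-0002", "3,415-555-0003");
        // Only the two 415 records survive; the reduced table may now
        // fit in memory and can be distributed to every mapper.
        System.out.println(filterByAreaCode(customers, "415"));
    }
}
```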
