Five strategies for deduplicating massive data with MapReduce and HDFS


With the rapid growth of stored data, more and more people are paying attention to techniques for reducing storage volume. Data compression, single-instance storage, and data deduplication are frequently used data-reduction techniques.

Data deduplication usually refers to eliminating redundant data at the file or block level. Unlike compression, deduplication does not change the data itself; it removes the extra storage capacity that identical copies of the same data would occupy. Deduplication has significant advantages in reducing storage requirements and network bandwidth, and it helps with scalability.

As a simple example, we can see traces of data deduplication in the processing of call detail records by telecom operators. Similarly, for communication networks that carry identical packets, the same technique can be used for optimization.

In storage architectures, common methods for deduplication include hashing, binary comparison, and delta differencing. This Hadoopsphere article focuses on how to use MapReduce and HDFS to eliminate duplicate data. (The methods listed below include some researchers' experimental approaches, so describing them as "strategies" is appropriate.)

Strategy 1: Using only HDFS and MapReduce

Owen O'Malley suggested the following approach in a forum post:

Keep your historical data sorted by MD5 value. Then run a MapReduce job to sort your new data by MD5 as well. Note that this is a total sort over all of the new data, but because MD5 values are evenly distributed across the key space, the sort is easy: you basically pick the number of reduce tasks (say 256) and use the first N bits of the MD5 value to assign records to reduce tasks. Since this job only processes the new data, it is very fast.

Next, run a map-side join in which each merged input split covers a range of MD5 values. A RecordReader reads the historical and new datasets and merges them in order (the map-side join library can be used for this), and the map merges the new data into the old. This is a map-only job, so it is also very fast.

Of course, if the new data is small enough, you can instead read all of it in each map task, keep the new records sorted in RAM, and perform the merge in memory. That lets you skip the job that sorts the new data. Merge optimizations like this are exactly the kind of detail that Pig and Hive hide from developers.
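The heart of this approach is keying records by their MD5 hash so that identical records meet at the same reducer. The sketch below illustrates only that MD5-keyed sort/dedup step, not O'Malley's full map-side merge of historical and new data; the class names, the one-record-per-line input assumption, and the choice of 256 reduce tasks are hypothetical details for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Md5Dedup {

  // Key each input record (assumed to be one record per line) by the hex MD5 of its contents.
  public static class Md5Mapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(md5Hex(value.toString())), value);
    }

    private static String md5Hex(String record) throws IOException {
      try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(record.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      } catch (NoSuchAlgorithmException e) {
        throw new IOException(e);
      }
    }
  }

  // All records with the same MD5 arrive in one reduce call; keep a single copy.
  public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        context.write(key, value);
        break; // emit only the first occurrence
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "md5-dedup");
    job.setJarByClass(Md5Dedup.class);
    job.setMapperClass(Md5Mapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(256); // e.g. 256 reduce tasks, as in the forum post
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because MD5 values are evenly distributed, the default hash partitioner already spreads the keys roughly uniformly across the reduce tasks, which is what makes the total sort cheap.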

Strategy 2: Using HDFS and HBase

In a paper entitled "A novel technique for removing duplicate data in engineering cloud systems," Zhe Sun, Jun Shen, and Jianming Yong jointly presented a method that uses HDFS and HBase. It works as follows:

The MD5 and SHA-1 hash functions are used to compute a hash value for the file. That value is then passed to HBase and compared against the existing hash values. If the new value already exists in the HBase deduplication table, HDFS checks the file's link count; if the count is not zero, the counter for that hash value is incremented by one. If the count is zero, or the hash value does not exist in the deduplication table, HDFS asks the client to upload the file and updates the file's logical path. HDFS stores the source file uploaded by the user along with a corresponding link file, which is generated automatically. The link file records the hash value of the source file and the source file's logical path.

Be aware of the following key points when using this approach:

File-level deduplication needs to keep the index as small as possible so that lookups remain efficient. MD5 and SHA-1 need to be used together to avoid accidental collisions.
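A rough sketch of the HBase lookup step might look like the following. The table name "dedup", the column family "f", and the column qualifiers are hypothetical, and the HDFS link-count check, link-file bookkeeping, and actual client upload described above are omitted; this only shows the hash-then-check-then-record flow.

```java
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDedupCheck {

  // Combine MD5 and SHA-1 into one row key so accidental collisions become negligible.
  static String fingerprint(byte[] fileBytes) throws Exception {
    String md5 = toHex(MessageDigest.getInstance("MD5").digest(fileBytes));
    String sha1 = toHex(MessageDigest.getInstance("SHA-1").digest(fileBytes));
    return md5 + ":" + sha1;
  }

  static String toHex(byte[] digest) {
    StringBuilder sb = new StringBuilder();
    for (byte b : digest) sb.append(String.format("%02x", b));
    return sb.toString();
  }

  // Returns true if the client should upload the file, false if it is already stored.
  static boolean checkAndRecord(byte[] fileBytes, String logicalPath) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("dedup"))) { // hypothetical table
      byte[] row = Bytes.toBytes(fingerprint(fileBytes));
      Result existing = table.get(new Get(row));
      if (!existing.isEmpty()) {
        // Hash already known: bump the reference counter instead of re-uploading.
        table.incrementColumnValue(row, Bytes.toBytes("f"), Bytes.toBytes("refs"), 1L);
        return false;
      }
      // New hash: record it with its logical path and ask the client to upload.
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("path"), Bytes.toBytes(logicalPath));
      table.put(put);
      return true;
    }
  }
}
```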

Strategy 3: Using HDFS, MapReduce, and a storage controller

In an article called "Distributed duplicate detection in late processing of duplicate data deletion," NetApp engineers Ashish Kathpal, Gaurav Makkar, and John Mathew jointly propose replacing NetApp's original duplicate-detection mechanism with one based on Hadoop MapReduce. The Hadoop workflow for duplicate detection includes the following steps:

1. Move the data fingerprints from the storage controller to HDFS.
2. Generate a data fingerprint database and store it permanently on HDFS.
3. Use MapReduce to filter the duplicate records out of the fingerprint dataset, and send the resulting table of duplicate fingerprints back to the storage controller.

A data fingerprint is a hash computed over a data block in the storage system. A fingerprint is generally much smaller than the block it represents, which reduces the amount of data that has to be transferred across the network during distributed detection.
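At its core, the duplicate-detection step in this workflow is a grouping job over fingerprint records. A minimal sketch, assuming a hypothetical input format of one fingerprint and one block identifier per line separated by a tab, could look like this:

```java
import java.io.IOException;
import java.util.StringJoiner;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Groups fingerprint records by fingerprint and emits only the fingerprints seen
// more than once, i.e. the candidate duplicates to report back to the controller.
public class FingerprintDuplicates {

  public static class FingerprintMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Hypothetical record layout: "<fingerprint>\t<blockId>"
      String[] fields = value.toString().split("\t", 2);
      if (fields.length == 2) {
        context.write(new Text(fields[0]), new Text(fields[1]));
      }
    }
  }

  public static class DuplicateReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      StringJoiner blocks = new StringJoiner(",");
      int count = 0;
      for (Text blockId : values) {
        blocks.add(blockId.toString());
        count++;
      }
      // A fingerprint that appears more than once marks duplicate blocks.
      if (count > 1) {
        context.write(key, new Text(blocks.toString()));
      }
    }
  }
}
```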

Strategy 4: Using streaming, HDFS, and MapReduce

For integrating Hadoop with stream processing, there are basically two possible scenarios. Taking the integration of IBM InfoSphere Streams and BigInsights as an example, the scenarios are:

1. Streams to Hadoop flow: through a control flow, the Hadoop MapReduce module is used as part of the stream analysis, for Streams operators that need to deduplicate the updated data and verify the correctness of the MapReduce model.

It is well known that data deduplication is most effective at the point where data is consumed. So, in InfoSphere Streams, records are deduplicated over a particular time window or number of records, or the incremental (changed) portion of the records is identified. The deduplicated data is then sent to Hadoop BigInsights to build a new model.


2. Hadoop to Streams flow: in this flow, Hadoop MapReduce is used to remove duplicates from the historical data, after which the MapReduce model is updated. The updated model is integrated into Streams as part of the flow, and an operator is configured mid-stream to process the incoming data.
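Independently of any particular streaming product (no InfoSphere Streams API is used here), the in-stream deduplication described in the first scenario amounts to remembering the keys of recently seen records within a bounded window and forwarding only first occurrences. A plain-Java sketch of that idea, with a hypothetical count-based window, might be:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Keeps the keys of the most recent N records; a record is forwarded only the
// first time its key is seen inside that window. The window size is hypothetical.
public class WindowedDeduplicator {

  private final Map<String, Boolean> seen;

  public WindowedDeduplicator(final int windowSize) {
    // Insertion-ordered LinkedHashMap that evicts the oldest key once full.
    this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
        return size() > windowSize;
      }
    };
  }

  // Returns true if the record is new inside the current window and should be
  // forwarded (e.g. to a Hadoop/BigInsights sink); false if it is a duplicate.
  public synchronized boolean accept(String recordKey) {
    if (seen.containsKey(recordKey)) {
      return false;
    }
    seen.put(recordKey, Boolean.TRUE);
    return true;
  }
}
```

In practice the record key would typically be a hash of the record (for example its MD5 fingerprint), and the window could be time-based rather than count-based.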

Strategy 5: Combining blocking techniques with MapReduce

Dedoop (Deduplication with Hadoop) is a prototype tool developed at the University of Leipzig that applies MapReduce to entity resolution over big data. To date, this tool covers the most mature application of MapReduce to data deduplication.


Blocking for entity matching means semantically partitioning the input data so that similar records end up in the same block, and restricting the matching to entities within the same block.

The entity resolution process is split into two MapReduce jobs: an analysis job, used mainly to compute record frequency statistics, and a matching job, which handles load balancing and similarity computation. The matching job uses a "greedy" load-balancing rule: the match tasks are sorted in descending order of the amount of data they process, and each task is assigned to the reduce task with the smallest current load.

Dedoop also uses effective techniques to avoid redundant pair comparisons. It requires the MapReduce program to state explicitly which reduce task handles which pair comparison, so that the same comparison never has to be performed on multiple nodes.
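The blocking idea can be illustrated with a toy MapReduce sketch: the mapper assigns each record a blocking key (here, hypothetically, its first three characters), and the reducer compares only the pairs of records that share that key. Dedoop's actual analysis job, load balancing, and match rules are considerably more sophisticated than this.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// A toy blocking job: records that share a blocking key land in the same reduce
// call, and only the pairs inside one block are ever compared with each other.
public class BlockingMatch {

  public static class BlockingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String record = value.toString();
      // Hypothetical blocking key: the first three characters of the record.
      String blockKey = record.length() >= 3 ? record.substring(0, 3) : record;
      context.write(new Text(blockKey), new Text(record));
    }
  }

  public static class MatchReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Materialize the block, then compare each pair within it exactly once.
      List<String> block = new ArrayList<>();
      for (Text v : values) {
        block.add(v.toString());
      }
      for (int i = 0; i < block.size(); i++) {
        for (int j = i + 1; j < block.size(); j++) {
          // Placeholder similarity test; a real tool plugs in its match rules here.
          if (block.get(i).equalsIgnoreCase(block.get(j))) {
            context.write(new Text(block.get(i)), new Text(block.get(j)));
          }
        }
      }
    }
  }
}
```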

Original link: Data deduplication tactics with HDFS and MapReduce
