Using MapReduce and HDFS to Deduplicate Massive Data

Source: Internet
Author: User
Tags: hadoop, mapreduce

From: http://www.csdn.net/article/2013-03-25/2814634-data-de-duplication-tactics-with-hdfs

Abstract: With the surge in the volume of collected data, deduplication has become one of the challenges faced by many big data practitioners. Deduplication offers significant advantages in reducing storage and network bandwidth consumption and helps with scalability. In storage architectures, common deduplication methods include hashing, binary comparison, and incremental differencing. This article focuses on using MapReduce and HDFS to deduplicate data.

With the rapid growth in the amount of stored information, more and more attention is being paid to techniques for reducing stored data. Data compression, single-instance storage, and deduplication are all commonly used data reduction technologies.

Deduplication usually refers to removing redundant sub-files. Unlike compression, deduplication does not change the data itself; it eliminates the storage capacity consumed by identical copies of the same data. Deduplication offers significant advantages in reducing storage and network bandwidth consumption and helps with scalability.

A simple example: in the programs telecom operators use to deduplicate call detail records, we can see deduplication at work. Similarly, the same technique can be used to optimize communication networks that carry identical data packets.

In storage architectures, common deduplication methods include hashing, binary comparison, and incremental differencing. In this article, we focus on how to use MapReduce and HDFS to eliminate duplicate data. (The methods listed below include some experimental approaches from researchers, so it is more appropriate to call them strategies.)

Strategy 1: Use only HDFS and MapReduce

Owen O'Malley suggested the following approach in a forum post:

Sort your historical data by MD5 value, then run a MapReduce job to sort your new data by MD5 as well. Note that you need to partition all of the data evenly, which is easy because MD5 values are evenly distributed across the key space.

Basically, you pick the number of reduce tasks (for example, 256) and use the first N bits of the MD5 value to assign records to reduce tasks. Since this job only processes your new data, it is very fast. Next, you perform a map-side join: each merged input split covers a range of MD5 values, and the RecordReader reads the historical and new datasets and merges them (you can use the map-side join library for this). Your map task merges the new and old data; because it is a map-only job, it is also very fast.
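As a minimal illustration of this partitioning idea (a sketch under assumed text inputs, not O'Malley's actual code), the mapper below keys each record by its MD5 digest, and a custom partitioner uses the first byte of that digest to spread keys evenly over the reduce tasks:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

// Mapper: key every record by its MD5 digest so the shuffle sorts the data by MD5.
class Md5SortMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(record.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            context.write(new Text(hex.toString()), record);
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
    }
}

// Partitioner: because MD5 output is uniform, the first byte of the digest
// splits the key space evenly (with 256 reducers it maps one byte value per reducer).
class Md5PrefixPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text md5Hex, Text record, int numReduceTasks) {
        int firstByte = Integer.parseInt(md5Hex.toString().substring(0, 2), 16);
        return (firstByte * numReduceTasks) / 256;
    }
}
```

With 256 reducers the partitioner maps each leading byte to its own reducer; with fewer reducers it scales that byte down proportionally, which keeps the partitions balanced precisely because MD5 is evenly distributed.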

Of course, if the new data is small enough, you can read it inside every map task and keep the new records that fall within the task's key range sorted in RAM, so that the merge can be performed entirely in memory. This lets you avoid sorting the new data separately. Pig and Hive hide many of the details of merge optimizations like this from developers.

 

Strategy 2: Use HDFS and HBase

Zhe Sun, Jun Shen, and Jianming Yong proposed a method that uses HDFS and HBase. It works as follows:

    • Use the MD5 and SHA-1 hash functions to compute a file's hash value, and then pass that value to HBase.
    • Compare the new hash value with the existing values. If the new value already exists in the HBase deduplication table, HDFS checks the number of links; if it is not zero, the counter associated with that hash value is incremented by one. If the count is zero, or the hash value does not yet exist in the deduplication table, HDFS asks the client to upload the file and updates the file's logical path (a minimal sketch of this lookup is shown after the list).
    • HDFS stores the user-uploaded source files and the corresponding automatically generated link files. A link file records the hash value of the source file and the logical path of the source file.
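Below is a minimal sketch of the lookup and counter update against an HBase deduplication table, using the standard HBase client API. The table name "dedup", the column family "cf", and the "count"/"path" qualifiers are illustrative assumptions, not the schema used by the authors:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupTableClient {
    private static final byte[] CF = Bytes.toBytes("cf");
    private static final byte[] COUNT = Bytes.toBytes("count");
    private static final byte[] PATH = Bytes.toBytes("path");

    /**
     * Returns true if the file identified by this hash is already stored, in which
     * case only its reference counter is incremented; returns false if the caller
     * still has to upload the file.
     */
    public static boolean checkAndRecord(Connection conn, String fileHash,
                                         String logicalPath) throws java.io.IOException {
        try (Table table = conn.getTable(TableName.valueOf("dedup"))) {
            byte[] row = Bytes.toBytes(fileHash);
            Result existing = table.get(new Get(row));
            if (!existing.isEmpty()) {
                // Hash already known: just bump the link counter by one.
                table.incrementColumnValue(row, CF, COUNT, 1L);
                return true;
            }
            // Unknown hash: record its logical path and ask the caller to upload the file.
            Put put = new Put(row);
            put.addColumn(CF, PATH, Bytes.toBytes(logicalPath));
            table.put(put);
            table.incrementColumnValue(row, CF, COUNT, 1L);
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            boolean duplicate = checkAndRecord(conn, "md5sha1-of-file", "/user/demo/file.txt");
            System.out.println(duplicate ? "duplicate, counter incremented" : "new file, please upload");
        }
    }
}
```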

 

Note the following key points:

    • To deduplicate at the file level, keep the number of index entries as small as possible so that lookups remain efficient.
    • MD5 and SHA-1 must be used in combination to avoid accidental collisions (a small sketch of combining the two digests follows this list).
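One simple way to combine the two digests (an illustrative sketch, not the authors' code) is to stream the file once through both MessageDigest instances and concatenate the hex-encoded MD5 and SHA-1 values into a single index key:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class CombinedFileHash {

    /** Streams the file once through both digests and returns "md5-sha1" in hex. */
    public static String combinedHash(String path) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new DigestInputStream(
                new DigestInputStream(Files.newInputStream(Paths.get(path)), md5), sha1)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading the stream drives both digests; nothing else to do here
            }
        }
        return toHex(md5.digest()) + "-" + toHex(sha1.digest());
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(combinedHash(args[0]));
    }
}
```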

 

Strategy 3: Use HDFS, MapReduce, and a storage controller

NetApp engineers Ashish Kathpal, Gaurav Makkar, and Mathew John, in an article describing a distributed duplicate-detection method for post-process deduplication, propose replacing NetApp's existing duplicate-detection stage with a duplicate-detection mechanism based on Hadoop MapReduce. The Hadoop workflow for duplicate detection described in the article includes the following steps:

    • Migrate the data fingerprints from the storage controller to HDFS.
    • Generate a data fingerprint database and store it persistently on HDFS.
    • Use MapReduce to filter duplicate records out of the fingerprint record set, and ship the deduplicated fingerprint table back to the storage controller (a minimal MapReduce sketch follows below).

A data fingerprint is the hash index computed over a file block in the storage system. A fingerprint is generally much smaller than the data block it represents, which reduces the amount of data transmitted over the network during distributed detection.
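A minimal sketch of the fingerprint-filtering MapReduce job might look like the following, assuming the fingerprint records are stored one per line as text. The mapper emits each fingerprint as a key, so the shuffle groups all duplicate occurrences together, and the reducer writes each distinct fingerprint exactly once:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FingerprintDedup {

    // Mapper: emit the fingerprint itself as the key so the shuffle groups duplicates.
    public static class FingerprintMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text fingerprint, Context context)
                throws IOException, InterruptedException {
            context.write(fingerprint, NullWritable.get());
        }
    }

    // Reducer: each distinct fingerprint arrives exactly once as a key; writing it
    // once drops all duplicate occurrences.
    public static class FingerprintReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text fingerprint, Iterable<NullWritable> ignored, Context context)
                throws IOException, InterruptedException {
            context.write(fingerprint, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fingerprint-dedup");
        job.setJarByClass(FingerprintDedup.class);
        job.setMapperClass(FingerprintMapper.class);
        job.setReducerClass(FingerprintReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // fingerprint records, one per line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // deduplicated fingerprint table
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```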

 

Strategy 4: Use streaming, HDFS, and MapReduce

For integrating Hadoop with streaming applications, there are basically two possible scenarios. Taking the integration of IBM InfoSphere Streams and BigInsights as an example, the scenarios are:

1. Streams-to-Hadoop flow: Through a control flow, the Hadoop MapReduce module is used as part of the data stream analysis. For the Streams operators, the updated data needs to be checked and deduplicated, and the correctness of the MapReduce model can then be verified.

As is well known, deduplication is most effective during data ingestion. Therefore, in InfoSphere Streams, records within a specific time window or record count are deduplicated, or the incremental part of the records is identified; the deduplicated data is then sent to Hadoop BigInsights to build the new model.
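InfoSphere Streams operator code is not shown here; as a generic, language-neutral illustration of deduplicating records within a bounded window before they are forwarded to the Hadoop side, a small sketch might look like this (the window size and the use of a fingerprint string are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Keeps the fingerprints of the last N records; a record is forwarded to the
 * batch (Hadoop) side only the first time its fingerprint is seen inside the window.
 */
public class WindowedDeduplicator {
    private final Map<String, Boolean> window;

    public WindowedDeduplicator(final int windowSize) {
        // A LinkedHashMap in insertion order that evicts its eldest entry acts as
        // a fixed-size sliding window of recently seen fingerprints.
        this.window = new LinkedHashMap<String, Boolean>(windowSize, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > windowSize;
            }
        };
    }

    /** Returns true if the record should be forwarded (i.e. it is not a recent duplicate). */
    public boolean offer(String recordFingerprint) {
        return window.put(recordFingerprint, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        WindowedDeduplicator dedup = new WindowedDeduplicator(3);
        for (String record : new String[] {"a", "b", "a", "c", "d", "a"}) {
            System.out.println(record + " -> " + (dedup.offer(record) ? "forward" : "drop"));
        }
    }
}
```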


2. Hadoop-to-Streams flow: Here, Hadoop MapReduce is used to remove duplicates from historical data, after which the MapReduce model is updated. The updated model is integrated into the Streams application, and an operator is configured mid-stream to process the incoming data.

Strategy 5: Use MapReduce with blocking techniques

Dedoop (Deduplication with Hadoop), a prototype tool developed at the University of Leipzig, applies MapReduce to entity resolution over big data. So far, this tool represents the most mature application of MapReduce to deduplication.


Blocking-based entity matching means partitioning the input data semantically into blocks of similar records and restricting matching to entities within the same block.

Entity resolution is split into two MapReduce jobs: an analysis job that mainly counts the occurrence frequency of records, and a matching job that handles load balancing and similarity computation. The matching job uses "greedy" load balancing: match tasks are sorted in descending order of the amount of data they process, and each is assigned to the reduce task that currently has the smallest load.
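The following sketch illustrates the greedy assignment described above (an illustration only, not Dedoop's actual implementation): blocks are sorted by descending comparison count, and each block is handed to the reduce task with the smallest current load:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class GreedyMatchTaskAssignment {

    /** A block of entities; the matching cost grows quadratically with the block size. */
    static final class Block {
        final String name;
        final long comparisons;
        Block(String name, long entities) {
            this.name = name;
            this.comparisons = entities * (entities - 1) / 2; // pairwise comparisons
        }
    }

    static final class ReduceTask {
        final int id;
        long load = 0;
        final List<String> blocks = new ArrayList<>();
        ReduceTask(int id) { this.id = id; }
    }

    /** Sort blocks by descending cost and always give the next block to the least-loaded reducer. */
    static List<ReduceTask> assign(List<Block> blocks, int numReducers) {
        blocks.sort(Comparator.comparingLong((Block b) -> b.comparisons).reversed());
        PriorityQueue<ReduceTask> byLoad =
                new PriorityQueue<>(Comparator.comparingLong((ReduceTask t) -> t.load));
        List<ReduceTask> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            ReduceTask t = new ReduceTask(i);
            reducers.add(t);
            byLoad.add(t);
        }
        for (Block block : blocks) {
            ReduceTask least = byLoad.poll();   // reducer with the smallest current load
            least.blocks.add(block.name);
            least.load += block.comparisons;
            byLoad.add(least);                  // re-insert with its updated load
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>(Arrays.asList(
                new Block("B1", 1000), new Block("B2", 400),
                new Block("B3", 300), new Block("B4", 250)));
        for (ReduceTask t : assign(blocks, 2)) {
            System.out.println("reduce " + t.id + " load=" + t.load + " blocks=" + t.blocks);
        }
    }
}
```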

Dedoop also uses effective techniques to avoid redundant pair comparisons. It requires the MapReduce program to define explicitly which reduce task handles which pair comparison, so that the same pair comparison is never performed on multiple nodes.
