Bit-map method to deal with big data problems

Last Update:2015-08-15 Source: Internet

Author: User

Tags bitset


Problem introduction:

1. Given 4 billion distinct, unordered unsigned int values and then one more number, how can you quickly determine whether that number is among the 4 billion?
2. Given a set of integers on the order of tens of millions, determine which elements are repeated.
3. Given an array on the order of tens of millions of elements, sort it.
4. Find the integers that appear only once among 500 million integers (note that there is not enough memory to hold all 500 million integers).

At this data volume, conventional approaches (a general-purpose sorting algorithm, or comparing values one by one) are clearly unsuitable, so here we introduce a different solution: the bitmap.

A bitmap uses a single bit to mark the state of one element, and the key is the position of that bit. Because data is stored at the granularity of bits, it saves a great deal of space. For example, an unsigned int has 2^32 possible values, about 4.29 billion (4G). Giving each value one bit of state takes 2^32 bits, that is, 512 MB. In other words, 512 MB of storage is enough to cover the whole 4G value range, that is, more than 4 billion values.
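The arithmetic above can be checked at compile time. This is only a small sketch; the names bitmapBytes, kValueCount, kBytes and kMiB are made up for illustration and do not appear in the original program.

```cpp
#include <cstdint>

// Bytes needed for a bitmap that holds one bit per possible value (rounded up).
constexpr std::uint64_t bitmapBytes(std::uint64_t valueCount) {
    return (valueCount + 7) / 8;
}

// A 32-bit unsigned integer has 2^32 possible values.
constexpr std::uint64_t kValueCount = std::uint64_t{1} << 32;
constexpr std::uint64_t kBytes = bitmapBytes(kValueCount);  // 536,870,912 bytes
constexpr std::uint64_t kMiB = kBytes / (1024 * 1024);      // 512

static_assert(kMiB == 512, "2^32 bits should occupy exactly 512 MiB");
```

Storing the same 2^32 values as 4-byte integers would take 16 GiB, so the bitmap is 32 times smaller.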

Here is a bitmap class I wrote in C++. The data size is passed to the constructor, which dynamically allocates the required memory, and the object can then process the user's large data set:

    #include <iostream>
    #include <fstream>
    #include <ctime>
    using namespace std;

    // 512,000,000 bytes of storage, enough to track 4.096 billion values
    const unsigned SIZE = 512000000;

    class Bitmap {
        typedef struct Byte {
            unsigned char bit8;
            static const unsigned char mask[8];  // helper array: one mask per bit of a byte
            Byte()
            {
                bit8 = 0;
            }
            // Set the given bit to 1, marking the number as stored
            void set1(unsigned at)
            {
                bit8 |= mask[at];
            }
            // Read whether the given bit is set, i.e. whether the number is stored
            bool get1(unsigned at)
            {
                return bit8 & mask[at];
            }
        } Byte;
        Byte *m_byte;
        unsigned m_size;  // capacity in bits
    public:
        Bitmap(unsigned _size)
        {
            m_byte = new Byte[(_size + 7) / 8];
            m_size = _size;
        }
        virtual ~Bitmap()
        {
            delete[] m_byte;
            m_size = 0;
        }
        // Store one value; fails if the value is outside the bitmap
        bool push(unsigned data)
        {
            if (data >= m_size)
                return false;
            m_byte[data / 8].set1(data % 8);
            return true;
        }
        // Check whether a value has been stored
        bool find(unsigned data)
        {
            return data >= m_size ? 0 : m_byte[data / 8].get1(data % 8);
        }
        // Return the number of values that can be stored
        unsigned size()
        {
            return m_size;
        }
        // Overloaded operators for the common functions
        // Store one value
        bool operator>>(unsigned data)
        {
            return push(data);
        }
        // Check whether a value has been stored
        bool operator<<(unsigned data)
        {
            return find(data);
        }
        // Access one byte-sized block of the bitmap
        Byte& operator[](unsigned i)
        {
            if (i >= (m_size + 7) / 8)  // compare against the number of allocated bytes
                throw "index out of range";
            return m_byte[i];
        }
    };
    // Helper array: one mask per bit of a byte, most significant bit first
    const unsigned char Bitmap::Byte::mask[8] = {0x80, 0x40, 0x20, 0x10, 0x8, 0x4, 0x2, 0x1};

    int main()
    {
        Bitmap bitmap(8 * SIZE);  // room for 4.096 billion values
        ifstream file("in.txt");
        unsigned read, i = 0;
        clock_t t1 = clock();
        for (i = 0; i < SIZE; ++i)
            if (file >> read)
                bitmap >> read;
            else
                break;
        file.close();
        // clock() counts ticks, so convert to milliseconds before printing
        cout << "Stored " << i << " values, time taken: "
             << (clock() - t1) * 1000 / CLOCKS_PER_SEC << " ms" << endl;
        t1 = clock();
        for (i = 0; i < 1000000; ++i)
            if (bitmap << i)
                ;
        cout << "Looked up " << i << " values, time taken: "
             << (clock() - t1) * 1000 / CLOCKS_PER_SEC << " ms" << endl;
        cout << "Please enter the data you want to look up:" << endl;
        while (cin >> read) {
            if (bitmap << read)
                cout << "Stored: " << read << endl;
            else
                cout << "Error: not stored: " << read << endl;
        }
        return 0;
    }

In a test run, the program read 60,000 randomly generated integers from a text file, stored them in the bitmap, and then timed 1,000,000 lookups against it (about 11 ms). After that, the user can type integers by hand and the program reports whether each one has been stored in the bitmap.

This solves the first problem from the introduction: just replace the input text file with the known 4 billion values (reading in 4 billion values takes a while, about 1300 seconds).

The following outlines how to solve the remaining three problems.

Question 2: Create a large enough bitmap object, then feed in the data. Before setting the bit for each value, check it: if the bit is already 1, the value is a repeat. This yields the repeated data.
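As a rough sketch of this idea, the function below uses a plain byte vector as the bitmap instead of the Bitmap class above. The name findDuplicates and the universe parameter (an exclusive upper bound on the values) are assumptions made for the example.

```cpp
#include <cstdint>
#include <vector>

// Returns every repeated occurrence found in `input`, in the order encountered.
// A value that appears three times is therefore reported twice.
// `universe` is the exclusive upper bound of the values the bitmap can track.
std::vector<std::uint32_t> findDuplicates(const std::vector<std::uint32_t>& input,
                                          std::uint64_t universe) {
    std::vector<std::uint8_t> bits((universe + 7) / 8, 0);
    std::vector<std::uint32_t> repeats;
    for (std::uint32_t v : input) {
        if (v >= universe) continue;  // out-of-range values are ignored in this sketch
        const std::uint8_t mask = static_cast<std::uint8_t>(1u << (v % 8));
        if (bits[v / 8] & mask) {
            repeats.push_back(v);     // bit already set: this value is a repeat
        } else {
            bits[v / 8] |= mask;      // first sighting: mark the bit
        }
    }
    return repeats;
}
```

For the input 5, 3, 5, 9, 3, 5 the function reports 5, 3, 5, one entry per repeated occurrence.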

Question 3: Create a large enough bitmap object and feed in the data. Then scan the bitmap from the beginning: a bit that is not 0 means the corresponding value is present, so outputting the positions of the nonzero bits in order gives the sorted array. (Printing that much output is pointless; write it to a file instead, and the data in the new file is then sorted.)
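A minimal sketch of this bitmap sort follows. It relies on std::vector<bool>, which standard libraries typically pack to one bit per element, and it assumes the input has no repeated values, since a bitmap records presence only and would collapse repeats. The names bitmapSort and universe are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Sorts distinct values by marking them in a bitmap and then scanning the
// bitmap in ascending order. `universe` is the exclusive upper bound.
std::vector<std::uint32_t> bitmapSort(const std::vector<std::uint32_t>& input,
                                      std::uint64_t universe) {
    std::vector<bool> present(universe, false);
    for (std::uint32_t v : input) {
        if (v < universe) present[v] = true;  // mark the value's bit
    }
    std::vector<std::uint32_t> sorted;
    for (std::uint64_t v = 0; v < universe; ++v) {
        if (present[v]) sorted.push_back(static_cast<std::uint32_t>(v));
    }
    return sorted;
}
```

The scan visits every bit, so the running time grows with the size of the value range rather than with the number of inputs.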

Question 4, Method 1: Create two large enough bitmap objects and feed in the data in order. Before recording a value, check whether it already exists in bitmap1 (that is, whether its bit is 1). If it does not, record it in bitmap1; if it does, record it in bitmap2. Finally, iterate over every bit in bitmap1: if a bit is 1 there but the corresponding bit in bitmap2 is not 1, the value appears only once, so output it.
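Method 1 might look like the sketch below, where the two std::vector<bool> objects seen and repeated play the roles of bitmap1 and bitmap2. The function name and the universe parameter are assumptions for illustration.

```cpp
#include <cstdint>
#include <vector>

// Returns, in ascending order, the values that appear exactly once in `input`.
std::vector<std::uint32_t> uniqueWithTwoBitmaps(const std::vector<std::uint32_t>& input,
                                                std::uint64_t universe) {
    std::vector<bool> seen(universe, false);      // plays the role of bitmap1
    std::vector<bool> repeated(universe, false);  // plays the role of bitmap2
    for (std::uint32_t v : input) {
        if (v >= universe) continue;   // out-of-range values are ignored here
        if (seen[v]) {
            repeated[v] = true;        // second or later sighting
        } else {
            seen[v] = true;            // first sighting
        }
    }
    std::vector<std::uint32_t> once;
    for (std::uint64_t v = 0; v < universe; ++v) {
        if (seen[v] && !repeated[v]) once.push_back(static_cast<std::uint32_t>(v));
    }
    return once;
}
```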

Method 2: Create one large enough bitmap object, but use two bits per value: 00 means the value has not appeared, 01 means it has appeared once, and 10 means it has appeared multiple times. As for 11, it simply sits unused. Feed in the data in order: if the corresponding pair of bits is 00, change it to 01; if it is 01, change it to 10; if it is 10, leave it unchanged. After all input is processed, traverse the whole bitmap and output every value whose bits are 01.
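The two-bit scheme can be sketched as follows, packing four 2-bit states into each byte. TwoBitMap and uniqueWithTwoBits are hypothetical names, not part of the original program.

```cpp
#include <cstdint>
#include <vector>

// Two bits of state per value: 0 = not seen, 1 = seen once, 2 = seen multiple times.
class TwoBitMap {
    std::vector<std::uint8_t> cells_;  // four 2-bit states per byte
    std::uint64_t size_;
public:
    explicit TwoBitMap(std::uint64_t size) : cells_((size + 3) / 4, 0), size_(size) {}

    std::uint64_t size() const { return size_; }

    // Reads the 2-bit state of value v (v must be below size()).
    unsigned get(std::uint64_t v) const {
        const unsigned shift = static_cast<unsigned>(v % 4) * 2;
        return (cells_[v / 4] >> shift) & 0x3u;
    }

    // Advances the state of v: 00 -> 01 -> 10, and 10 stays 10.
    void add(std::uint64_t v) {
        const unsigned state = get(v);
        if (state >= 2) return;
        const unsigned shift = static_cast<unsigned>(v % 4) * 2;
        const unsigned cleared = cells_[v / 4] & ~(0x3u << shift);
        cells_[v / 4] = static_cast<std::uint8_t>(cleared | ((state + 1) << shift));
    }
};

// Returns, in ascending order, the values that appear exactly once in `input`.
std::vector<std::uint32_t> uniqueWithTwoBits(const std::vector<std::uint32_t>& input,
                                             std::uint64_t universe) {
    TwoBitMap map(universe);
    for (std::uint32_t v : input) {
        if (v < universe) map.add(v);
    }
    std::vector<std::uint32_t> once;
    for (std::uint64_t v = 0; v < universe; ++v) {
        if (map.get(v) == 1) once.push_back(static_cast<std::uint32_t>(v));
    }
    return once;
}
```

Both methods spend two bits per possible value; this one keeps them side by side in a single array, so each input touches one byte instead of two separate bitmaps.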

So these common big-data problems can all be solved with the bitmap. But the bitmap is not a cure-all. It is only suited to storing integer data. Only unsigned values are considered here, though a signed int works too once it is mapped onto the unsigned range. Even so, a bitmap can only handle data on the scale of billions of values. What if the data volume is larger, or the type is not just integers?
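The article does not say how a signed int should be mapped. One possible mapping, shown here purely as an assumption, flips the sign bit so that the smallest int lands on bit index 0 and ordering is preserved. The helper names toBitIndex and fromBitIndex are made up for this sketch.

```cpp
#include <cstdint>

// Maps INT32_MIN..INT32_MAX onto 0..UINT32_MAX while preserving order,
// by flipping the sign bit of the two's complement representation.
constexpr std::uint32_t toBitIndex(std::int32_t v) {
    return static_cast<std::uint32_t>(v) ^ 0x80000000u;
}

// Inverse mapping. The conversion back to int32_t assumes two's complement,
// which C++20 guarantees and mainstream compilers provide in earlier standards.
constexpr std::int32_t fromBitIndex(std::uint32_t index) {
    return static_cast<std::int32_t>(index ^ 0x80000000u);
}
```

With this mapping, a sorted scan of the bitmap still yields the signed values in ascending order.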

For example, suppose you need to write a web spider (web crawler). Because links on the web are so intertwined, a spider crawling from page to page is likely to run into cycles. To avoid a cycle, it needs to know which URLs it has already visited. Given a URL, how do you know whether the spider has visited it?

The following approaches come to mind readily:

1. Save all visited URLs in a database.

2. Save the visited URLs in a HashSet, so that checking whether a URL has been visited costs close to O(1).

3. Pass each URL through a one-way hash such as MD5 or SHA-1, then save the digest in a HashSet or database.

4. Bit-map method: create a BitSet and map each URL to one bit through a hash function.

Methods 1 to 3 keep a full record of each visited URL (or of its digest), whereas Method 4 only marks the one bit that a URL maps to.

These approaches work perfectly well when the amount of data is small, but problems arise once the data becomes very large.

Method 1: As the data volume becomes very large, relational database queries become very inefficient. And isn't launching a database query for every single URL overkill?

Method 2: It consumes too much memory, and memory use grows with the number of URLs. Even with only 100 million URLs of just 50 characters each, it needs 5 GB of memory.

Method 3: An MD5 digest is only 128 bits long and a SHA-1 digest only 160 bits, so Method 3 uses several times less memory than Method 2.

Method 4: It uses relatively little memory, but the drawback is that the collision probability of a single hash function is too high. Remember the various collision-resolution schemes for hash tables from your data structures class? To bring the collision probability down to 1%, the BitSet must be 100 times as long as the number of URLs.
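Method 4 could be sketched like this. The class name UrlBitset is hypothetical, and std::hash<std::string> merely stands in for whatever single hash function a real crawler would choose; its output differs between standard library implementations.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One bit per hash bucket: a URL is "visited" if the bit it hashes to is set.
// Two different URLs can hash to the same bit, which causes false positives.
class UrlBitset {
    std::vector<bool> bits_;

    std::size_t indexOf(const std::string& url) const {
        return std::hash<std::string>{}(url) % bits_.size();
    }
public:
    // bitCount must be greater than zero.
    explicit UrlBitset(std::size_t bitCount) : bits_(bitCount, false) {}

    void markVisited(const std::string& url) { bits_[indexOf(url)] = true; }

    // True for every URL that was marked, and also for any unvisited URL
    // that happens to collide with a marked one.
    bool maybeVisited(const std::string& url) const { return bits_[indexOf(url)]; }
};
```

Following the 1% figure above, a crawler expecting 100 million URLs would construct this with about 10 billion bits, roughly 1.25 GB.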

But what if we are willing to tolerate a small rate of false positives? Can Method 4 be improved to do the job? This is exactly the idea behind the Bloom filter. Proposed by Bloom in 1970, the Bloom filter is a fast lookup algorithm based on multiple hash functions. It is typically used where you need to decide quickly whether an element belongs to a set but do not require the answer to be strictly 100% correct. The idea improves on Method 4: instead of mapping to one bit, each URL is mapped to k bits through k hash functions, and a new URL is judged as already visited only when all k of its bits are 1. Misjudgments are possible, but studies have shown that with a suitable k and bitmap size the false positive rate can be made small enough to ignore.
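A compact sketch of a Bloom filter follows. The article does not specify which k hash functions to use; as an assumption, this version derives them from two base hashes (std::hash and a hand-written FNV-1a) combined as h1 + i * h2, a common double-hashing shortcut. The class and method names are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
    std::vector<bool> bits_;
    std::size_t hashCount_;  // the k in the description above

    // 64-bit FNV-1a, used here as the second base hash.
    static std::uint64_t fnv1a(const std::string& s) {
        std::uint64_t h = 14695981039346656037ull;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ull;
        }
        return h;
    }

    // i-th derived hash: h1 + i * h2, reduced to a bit index.
    std::size_t index(const std::string& key, std::size_t i) const {
        const std::uint64_t h1 = std::hash<std::string>{}(key);
        const std::uint64_t h2 = fnv1a(key) | 1;  // odd, so the stride is never zero
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }
public:
    // bitCount and hashCount must both be greater than zero.
    BloomFilter(std::size_t bitCount, std::size_t hashCount)
        : bits_(bitCount, false), hashCount_(hashCount) {}

    // Sets all k bits for the key.
    void add(const std::string& key) {
        for (std::size_t i = 0; i < hashCount_; ++i) bits_[index(key, i)] = true;
    }

    // False means definitely not added; true means probably added.
    bool mightContain(const std::string& key) const {
        for (std::size_t i = 0; i < hashCount_; ++i) {
            if (!bits_[index(key, i)]) return false;
        }
        return true;
    }
};
```

A crawler would call mightContain before fetching a URL and add after fetching it. A key that was added is never reported as absent; the only error is an occasional unvisited URL being reported as visited.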

Of course, these problems can also be handled with MapReduce; after all, MapReduce is a dedicated, professional big-data processing technology!


Welcome to my blog. I'm new to this, so please point out any mistakes, and if you repost this article, please credit the source: http://www.cnblogs.com/webary/p/4733247.html
