Massive data processing

This article works through a few practical massive data processing problems, applying data structures along the way: the hash table, the bitmap, and the Bloom filter. If you are not yet familiar with the hash table, bitmap, or Bloom filter, see the implementations on [GitHub](https://github.com/jacksparrowwang/cg19.github.com/tree/master/Data%20Structure).

Massive data processing problems generally involve a few difficulties: the file is too large to load into memory, its contents cannot be searched quickly, and we still need to compute statistics over an enormous amount of data. There is a fish in the north, and its name is Kun; a Kun so big that one pot cannot stew it.

Big data is just like that Kun. So how do we deal with it? Let's analyze a few instances.

Given a file larger than 100G in size containing IP addresses, find the most frequently occurring IP address (hash file segmentation).

We are given a 100G file holding a huge number of IP addresses, and we want to find the one that occurs most frequently. Someone might say: let the computer search through them bit by bit. But 100G of IP addresses cannot be loaded into memory on an ordinary computer, so we must optimize for space. This is exactly the fish from before: its name is Kun, and a Kun so big cannot be stewed in one pot.
Well then, we will have to work around it: we can use hash splitting to divide the data into several parts and load each part into memory individually.

What is hash splitting? Hash splitting uses each element's hash value to partition the data, so that equal elements land in the same set. For example, suppose we want to cut the 100G file into 100 parts (if the parts are still very large, cut into 1000 instead): read each IP address, compute its hash value, and take it modulo 100; if the result is 0, put the IP into set 0, if 1, into set 1, and so on. Once all elements have been traversed, we have 100 sets (assuming the hash algorithm is good enough that the values are distributed evenly). Identical IPs are guaranteed to fall into the same set, because the same IP always has the same hash value. Now we load each set into memory one at a time and count with a hash table, turning a one-pot-stew problem into many pots stewed separately.
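
A minimal sketch of this approach in C++, assuming an input file named `ip.log` and 100 buckets (both names are illustrative):

```cpp
#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const size_t kBuckets = 100;  // number of pieces to cut the big file into

    // Phase 1: scatter every IP into a bucket file chosen by hash(ip) % 100.
    // The same IP always hashes to the same bucket.
    {
        std::ifstream in("ip.log");  // the 100G input file (name is an assumption)
        std::vector<std::ofstream> buckets(kBuckets);
        for (size_t i = 0; i < kBuckets; ++i)
            buckets[i].open("bucket_" + std::to_string(i) + ".txt");
        std::string ip;
        while (std::getline(in, ip))
            buckets[std::hash<std::string>{}(ip) % kBuckets] << ip << '\n';
    }

    // Phase 2: each bucket now fits in memory; count it with a hash table
    // and keep the global maximum across all buckets.
    std::string best_ip;
    size_t best_count = 0;
    for (size_t i = 0; i < kBuckets; ++i) {
        std::ifstream in("bucket_" + std::to_string(i) + ".txt");
        std::unordered_map<std::string, size_t> counts;
        std::string ip;
        while (std::getline(in, ip))
            if (++counts[ip] > best_count) {
                best_count = counts[ip];
                best_ip = ip;
            }
    }
    std::cout << best_ip << " appeared " << best_count << " times\n";
}
```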

Given 10 billion integers, find the integers that appear only once (a bitmap variant, using two bits per value).

The main idea of this problem is, again, to optimize for space. Finding an element is simple, but searching over big data with a plain in-memory set costs enormous space (10 billion 4-byte integers are about 40G), so how do we reduce the space overhead?
This is why we learned the data structure called a bitmap: a bitmap marks whether an element exists using the smallest possible space. A plain bitmap uses one bit per value to record presence. To find the values that appear exactly once, we use two bits per value instead: 00 means the value never appeared, 01 means it appeared once, and 10 means it appeared two or more times. A final scan for values marked 01 gives the answer. Covering the whole 32-bit integer range at two bits per value costs 2^32 × 2 bits = 1G of memory, no matter how many integers are processed.
The benefit is a greatly reduced space overhead, and the idea extends: for example, to find data that appears 5 times, we can use three bits per value as a small saturating counter.
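
A minimal sketch of the two-bit bitmap; the class name and the small value range are illustrative:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Two bits per value: 00 = never seen, 01 = seen once,
// 10 = seen two or more times (the counter saturates).
class TwoBitMap {
public:
    explicit TwoBitMap(uint64_t range) : bits_((range * 2 + 7) / 8, 0) {}

    void add(uint64_t v) {
        int state = get(v);
        if (state < 2) set(v, state + 1);  // saturate at "two or more"
    }
    bool appeared_once(uint64_t v) const { return get(v) == 1; }

private:
    int get(uint64_t v) const {
        return (bits_[v / 4] >> ((v % 4) * 2)) & 0x3;
    }
    void set(uint64_t v, int state) {
        int shift = (v % 4) * 2;
        bits_[v / 4] = static_cast<uint8_t>(
            (bits_[v / 4] & ~(0x3 << shift)) | (state << shift));
    }
    std::vector<uint8_t> bits_;
};

int main() {
    TwoBitMap map(1 << 16);  // small range for demonstration
    for (uint64_t v : {7, 42, 42, 9, 7, 7}) map.add(v);
    for (uint64_t v = 0; v < (1 << 16); ++v)
        if (map.appeared_once(v))
            std::cout << v << " appears exactly once\n";  // prints 9
}
```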

There are two files, each with 10 billion queries (query strings), and only 1G of memory; find the intersection of the two files (hash file segmentation + Bloom filter).

As mentioned above, for finding and counting over big data we use hash splitting, whose main role is to greatly reduce space consumption.
Here both files hold 10 billion queries, so we hash-split each one the same way. Say we want to divide the data into 100 parts: hash each query and take the value modulo 100; if the remainder is 0, the query goes into set 0 of that file, and so on. Because both files use the same hash function and modulus, a query common to both files must land in the same-numbered set of each, so we only ever compare matching pairs of sets. Insert the first file's set 0 into a Bloom filter, then look up every element of the second file's set 0 in the filter; the hits form the intersection for set 0. Then we do the lookup for the next pair of sets, and so on. Note that a Bloom filter can return false positives, so the result is approximate; if an exact intersection is required, verify the candidates with a per-set hash table.
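
A minimal Bloom filter sketch for one pair of same-numbered sets; the bit count, the number of probes, and the salt used to derive a second hash are all illustrative choices:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A small Bloom filter: k probe positions derived from two base hashes
// (the classic double-hashing trick). It may report false positives,
// but never false negatives.
class BloomFilter {
public:
    BloomFilter(size_t bits, int k) : bits_(bits, false), k_(k) {}

    void insert(const std::string& s) {
        for (int i = 0; i < k_; ++i) bits_[probe(s, i)] = true;
    }
    bool maybe_contains(const std::string& s) const {
        for (int i = 0; i < k_; ++i)
            if (!bits_[probe(s, i)]) return false;
        return true;
    }

private:
    size_t probe(const std::string& s, int i) const {
        size_t h1 = std::hash<std::string>{}(s);
        size_t h2 = std::hash<std::string>{}(s + "#");  // second hash via a salt
        return (h1 + i * h2) % bits_.size();
    }
    std::vector<bool> bits_;
    int k_;
};

int main() {
    // Stand-ins for set 0 of file A and set 0 of file B after hash splitting.
    std::vector<std::string> set_a = {"cat", "dog", "fish"};
    std::vector<std::string> set_b = {"dog", "bird", "fish"};

    BloomFilter filter(1 << 20, 4);
    for (const auto& q : set_a) filter.insert(q);

    for (const auto& q : set_b)
        if (filter.maybe_contains(q))
            std::cout << q << " is (probably) in the intersection\n";
}
```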

Given thousands of files, each 1K–100M in size, design an algorithm to find which files a given word exists in (inverted index).

This problem is similar to what search engines such as Baidu do: you enter one or two keywords and it quickly returns hundreds or thousands of file links. How do they do it?
This uses the hash table of key-value pairs to build an inverted index (if you do not know how a hash table is implemented, follow the link above). A forward index numbers each file and uses the number as the key: given a number, we can find the corresponding file contents. Now reverse it: take each keyword in a file as the key and the file's number as the value. Then, given a word, we can find the files that contain it directly.
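
A minimal inverted index sketch; the toy documents are made up for illustration:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Forward view: file number -> file contents (toy documents, assumption).
    std::vector<std::string> files = {
        "the quick brown fox",  // file 0
        "the lazy dog",         // file 1
        "quick dog tricks",     // file 2
    };

    // Inverted index: word -> list of file numbers containing it.
    std::unordered_map<std::string, std::vector<size_t>> index;
    for (size_t id = 0; id < files.size(); ++id) {
        std::istringstream words(files[id]);
        std::string w;
        while (words >> w)
            if (index[w].empty() || index[w].back() != id)
                index[w].push_back(id);  // avoid duplicate ids per file
    }

    // Lookup: which files contain "quick"?
    for (size_t id : index["quick"])
        std::cout << "file " << id << '\n';  // prints 0 and 2
}
```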

Regarding keyword search in a search engine, it is split across two kinds of servers. One is the online server, like the Baidu search page you open right now; the other is the offline server. Why have an offline server at all? It exists to build the inverted index and to improve efficiency: Baidu, for example, regularly updates its data, so newly crawled pages are processed offline, keywords are matched to file numbers there, and the online servers then simply load the result.

Other uses of hash splitting

A large company generally has more than one server, which raises a problem: when someone logs in, how do we quickly find their account in a huge community of servers? When designing the server community, we again use hash splitting: for example, when you register, your account is hashed, and the modulus of that hash against the number of servers decides which server manages the account. This greatly reduces the time otherwise spent searching one by one.
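
A minimal sketch of this routing, assuming a fixed cluster of 16 servers and made-up account names:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Route an account to a server by hashing it and taking the modulus.
size_t server_for(const std::string& account, size_t num_servers) {
    return std::hash<std::string>{}(account) % num_servers;
}

int main() {
    const size_t kServers = 16;  // cluster size is an assumption
    for (const std::string& account : {"alice", "bob", "carol"})
        std::cout << account << " -> server "
                  << server_for(account, kServers) << '\n';
}
```

One design caveat: with a plain modulus, changing the number of servers remaps almost every account, which is why real clusters often use consistent hashing instead.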

If there is any mistake, please correct it, thank you.
