Classic Big Data Problems


With the rapid development of information technology, more and more data is waiting to be processed. How do you quickly find the data you need among such massive data? That is the big data processing problem. Here is an analysis of a few classic big data problems.

One. Design an algorithm to find the IP address that accesses Baidu most frequently in one day.

Analysis: Write all the IPs to a large file. Since an IP address in dotted-decimal notation is 32 bits, there are at most 2^32 distinct IPs. We can use a mapping approach, e.g. hashing each IP modulo 1024, to split the large file into 1024 small files; then load each small file into memory and find its most frequent IP (a hash_map can be used for the frequency counting), and finally pick the most frequent IP among those 1024 local winners — that is the overall most frequent IP.

The idea of the algorithm is as follows (divide and conquer + hash):

1. There are 2^32 ≈ 4G possible IP addresses, so all of them cannot be loaded into memory directly.

2. Apply the idea of "divide and conquer": compute hash(IP) % 1024 and scatter the massive set of IPs into 1024 small files, so that each small file contains at most (2^32)/(2^10) = 4M distinct IP addresses.

3. For each small file, build a hash_map with the IP as the key and its number of occurrences as the value, and compare the values to find the most frequent IP in that file.

4. The steps above yield the 1024 locally most frequent IPs; a pass over these 1024 candidates with any selection or sorting method then finds the globally most frequent IP, as sketched below.
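A minimal sketch of this pipeline in C++ (the log file name, the one-IP-per-line format, and std::hash as the scatter function are assumptions):

#include <fstream>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    const size_t kBuckets = 1024;

    // Phase 1: scatter the big log into 1024 small files by hash(IP) % 1024,
    // so that every occurrence of one IP lands in the same bucket.
    // (Assumes the OS open-file limit allows 1024+ simultaneous files;
    // otherwise raise the limit or reopen buckets in append mode.)
    {
        std::ifstream big("ip.log");               // assumed: one IP string per line
        std::vector<std::ofstream> buckets(kBuckets);
        for (size_t i = 0; i < kBuckets; ++i)
            buckets[i].open("bucket_" + std::to_string(i) + ".txt");
        std::string ip;
        while (std::getline(big, ip))
            buckets[std::hash<std::string>{}(ip) % kBuckets] << ip << '\n';
    }

    // Phase 2: count each bucket in memory with a hash map and keep the local
    // winner; the best of the 1024 local winners is the global winner.
    std::string bestIp;
    size_t bestCount = 0;
    for (size_t i = 0; i < kBuckets; ++i) {
        std::ifstream in("bucket_" + std::to_string(i) + ".txt");
        std::unordered_map<std::string, size_t> freq;
        std::string ip;
        while (std::getline(in, ip))
            ++freq[ip];
        for (const auto& kv : freq)
            if (kv.second > bestCount) {
                bestCount = kv.second;
                bestIp = kv.first;
            }
    }
    std::cout << bestIp << " appeared " << bestCount << " times\n";
    return 0;
}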

Two. Given two files containing 10 billion integers and only 1GB of memory, how do you find the intersection of the two files?

Analysis: We know that integer data, whether signed or unsigned, has only 2^32 ≈ 4G distinct values in total (so 10 billion integers must contain duplicates). We can solve this with a bitmap: if one bit represents one integer value, covering all 4G values takes 2^32 bits = 512MB of memory. The approach is to map the data of the first file into the bitmap, then compare each number of the second file against it; every number present in both files belongs to the intersection (duplicate data appears only once in the intersection).
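A minimal sketch of the bitmap intersection, assuming 32-bit integers stored one per line in hypothetical files a.txt and b.txt (signed values keep their 32-bit pattern):

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // One bit per possible 32-bit value: 2^32 bits = 512MB.
    std::vector<uint8_t> bitmap(1ull << 29, 0);
    auto set  = [&](uint32_t v) { bitmap[v >> 3] |= (uint8_t)(1u << (v & 7)); };
    auto test = [&](uint32_t v) { return (bitmap[v >> 3] >> (v & 7)) & 1u; };

    std::ifstream a("a.txt");                      // assumed: one integer per line
    long long x;
    while (a >> x)
        set((uint32_t)x);                          // mark every value of the first file

    std::ifstream b("b.txt");
    std::ofstream out("intersection.txt");
    while (b >> x) {
        uint32_t v = (uint32_t)x;
        if (test(v)) {
            out << x << '\n';
            bitmap[v >> 3] &= (uint8_t)~(1u << (v & 7));  // clear so duplicates print once
        }
    }
    return 0;
}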

Three. Assume a file contains 10 billion integers and you have 1GB of memory; how do you find the numbers that occur no more than two times?

Analysis: This problem also needs the bitmap idea. From problem two we learned that one bit per value can only tell whether a value exists, so a single bit cannot identify the numbers occurring no more than two times; consider a two-bit bitmap instead.

Following this analysis, we can use two bits to record the existence and occurrence count of a number: 00 means absent, 01 means it appears once, 10 twice, and 11 more than twice. The memory computation is similar to problem two: where one bit per number requires 512MB, two bits per number require (2^32 × 2) bits = 2^33 bits = 2^30 bytes = 1GB. Map all the data into this bitmap; every number whose two bits are 01 or 10 (present, but not 11) occurs no more than two times, which solves the problem.
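A minimal sketch of the two-bit bitmap (the data file name and one-integer-per-line format are assumptions); each value gets a saturating two-bit counter 00 → 01 → 10 → 11:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // Two bits per possible 32-bit value: 2^33 bits = 2^30 bytes = 1GB.
    std::vector<uint8_t> twobit(1ull << 30, 0);
    auto get = [&](uint32_t v) -> unsigned {
        return (twobit[v >> 2] >> ((v & 3) * 2)) & 3;
    };
    auto bump = [&](uint32_t v) {                  // 00 -> 01 -> 10 -> 11, then saturate
        unsigned s = get(v);
        if (s < 3) {
            twobit[v >> 2] &= ~(3u << ((v & 3) * 2));
            twobit[v >> 2] |= (s + 1) << ((v & 3) * 2);
        }
    };

    std::ifstream in("data.txt");                  // assumed: one integer per line
    long long x;
    while (in >> x)
        bump((uint32_t)x);                         // signed values keep their 32-bit pattern

    // States 01 (once) and 10 (twice) are the numbers occurring at most twice.
    for (uint64_t v = 0; v < (1ull << 32); ++v) {
        unsigned s = get((uint32_t)v);
        if (s == 1 || s == 2)
            std::cout << (uint32_t)v << '\n';
    }
    return 0;
}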

Topic extension: With other conditions unchanged, if you are given only 512MB of memory, how do you find the numbers that occur no more than two times?

Analysis: Process the data in batches. If the numbers are signed, first handle the non-negative values and then the negative ones; each half of the range covers 2^31 values, and 2^31 × 2 bits = 512MB, so 512MB is exactly enough for each pass. A sketch follows.
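A sketch of the 512MB variant under the same assumptions as problem three: two passes, each covering half of the signed range with its own 2^31 × 2 bits = 512MB map:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    // Pass 0 covers 0 .. 2^31-1, pass 1 covers -2^31 .. -1;
    // each half needs 2^31 values * 2 bits = 2^29 bytes = 512MB.
    for (int pass = 0; pass < 2; ++pass) {
        std::vector<uint8_t> twobit(1ull << 29, 0);
        std::ifstream in("data.txt");              // assumed: one integer per line
        long long x;
        while (in >> x) {
            if ((x >= 0) != (pass == 0)) continue; // value belongs to the other pass
            uint32_t v = (pass == 0) ? (uint32_t)x : (uint32_t)(x + (1ll << 31));
            unsigned s = (twobit[v >> 2] >> ((v & 3) * 2)) & 3;
            if (s < 3) {
                twobit[v >> 2] &= ~(3u << ((v & 3) * 2));
                twobit[v >> 2] |= (s + 1) << ((v & 3) * 2);
            }
        }
        for (uint64_t v = 0; v < (1ull << 31); ++v) {  // report states 01 and 10
            unsigned s = (twobit[v >> 2] >> ((v & 3) * 2)) & 3;
            if (s == 1 || s == 2)
                std::cout << (long long)v - (pass ? (1ll << 31) : 0) << '\n';
        }
    }
    return 0;
}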

Four. Given two files with 10 billion queries and only 1GB of memory, how do you find the intersection of the two files? Give an exact algorithm and an approximate algorithm respectively.

Analysis: For strings, the first thing that should come to mind is the Bloom filter, and the approximate algorithm for problem four does use a Bloom filter. The reason the Bloom filter is an approximate algorithm is that it has a certain misjudgment rate ("absent" is certain, "present" is not). To determine the intersection of string files exactly, we can use divide and conquer: cut the large files into small files, bring the small files into memory one by one for comparison, and find the corresponding intersection.

1. Approximate solution with a Bloom filter:

Based on several different string hash algorithms, we can compute several different key values for a string and map each of them to a different bit. A string is only possibly present when all of its bits are 1 (when there are many strings, different strings may be mapped onto the same bits), whereas if even one bit is 0 the string is definitely absent; this is why the Bloom filter is an approximate solution. Map the first file into the Bloom filter, then check each string of the second file against it (compute the keys of the string, map out the different bits with the different hash algorithms, and if all of them are 1, consider the string part of the intersection of the two files); if any bit is 0, the string is certainly not in the intersection.

2. Exact solution with hash splitting:

Since it is called splitting, as the name implies, a large file is cut into small files. So how do we cut, and on what basis? If we could split identical strings from the two files into small files with the same number, wouldn't finding the intersection be much faster? The answer is yes.

That is exactly the basis of hash splitting. We obtain the key of each string with some string hash algorithm, take it modulo the number of files to split into (say 1000 files, numbered 0~999), and put strings with the same result into the file with that number. Since the strings of both files are split with the same hash algorithm, identical strings fall into files with the same subscript, so we only need to compare files with matching subscripts, as sketched below.

Hash splitting is clearly more efficient than the Bloom filter here, and its time complexity is O(N).
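A minimal sketch of hash splitting for the exact intersection (file names, the one-query-per-line format, and std::hash as the splitting hash are assumptions):

#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Scatter one big query file into `parts` small files by hash(query) % parts.
// (As in problem one, this assumes the open-file limit allows that many files.)
void Split(const std::string& in, const std::string& prefix, size_t parts) {
    std::vector<std::ofstream> out(parts);
    for (size_t i = 0; i < parts; ++i)
        out[i].open(prefix + std::to_string(i) + ".txt");
    std::ifstream big(in);
    std::string q;
    while (std::getline(big, q))
        out[std::hash<std::string>{}(q) % parts] << q << '\n';
}

int main() {
    const size_t kParts = 1000;                    // file numbers 0~999, as above
    Split("a.txt", "a_", kParts);                  // file names are assumptions
    Split("b.txt", "b_", kParts);

    std::ofstream result("intersection.txt");
    for (size_t i = 0; i < kParts; ++i) {
        // Identical strings were sent to the same subscript by the same hash,
        // so only files with matching subscripts need to be compared.
        std::ifstream a("a_" + std::to_string(i) + ".txt");
        std::unordered_set<std::string> seen;
        std::string q;
        while (std::getline(a, q))
            seen.insert(q);

        std::ifstream b("b_" + std::to_string(i) + ".txt");
        while (std::getline(b, q))
            if (seen.erase(q))                     // erase so duplicates print once
                result << q << '\n';
    }
    return 0;
}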

A BloomFilter implementation (to support deletion, each bit position would have to become a reference count):

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Five classic string hash algorithms, used as the k hash functions of the filter.
struct __HashFunc1 {
    size_t BKDRHash(const char* str) {
        size_t hash = 0;
        while (size_t ch = (size_t)*str++)
            hash = hash * 131 + ch;   // can also multiply by 31, 131, 1313, 13131, 131313...
        return hash;
    }
    size_t operator()(const string& str) { return BKDRHash(str.c_str()); }
};

struct __HashFunc2 {
    size_t SDBMHash(const char* str) {
        size_t hash = 0;
        while (size_t ch = (size_t)*str++)
            hash = 65599 * hash + ch; // equivalently: ch + (hash << 6) + (hash << 16) - hash
        return hash;
    }
    size_t operator()(const string& str) { return SDBMHash(str.c_str()); }
};

struct __HashFunc3 {
    size_t RSHash(const char* str) {
        size_t hash = 0;
        size_t magic = 63689;
        while (size_t ch = (size_t)*str++) {
            hash = hash * magic + ch;
            magic *= 378551;
        }
        return hash;
    }
    size_t operator()(const string& str) { return RSHash(str.c_str()); }
};

struct __HashFunc4 {
    size_t APHash(const char* str) {
        size_t hash = 0;
        size_t ch;
        for (long i = 0; (ch = (size_t)*str++); i++) {
            if ((i & 1) == 0)
                hash ^= ((hash << 7) ^ ch ^ (hash >> 3));
            else
                hash ^= (~((hash << 11) ^ ch ^ (hash >> 5)));
        }
        return hash;
    }
    size_t operator()(const string& str) { return APHash(str.c_str()); }
};

struct __HashFunc5 {
    size_t JSHash(const char* str) {
        if (!*str)  // added by myself to ensure that the empty string hashes to 0
            return 0;
        size_t hash = 1315423911;
        while (size_t ch = (size_t)*str++)
            hash ^= ((hash << 5) + ch + (hash >> 2));
        return hash;
    }
    size_t operator()(const string& str) { return JSHash(str.c_str()); }
};

template <class K = string,
          class HashFunc1 = __HashFunc1, class HashFunc2 = __HashFunc2,
          class HashFunc3 = __HashFunc3, class HashFunc4 = __HashFunc4,
          class HashFunc5 = __HashFunc5>
class BloomFilter {
public:
    // Size the bit array at 5 bits per expected element.
    BloomFilter(size_t num) : _bitmap(num * 5, false), _range(num * 5) {}

    void Set(const K& key) {
        size_t hash1 = HashFunc1()(key) % _range;
        size_t hash2 = HashFunc2()(key) % _range;
        size_t hash3 = HashFunc3()(key) % _range;
        size_t hash4 = HashFunc4()(key) % _range;
        size_t hash5 = HashFunc5()(key) % _range;
        _bitmap[hash1] = true;
        _bitmap[hash2] = true;
        _bitmap[hash3] = true;
        _bitmap[hash4] = true;
        _bitmap[hash5] = true;
    }

    // Test() is an assumed completion (the source listing breaks off after Set):
    // all five bits set => possibly present; any bit clear => definitely absent.
    bool Test(const K& key) {
        return _bitmap[HashFunc1()(key) % _range] && _bitmap[HashFunc2()(key) % _range]
            && _bitmap[HashFunc3()(key) % _range] && _bitmap[HashFunc4()(key) % _range]
            && _bitmap[HashFunc5()(key) % _range];
    }

private:
    vector<bool> _bitmap;  // the original used a hand-written BitMap; vector<bool> behaves the same here
    size_t _range;
};
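A short usage sketch (hypothetical file names; the capacity figure is an assumption) showing how the filter above yields problem four's approximate intersection:

#include <fstream>   // in addition to the includes above

int main() {
    BloomFilter<> bf(1000000000);       // ~1 billion queries -> 5 billion bits ≈ 625MB
    ifstream a("a.txt");                // assumed: one query string per line
    string q;
    while (getline(a, q))
        bf.Set(q);                      // map the first file into the filter

    ifstream b("b.txt");
    ofstream out("intersection_approx.txt");
    while (getline(b, q))
        if (bf.Test(q))                 // "present" may be a false positive;
            out << q << '\n';           // "absent" is always correct
    return 0;
}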

Several different string hash algorithms are used in the filter implemented above; readers who want to study these hash algorithms can refer to the following link:

http://www.cnblogs.com/-clq/archive/2012/05/31/2528153.html

Five. Given a dictionary containing N English words and an arbitrary string, design an algorithm to find all the English words that contain this string.

Analysis: Use a dictionary tree (trie). To find the words that meet the requirement, we traverse the trie layer by layer: at each layer, find the positions that match the first letter of the given string, and continue the search downward from there, as sketched below.
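A minimal sketch of the dictionary trie (lowercase words assumed; the sample words are illustrative). Search visits every node; wherever the given string can be spelled downward from a node, all complete words below that match contain the string:

#include <iostream>
#include <memory>
#include <set>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;
    std::unique_ptr<TrieNode> child[26];       // lowercase a-z assumed
};

void Insert(TrieNode* root, const std::string& word) {
    TrieNode* cur = root;
    for (char c : word) {
        int i = c - 'a';
        if (!cur->child[i])
            cur->child[i] = std::make_unique<TrieNode>();
        cur = cur->child[i].get();
    }
    cur->isWord = true;
}

// Collect every complete word stored in the subtree under `node`;
// `word` holds the letters spelled from the root down to `node`.
void CollectWords(const TrieNode* node, std::string& word,
                  std::set<std::string>& out) {
    if (node->isWord) out.insert(word);
    for (int i = 0; i < 26; ++i)
        if (node->child[i]) {
            word.push_back((char)('a' + i));
            CollectWords(node->child[i].get(), word, out);
            word.pop_back();
        }
}

// Visit every node; wherever `pattern` can be spelled downward from a node,
// all complete words below the match contain `pattern` at that offset.
void Search(const TrieNode* node, std::string& word,
            const std::string& pattern, std::set<std::string>& out) {
    const TrieNode* cur = node;
    size_t matched = 0;
    for (char c : pattern) {
        if (!cur->child[c - 'a']) break;
        cur = cur->child[c - 'a'].get();
        word.push_back(c);
        ++matched;
    }
    if (matched == pattern.size())
        CollectWords(cur, word, out);
    word.erase(word.size() - matched);         // undo the matched letters

    for (int i = 0; i < 26; ++i) {             // keep walking layer by layer
        if (node->child[i]) {
            word.push_back((char)('a' + i));
            Search(node->child[i].get(), word, pattern, out);
            word.pop_back();
        }
    }
}

int main() {
    TrieNode root;
    std::vector<std::string> dict = {"internet", "interview", "winter", "data"};
    for (const auto& w : dict)
        Insert(&root, w);

    std::string word;
    std::set<std::string> hits;
    Search(&root, word, "inter", hits);
    for (const auto& w : hits)
        std::cout << w << '\n';                // internet, interview, winter
    return 0;
}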

The big data problems above are drawn from chapter seven of the Nine Chapters Algorithm course.
