Summary of Approaches to Big Data Problems

Last Update:2016-11-08 Source: Internet

Author: User

1) Given a log file larger than 100 GB that stores IP addresses, design an algorithm to find the most frequently occurring IP address. (Follow-ups: how would you find the top K IPs, and how would you implement this with Linux system commands?)

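Hash bucketing method:

Split the 100 GB file into 1000 smaller files, mapping each IP address to a file by file_id = hash(IP) % 1000. Every occurrence of a given IP then lands in the same file, so the counts within each file are exact.

In each file, count the occurrences of each IP and find the most frequent one, then compare the per-file winners; the one with the highest count is the answer.

For the top K IPs, use the same hash bucketing to distribute the data across files;

compute the top K within each file, then merge the per-file results (for example with a min-heap of size K) to get the global top K.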
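
The bucketing and merge steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: it assumes the log has one IP per line, uses `zlib.crc32` as the hash, and the function name `top_k_ips` and the default of 64 buckets are choices made here for the example (the text uses 1000; that many simultaneously open files can exceed the file descriptor limit on some systems).

```python
import heapq
import os
import tempfile
import zlib
from collections import Counter


def top_k_ips(log_path, k, num_buckets=64):
    """Return the k most frequent IPs as (count, ip) pairs, most frequent first.

    Assumes one IP address per line in the log file.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}") for i in range(num_buckets)]
        files = [open(p, "w") for p in paths]
        try:
            # Pass 1: route every IP to the bucket file chosen by its hash,
            # so all occurrences of one IP end up in the same file.
            with open(log_path) as log:
                for line in log:
                    ip = line.strip()
                    if ip:
                        bucket = zlib.crc32(ip.encode()) % num_buckets
                        files[bucket].write(ip + "\n")
        finally:
            for f in files:
                f.close()

        # Pass 2: count each bucket on its own and keep a global min-heap
        # that never holds more than k (count, ip) entries.
        heap = []
        for p in paths:
            with open(p) as bucket_file:
                counts = Counter(line.strip() for line in bucket_file)
            for ip, count in counts.items():
                if len(heap) < k:
                    heapq.heappush(heap, (count, ip))
                elif (count, ip) > heap[0]:
                    heapq.heapreplace(heap, (count, ip))
        return sorted(heap, reverse=True)
```

Calling `top_k_ips(path, 1)` gives the single most frequent IP; only one bucket's counts are held in memory at a time.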

2) Given billions of integers, design an algorithm to find the integers that appear only once.

Hash bucketing: map the billions of integers into different intervals (buckets), then within each interval find the integers that occur only once.
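
A small Python sketch of the interval idea follows. It is one possible variant under stated assumptions: instead of writing each interval to its own file, it rescans the input once per interval, and the hypothetical `read_numbers` callable stands in for "open the file and iterate over it again". The values are assumed to lie in the range [lo, hi).

```python
from collections import Counter


def integers_appearing_once(read_numbers, lo, hi, num_buckets):
    """Return the integers in [lo, hi) that occur exactly once.

    read_numbers() must return a fresh iterator over all the integers
    each time it is called (for example by reopening the input file).
    """
    width = -(-(hi - lo) // num_buckets)  # ceiling division
    result = []
    for b in range(num_buckets):
        start = lo + b * width
        end = start + width
        # Only the integers of the current interval are counted in memory.
        counts = Counter(x for x in read_numbers() if start <= x < end)
        result.extend(x for x, c in counts.items() if c == 1)
    return result
```

More buckets mean less memory per pass but more passes over the data; splitting into per-interval files trades those extra passes for disk space.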

3) Given two files, each containing billions of integers, and only 1 GB of memory, how do you find the intersection of the two files?

Scan each integer and record whether it has occurred; to save memory, use a bitmap (bucketing + bitmap). If the integers are 32-bit, a bitmap can be used directly. There are 2^32 possible integers in total, and each one is represented by two bits: 00 means the integer appears in neither file, 10 means it appeared in file 1, 01 means it appeared in file 2, and 11 means it appeared in both files. This requires 2^32 * 2 / 8 = 1 GB of memory. Traverse all the integers in both files, then look for the entries equal to 11 in the bitmap; the corresponding integers are the intersection of the two files. The whole job runs in linear time.
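
The two-bits-per-integer bitmap can be sketched in Python on a small universe. The class name `TwoBitMap`, the function `intersect`, and the `universe` parameter are illustrative choices made here; for real 32-bit input the universe would be 2^32 and the `bytearray` would occupy 1 GB.

```python
IN_FILE1 = 0b10  # bit set when the integer was seen in file 1
IN_FILE2 = 0b01  # bit set when the integer was seen in file 2


class TwoBitMap:
    """Packs one 2-bit state per integer, four states per byte."""

    def __init__(self, size):
        self.bits = bytearray((size * 2 + 7) // 8)

    def get(self, i):
        byte, shift = divmod(i * 2, 8)
        return (self.bits[byte] >> shift) & 0b11

    def set(self, i, value):
        byte, shift = divmod(i * 2, 8)
        cleared = self.bits[byte] & ~(0b11 << shift) & 0xFF
        self.bits[byte] = cleared | (value << shift)


def intersect(nums1, nums2, universe):
    """Return the sorted integers in [0, universe) present in both inputs."""
    bitmap = TwoBitMap(universe)
    for x in nums1:
        bitmap.set(x, bitmap.get(x) | IN_FILE1)
    for x in nums2:
        bitmap.set(x, bitmap.get(x) | IN_FILE2)
    # State 11 means both bits are set: the integer occurs in both files.
    return [x for x in range(universe) if bitmap.get(x) == 0b11]
```

Duplicates inside one file are harmless here, because setting an already-set bit changes nothing.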

4) Given one file with billions of ints and 1 GB of memory, design an algorithm to find all integers that occur 2 or more times.

Bitmap extension: use 2 bits to represent the state of each integer: 0 means it has not appeared, 1 means it appeared once, and 2 means it appeared 2 or more times.
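
As a rough Python illustration of the 2-bit state idea, the sketch below treats each 2-bit slot as a counter that saturates at 2. The function name and the small `universe` parameter are assumptions for the example; the helper functions only pack and unpack the 2-bit slots.

```python
def _get(states, i):
    byte, shift = divmod(i * 2, 8)
    return (states[byte] >> shift) & 0b11


def _set(states, i, value):
    byte, shift = divmod(i * 2, 8)
    cleared = states[byte] & ~(0b11 << shift) & 0xFF
    states[byte] = cleared | (value << shift)


def integers_seen_at_least_twice(numbers, universe):
    """Return the sorted integers in [0, universe) that occur 2+ times."""
    states = bytearray((universe * 2 + 7) // 8)  # 2 bits per integer
    for x in numbers:
        state = _get(states, x)
        if state < 2:  # 0 -> 1 -> 2, then stay at 2
            _set(states, x, state + 1)
    return [x for x in range(universe) if _get(states, x) == 2]
```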

5) Given two files, each containing millions of queries, and only 1 GB of memory, how do you find the intersection of the two files? Give both an exact algorithm and an approximate algorithm.

Exact algorithm: hash bucketing

Hash the queries in both files into N small files, tagging each query with its source file;

find the queries common to both sources within each small file;

collect the common queries found across all the small files.
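
The exact approach might look like this in Python. To keep the sketch short, in-memory lists stand in for the N small files, and `zlib.crc32` is an arbitrary choice of hash; the function name and the source tags "a" and "b" are made up for the example.

```python
import zlib


def exact_intersection(queries_a, queries_b, num_buckets=8):
    """Return the set of queries present in both inputs."""
    # Each bucket plays the role of one small file; every entry is
    # tagged with the file it came from.
    buckets = [[] for _ in range(num_buckets)]
    for source, queries in (("a", queries_a), ("b", queries_b)):
        for q in queries:
            buckets[zlib.crc32(q.encode()) % num_buckets].append((source, q))

    common = set()
    for bucket in buckets:
        # Equal queries always hash to the same bucket, so each bucket
        # can be resolved independently of the others.
        seen_a = {q for source, q in bucket if source == "a"}
        common.update(q for source, q in bucket if source == "b" and q in seen_a)
    return common
```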

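Approximate algorithm: Bloom filter. Build a Bloom filter from the queries of one file, then test each query of the other file against it; the matches may include a small number of false positives.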
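
A minimal Bloom filter sketch in Python is shown below, assuming string queries. Deriving the k hash functions from seeded SHA-256 digests is just one convenient choice, and the sizes (`num_bits`, `num_hashes`) are arbitrary defaults for illustration, not recommendations.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One bit position per seeded hash of the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def approximate_intersection(queries_a, queries_b, num_bits=1 << 16, num_hashes=4):
    """Return queries of b that the filter built from a reports as present.

    The result contains every true common query, and may also contain
    false positives.
    """
    bloom = BloomFilter(num_bits, num_hashes)
    for q in queries_a:
        bloom.add(q)
    return {q for q in queries_b if q in bloom}
```

The filter never misses a query that really is in the first file, so the result is a superset of the true intersection.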

6) How can a Bloom filter be extended to support deleting elements?

Extend each bit of the Bloom filter into a counter that records how many elements have been hashed to that position. On deletion, decrement the counters at the element's positions; a position truly becomes 0 only when its reference count drops to 0.
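
The counter-based extension can be sketched in Python as follows. The class name and the seeded SHA-256 hashing are assumptions for the example; a plain list of integers stands in for the compact fixed-width counters a real implementation would likely use.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, num_slots, num_hashes):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counters = [0] * num_slots  # one reference count per position

    def _positions(self, item):
        # Distinct positions only, so add and remove stay symmetric.
        return {
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big"
            ) % self.num_slots
            for seed in range(self.num_hashes)
        }

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def __contains__(self, item):
        return all(self.counters[p] > 0 for p in self._positions(item))

    def remove(self, item):
        """Decrement the item's counters; return False if it is not present."""
        if item not in self:
            return False
        for p in self._positions(item):
            self.counters[p] -= 1
        return True
```

Removing an element that was never added but happens to test as present (a false positive) would corrupt the counts, which is a known limitation of this scheme.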

7) How can a Bloom filter be extended to support counting?

Extend each bit of the Bloom filter into a counter. Each inserted element adds 1 to every position it maps to, which supports counting. An element's count is the minimum value over all the positions it maps to.
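
A short Python sketch of the counting variant follows, with the same illustrative assumptions as any toy Bloom filter here (seeded SHA-256 hashes, a plain list of counters, hypothetical names). The estimate can be too high when other elements share positions, but it is never too low.

```python
import hashlib


class CountingFilter:
    def __init__(self, num_slots, num_hashes):
        self.num_slots = num_slots
        self.num_hashes = num_hashes
        self.counters = [0] * num_slots

    def _positions(self, item):
        # Distinct positions, so each insert adds exactly 1 per position.
        return {
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode()).digest()[:8], "big"
            ) % self.num_slots
            for seed in range(self.num_hashes)
        }

    def add(self, item):
        for p in self._positions(item):
            self.counters[p] += 1

    def count(self, item):
        """Estimated number of times item was added (an upper bound)."""
        return min(self.counters[p] for p in self._positions(item))
```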

8) Given thousands of files, each 1 KB to 100 MB in size, and n words, design an algorithm that finds, for each word, all the files containing it. You only have 100 KB of memory.

0: Use a file, info, to store the n words and the list of files containing each of them.
1: First divide the n words into x parts. For each part, build a Bloom filter (memory may be too small to hold a single Bloom filter for all n words). Save all the generated filters to external storage in a file, filter.
2: Divide memory into two buffers: one holds the Bloom filter currently loaded, the other holds the file data being read (the read buffer can be synchronized using a producer-consumer model). Large files can be split into smaller files, but you must record which large file each small file came from.
3: For each word read from a file, check whether the Bloom filter in memory contains it. If not, load the next Bloom filter from the filter file into memory, until one contains the word or all filters have been tried. If one contains it, update the info file. Repeat until all data is processed, then delete the filter file.

Remarks:
1: About the Bloom filter: in essence it is a bitmap that stores the hash values of strings.
2: Some details still need consideration, such as repeated strings causing repeated computation.

9) Given a dictionary containing N English words and an arbitrary input string, design an algorithm to find all the English words that contain this string.

Build an inverted index keyed by letter: for each letter, the index stores the words that contain it and the position of the letter within each word. At query time, look up the posting list for each character of the input string, intersect the lists by word, and require the positions to be contiguous.
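
One way to sketch this letter-level inverted index in Python is shown below. The function names are made up for the example, and storing postings as sets of (word, position) pairs is a simplification; it keeps the contiguity check to a single membership test.

```python
from collections import defaultdict


def build_index(words):
    """Map each character to the set of (word, position) pairs where it occurs."""
    index = defaultdict(set)
    for word in words:
        for pos, ch in enumerate(word):
            index[ch].add((word, pos))
    return index


def words_containing(index, s):
    """Return the set of indexed words that contain s as a substring."""
    if not s:
        return set()
    # Candidate match starts: every occurrence of the first character.
    starts = set(index.get(s[0], ()))
    for offset, ch in enumerate(s[1:], start=1):
        postings = index.get(ch, ())
        # Keep a start only if the next character follows contiguously.
        starts = {(w, p) for (w, p) in starts if (w, p + offset) in postings}
        if not starts:
            break
    return {w for w, _ in starts}
```

For example, after `index = build_index(["banana", "bandana", "cabana", "apple"])`, the query `words_containing(index, "pp")` returns `{"apple"}`.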

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
